i am trying to use Beam to read a csv and send data to postgres.
But the pipeline is failing due to a conversion mismatch. note that this pipeline work when the 2 column are of type int and fail when the type of column contains a string.
here one of the things that i tried.
from past.builtins import unicode
ExampleRow = typing.NamedTuple('ExampleRow',[('id',int),('name',unicode)])
beam_df = (pipeline | 'Read CSV' >> beam.dataframe.io.read_csv('path.csv').with_output_types(ExampleRow))
beam_df2 = (convert.to_pcollection(beam_df) | beam.Map(print) |
WriteToJdbc(
table_name=table_name,
jdbc_url=jdbc_url,
driver_class_name = 'org.postgresql.Driver',
statement="insert into tablr values(?,?);",
username=username,
password=password,
)
)
result = pipeline.run()
result.wait_until_finish()
I tried also to add an urn to convert the str python type to varchar or unicode but this don't seems to work also
from apache_beam.typehints.schemas import LogicalType
#LogicalType.register_logical_type
class db_str(LogicalType):
#classmethod
def urn(cls):
return "beam:logical_type:javasdk:v1"
#classmethod
def language_type(cls):
return unicode
def to_language_type(self, value):
return unicode(value)
def to_representation_type(self, value):
return unicode(value)
ADD: this is the print result :
BeamSchema_f0d95d64_95c7_43ba_8a04_ac6a0b7352d9(id=21, nom='nom21')
BeamSchema_f0d95d64_95c7_43ba_8a04_ac6a0b7352d9(id=22, nom='nom22')
BeamSchema_f0d95d64_95c7_43ba_8a04_ac6a0b7352d9(id=21, nom='nom21')
BeamSchema_f0d95d64_95c7_43ba_8a04_ac6a0b7352d9(id=22, nom='nom22')
the problem comes from the WriteToJdbc function and the 'nom' column.
any idea how to make this work ?
I think your problem is due to you have BeamSchema structure instead of expected NamedTuple in your output PCollection.
Also, according to the documentation, a code instruction is missing in your example coders.registry.register_coder(ExampleRow, coders.RowCoder) :
apache_beam_io_jdbc doc
ExampleRow = typing.NamedTuple('ExampleRow',
[('id', int), ('name', unicode)])
coders.registry.register_coder(ExampleRow, coders.RowCoder)
with TestPipeline() as p:
_ = (
p
| beam.Create([ExampleRow(1, 'abc')])
.with_output_types(ExampleRow)
| 'Write to jdbc' >> WriteToJdbc(
driver_class_name='org.postgresql.Driver',
jdbc_url='jdbc:postgresql://localhost:5432/example',
username='postgres',
password='postgres',
statement='INSERT INTO example_table VALUES(?, ?)',
))
The following code in your snippet, doesn't set ExampleRow NamedTuple as expected, because your print indicates the type is BeamSchema :
beam_df = (pipeline | 'Read CSV' >> beam.dataframe.io.read_csv('path.csv').with_output_types(ExampleRow))
Try with this code before :
from past.builtins import unicode
ExampleRow = typing.NamedTuple('ExampleRow',[('id',int),('name',unicode)])
# Register coder here
coders.registry.register_coder(ExampleRow, coders.RowCoder)
beam_df = (pipeline | 'Read CSV' >> beam.dataframe.io.read_csv('path.csv').with_output_types(ExampleRow))
If it doesn't works, you need to find a way to transform your BeamSchema to ExampleRow NamedTuple in a DoFn :
def convert_beam_schema_to_named_tuple(beam_schema) -> ExampleRow :
# Your logic to transform Beam Schema to ExampleRow NamedTuple
from past.builtins import unicode
ExampleRow = typing.NamedTuple('ExampleRow',[('id',int),('name',unicode)])
beam_df = (pipeline | 'Read CSV' >> beam.dataframe.io.read_csv('path.csv').with_output_types(ExampleRow))
beam_df2 = (convert.to_pcollection(beam_df)
| beam.Map(print)
| beam.Map(convert_beam_schema_to_named_tuple)
| WriteToJdbc(
table_name=table_name,
jdbc_url=jdbc_url,
driver_class_name = 'org.postgresql.Driver',
statement="insert into tablr values(?,?);",
username=username,
password=password,
)
)
result = pipeline.run()
result.wait_until_finish()
At the end, if you have issue with Beam schema and Beam dataframe, you can read the CSV file directly with Beam IO and work with PCollection
You can check this link : csv-into-a-dictionary-in-apache-beam
Related
I have dataflow pipeline, it's in Python and this is what it is doing:
Read Message from PubSub. Messages are zipped protocol buffer. One Message receive on a PubSub contain multiple type of messages. See the protocol parent's message specification below:
message BatchEntryPoint {
/**
* EntryPoint
*
* Description: Encapsulation message
*/
message EntryPoint {
// Proto Message
google.protobuf.Any proto = 1;
// Timestamp
google.protobuf.Timestamp timestamp = 4;
}
// Array of EntryPoint messages
repeated EntryPoint entrypoints = 1;
}
So, to explain a bit better, I have several protobuf messages. Each message must be packed in the proto field of the EntryPoint message, we are sending several messages at once because of MQTT limitations, that's why we then use a repeated field pointing to EntryPoint message on BatchEntryPoint.
Parsing the received messages.
Nothing fancy here, just unzipping and unserializing the message we just read from the PubSub. to get 'humain readable' data.
For Loop on BatchEntryPoint to evaluate each EntryPoint messages.
As Each messages on BatchEntryPoint can have different type, we need to process them differently
Parsed message data
Doing different process to get all information I need and format it to a BigQuery readable format
Write data to bigQuery
This is where my 'trouble' begin, so my code work but it is very dirty in my opinion and hardly maintainable.
There is two things to be aware of.
Each message's type can be send to 3 different datasets, a r&d dataset, a dev dataset and a production dataset.
let's say I have a message named System.
It could go to:
my-project:rd_dataset.system
my-project:dev_dataset.system
my-project:prod_dataset.system
So this is what I am doing now:
console_records | 'Write to Console BQ' >> beam.io.WriteToBigQuery(
lambda e: 'my-project:rd_dataset.table1' if dataset_is_rd_table1(e) else (
'my-project:dev_dataset.table1' if dataset_is_dev_table1(e) else (
'my-project:prod_dataset.table1' if dataset_is_prod_table1(e) else (
'my-project:rd_dataset.table2' if dataset_is_rd_table2(e) else (
'my-project:dev_dataset.table2' if dataset_is_dev_table2(e) else (
...) else 0
I have more than 30 different type of messages, making more of 90 lines for inserting data to big query.
Here is what a dataset_is_..._tableX method looks like:
def dataset_is_rd_messagestype(element) -> bool:
""" check if env is rd for message's type message """
valid: bool = False
is_type = check_element_type(element, 'MessagesType')
if is_type:
valid = dataset_is_rd(element)
return valid
check_element_type Check that the message has the right type (ex: System).
dataset_is_rd looks like this:
def dataset_is_rd(element) -> bool:
""" Check if dataset should be RD from registry id """
if element['device_registry_id'] == 'rd':
del element['device_registry_id']
del element['bq_type']
return True
return False
The element as a key indicating us on which dataset we must send the message.
SO this is working as expected, But I wish I could do cleaner code and maybe reduce the amount of code to change in case of adding or deleting a type of message.
Any ideas?
How about using TaggedOutput.
Can you write something like this instead:
def dataset_type(element) -> bool:
""" Check if dataset should be RD from registry id """
dev_registry = element['device_registry_id']
del element['device_registry_id']
del element['bq_type']
table_type = get_element_type(element, 'MessagesType')
return 'my-project:%s_dataset.table%d' % (dev_registry, table_type)
And use that as the table lambda that you pass to BQ?
So I manage to create code to insert data to dynamic table by crafting the table name dynamically.
This is not perfect because I have to modify the element I pass to the method, however I am still very happy with the result, it has clean up my code from hundreds of line. If I have a new table, adding it would take one line on an array compare to 6 line in the pipeline before.
Here is my solution:
def batch_pipeline(pipeline):
console_message = (
pipeline
| 'Get console\'s message from pub/sub' >> beam.io.ReadFromPubSub(
subscription='sub1',
with_attributes=True)
)
common_message = (
pipeline
| 'Get common\'s message from pub/sub' >> beam.io.ReadFromPubSub(
subscription='sub2',
with_attributes=True)
)
jetson_message = (
pipeline
| 'Get jetson\'s message from pub/sub' >> beam.io.ReadFromPubSub(
subscription='sub3',
with_attributes=True)
)
message = (console_message, common_message, jetson_message) | beam.Flatten()
clear_message = message | beam.ParDo(GetClearMessage())
console_bytes = clear_message | beam.ParDo(SetBytesData())
console_bytes | 'Write to big query back up table' >> beam.io.WriteToBigQuery(
lambda e: write_to_backup(e)
)
records = clear_message | beam.ParDo(GetProtoData())
gps_records = clear_message | 'Get GPS Data' >> beam.ParDo(GetProtoData())
parsed_gps = gps_records | 'Parse GPS Data' >> beam.ParDo(ParseGps())
if parsed_gps:
parsed_gps | 'Write to big query gps table' >> beam.io.WriteToBigQuery(
lambda e: write_gps(e)
)
records | 'Write to big query table' >> beam.io.WriteToBigQuery(
lambda e: write_to_bq(e)
)
So the pipeline is reading from 3 different pub sub, extracting the data and writing to big query.
The structure of an element use by WriteToBigQuery looks like this:
obj = {
'data': data_to_write_on_bq,
'registry_id': data_needed_to_craft_table_name,
'gcloud_id': data_to_write_on_bq,
'proto_type': data_needed_to_craft_table_name
}
and then one of my method used on the lambda on WriteToBigQuery looks like this:
def write_to_bq(e):
logging.info(e)
element = copy(e)
registry = element['registry_id']
logging.info(registry)
dataset = set_dataset(registry) # set dataset name, knowing the registry, this is to set the environment (dev/prod/rd/...)
proto_type = element['proto_type']
logging.info('Proto Type %s', proto_type)
table_name = reduce(lambda x, y: x + ('_' if y.isupper() else '') + y, proto_type).lower()
full_table_name = f'my_project:{dataset}.{table_name}'
logging.info(full_table_name)
del e['registry_id']
del e['proto_type']
return full_table_name
And that's it, after 3 days of trouble !!
I am trying to create my first pipleine in dataflow, I have the same code runnign when i execute using the interactive beam runner but on dataflow I get all sort of errors, which are not making much sense to me.
I am getting json from pub sub which is of the following format.
{"timestamp":1589992571906,"lastPageVisited":"https://kickassdataprojects.com/simple-and-complete-tutorial-on-simple-linear-regression/","pageUrl":"https://kickassdataprojects.com/","pageTitle":"Helping%20companies%20and%20developers%20create%20awesome%20data%20projects%20%7C%20Data%20Engineering/%20Data%20Science%20Blog","eventType":"Pageview","landingPage":0,"referrer":"direct","uiud":"31af5f22-4cc4-48e0-9478-49787dd5a19f","sessionId":322371}
Here is the code of my pipeline.
from __future__ import absolute_import
import apache_beam as beam
#from apache_beam.runners.interactive import interactive_runner
#import apache_beam.runners.interactive.interactive_beam as ib
import google.auth
from datetime import timedelta
import json
from datetime import datetime
from apache_beam import window
from apache_beam.transforms.trigger import AfterWatermark, AfterProcessingTime, AccumulationMode, AfterCount
from apache_beam.options.pipeline_options import GoogleCloudOptions
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.pipeline_options import SetupOptions
from apache_beam.options.pipeline_options import StandardOptions
import argparse
import logging
from time import mktime
def setTimestamp(elem):
from apache_beam import window
yield window.TimestampedValue(elem, elem['timestamp'])
def createTuples(elem):
yield (elem["sessionId"], elem)
class WriteToBigQuery(beam.PTransform):
"""Generate, format, and write BigQuery table row information."""
def __init__(self, table_name, dataset, schema, project):
"""Initializes the transform.
Args:
table_name: Name of the BigQuery table to use.
dataset: Name of the dataset to use.
schema: Dictionary in the format {'column_name': 'bigquery_type'}
project: Name of the Cloud project containing BigQuery table.
"""
# TODO(BEAM-6158): Revert the workaround once we can pickle super() on py3.
#super(WriteToBigQuery, self).__init__()
beam.PTransform.__init__(self)
self.table_name = table_name
self.dataset = dataset
self.schema = schema
self.project = project
def get_schema(self):
"""Build the output table schema."""
return ', '.join('%s:%s' % (col, self.schema[col]) for col in self.schema)
def expand(self, pcoll):
return (
pcoll
| 'ConvertToRow' >>
beam.Map(lambda elem: {col: elem[col]
for col in self.schema})
| beam.io.WriteToBigQuery(
self.table_name, self.dataset, self.project, self.get_schema()))
class ParseSessionEventFn(beam.DoFn):
"""Parses the raw game event info into a Python dictionary.
Each event line has the following format:
username,teamname,score,timestamp_in_ms,readable_time
e.g.:
user2_AsparagusPig,AsparagusPig,10,1445230923951,2015-11-02 09:09:28.224
The human-readable time string is not used here.
"""
def __init__(self):
# TODO(BEAM-6158): Revert the workaround once we can pickle super() on py3.
#super(ParseSessionEventFn, self).__init__()
beam.DoFn.__init__(self)
def process(self, elem):
#timestamp = mktime(datetime.strptime(elem["timestamp"], "%Y-%m-%d %H:%M:%S").utctimetuple())
elem['sessionId'] = int(elem['sessionId'])
elem['landingPage'] = int(elem['landingPage'])
yield elem
class AnalyzeSessions(beam.DoFn):
"""Parses the raw game event info into a Python dictionary.
Each event line has the following format:
username,teamname,score,timestamp_in_ms,readable_time
e.g.:
user2_AsparagusPig,AsparagusPig,10,1445230923951,2015-11-02 09:09:28.224
The human-readable time string is not used here.
"""
def __init__(self):
# TODO(BEAM-6158): Revert the workaround once we can pickle super() on py3.
#super(AnalyzeSessions, self).__init__()
beam.DoFn.__init__(self)
def process(self, elem, window=beam.DoFn.WindowParam):
sessionId = elem[0]
uiud = elem[1][0]["uiud"]
count_of_events = 0
pageUrl = []
window_end = window.end.to_utc_datetime()
window_start = window.start.to_utc_datetime()
session_duration = window_end - window_start
for rows in elem[1]:
if rows["landingPage"] == 1:
referrer = rows["refererr"]
pageUrl.append(rows["pageUrl"])
return {
"pageUrl":pageUrl,
"eventType":"pageview",
"uiud":uiud,
"sessionId":sessionId,
"session_duration": session_duration,
"window_start" : window_start
}
def run(argv=None, save_main_session=True):
parser = argparse.ArgumentParser()
parser.add_argument('--topic', type=str, help='Pub/Sub topic to read from')
parser.add_argument(
'--subscription', type=str, help='Pub/Sub subscription to read from')
parser.add_argument(
'--dataset',
type=str,
required=True,
help='BigQuery Dataset to write tables to. '
'Must already exist.')
parser.add_argument(
'--table_name',
type=str,
default='game_stats',
help='The BigQuery table name. Should not already exist.')
parser.add_argument(
'--fixed_window_duration',
type=int,
default=60,
help='Numeric value of fixed window duration for user '
'analysis, in minutes')
parser.add_argument(
'--session_gap',
type=int,
default=5,
help='Numeric value of gap between user sessions, '
'in minutes')
parser.add_argument(
'--user_activity_window_duration',
type=int,
default=30,
help='Numeric value of fixed window for finding mean of '
'user session duration, in minutes')
args, pipeline_args = parser.parse_known_args(argv)
session_gap = args.session_gap * 60
options = PipelineOptions(pipeline_args)
# Set the pipeline mode to stream the data from Pub/Sub.
options.view_as(StandardOptions).streaming = True
options.view_as( StandardOptions).runner= 'DataflowRunner'
options.view_as(SetupOptions).save_main_session = save_main_session
p = beam.Pipeline(options=options)
lines = (p
| beam.io.ReadFromPubSub(
subscription="projects/phrasal-bond-274216/subscriptions/rrrr")
| 'decode' >> beam.Map(lambda x: x.decode('utf-8'))
| beam.Map(lambda x: json.loads(x))
| beam.ParDo(ParseSessionEventFn())
)
next = ( lines
| 'AddEventTimestamps' >> beam.Map(setTimestamp)
| 'Create Tuples' >> beam.Map(createTuples)
| beam.Map(print)
| 'Window' >> beam.WindowInto(window.Sessions(15))
| 'group by key' >> beam.GroupByKey()
| 'analyze sessions' >> beam.ParDo(AnalyzeSessions())
| 'WriteTeamScoreSums' >> WriteToBigQuery(
args.table_name,
{
"uiud":'STRING',
"session_duration": 'INTEGER',
"window_start" : 'TIMESTAMP'
},
options.view_as(GoogleCloudOptions).project)
)
next1 = ( next
| 'Create Tuples' >> beam.Map(createTuples)
| beam.Map(print)
)
result = p.run()
# result.wait_till_termination()
if __name__ == '__main__':
logging.getLogger().setLevel(logging.INFO)
run()
In the following code, I get the following error 'generator' object is not subscriptable, when I try to create tuples in my pipeline. I get it using yield is creating the generator object, even return doesn't work it just beaks my pipeline.
apache_beam.coders.coder_impl.SequenceCoderImpl.get_estimated_size_and_observables File "sessiontest1.py", line 23, in createTuples TypeError: 'generator' object is not subscriptable [while running 'generatedPtransform-148']
Here is the code I use to execute the pipeline.
python3 sessiontest1.py --project phrasal-bond-xxxxx --region us-central1 --subscription projects/phrasal-bond-xxxxx/s
ubscriptions/xxxxxx --dataset sessions_beam --runner DataflowRunner --temp_location gs://webevents/sessions --service_account_email-xxxxxxxx-
compute#developer.gserviceaccount.com
Any help on this would be appreciated. Thanks guys, again first time working on dataflow, so not sure what I am missing here.
Other errors I was getting before that are sorted now:-
a) I get the error that widow is not defined from the line name beam.Map(lambda elem: window.TimestampedValue(elem, elem['timestamp'])) .
If I go beam.window then it says beam is not defined, according to me beam should be provided by dataflow,
NameError: name 'window' is not defined [while running 'generatedPtransform-3820']
You just need to import the modules in the function itself.
Getting a 'generator' object is not subscriptable error on createTuples indicates that when you try to do elem["sessionID"], the elem is already a generator. The previous transform you do is setTimestamp, which is also using yield and therefore outputting a generator that gets passed as the element to createTuples.
The solution here is to implement setTimestamp and createTuples with return instead of yield. Return the element you want to receive in the following transform.
You should set save_main_session = True in your code. ( try to uncomment that line in your code). See more about NameError here : https://cloud.google.com/dataflow/docs/resources/faq
I am trying to merge 2 JSON inputs (this example is from a file, but it will be from a Google Pub Sub input later) from these:
orderID.json:
{"orderID":"test1","orderPacked":"Yes","orderSubmitted":"Yes","orderVerified":"Yes","stage":1}
combined.json:
{"barcode":"95590","name":"Ash","quantity":6,"orderID":"test1"}
{"barcode":"95591","name":"Beat","quantity":6,"orderID":"test1"}
{"barcode":"95592","name":"Cat","quantity":6,"orderID":"test1"}
{"barcode":"95593","name":"Dog","quantity":6,"orderID":"test2"}
{"barcode":"95594","name":"Scar","quantity":6,"orderID":"test2"}
To something like this (using orderID as the unique and primary key):
output.json:
{"orderID":"test1","orderPacked":"Yes","orderSubmitted":"Yes","orderVerified":"Yes","stage":1,"barcode":"95590","name":"Ash","quantity":6}
{"orderID":"test1","orderPacked":"Yes","orderSubmitted":"Yes","orderVerified":"Yes","stage":1,"barcode":"95591","name":"Beat","quantity":6}
{"orderID":"test1","orderPacked":"Yes","orderSubmitted":"Yes","orderVerified":"Yes","stage":1,"barcode":"95592","name":"Cat","quantity":6}
I have my codes like this now which was adapted from join two json in Google Cloud Platform with dataflow
from __future__ import absolute_import
import argparse
import apache_beam as beam
import json
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.pipeline_options import SetupOptions
from apache_beam.options.pipeline_options import StandardOptions
from google.api_core import datetime_helpers
from google.api_core.exceptions import InternalServerError
from google.api_core.exceptions import ServiceUnavailable
from google.api_core.exceptions import TooManyRequests
from google.cloud import bigquery
def run(argv=None):
"""Build and run the pipeline."""
parser = argparse.ArgumentParser()
parser.add_argument(
'--topic',
type=str,
help='Pub/Sub topic to read from')
parser.add_argument(
'--topic2',
type=str,
help='Pub/Sub topic to match with'
)
parser.add_argument(
'--output',
help=('Output local filename'))
args, pipeline_args = parser.parse_known_args(argv)
options = PipelineOptions(pipeline_args)
options.view_as(SetupOptions).save_main_session = True
options.view_as(StandardOptions).streaming = True
p = beam.Pipeline(options=options)
orderID = (p | 'read from text1' >> beam.io.ReadFromText('orderID.json')
#'Read from orderID PubSub' >> beam.io.ReadFromPubSub(topic=args.topic2)
| 'Parse JSON to Dict' >> beam.Map(lambda e: json.loads(e))
| 'key_orderID' >> beam.Map(lambda orders: (orders['orderID'], orders))
)
orders_si = beam.pvalue.AsDict(orderID)
orderDetails = (p | 'read from text' >> beam.io.ReadFromText('combined.json')
| 'Parse JSON to Dict1' >> beam.Map(lambda e: json.loads(e)))
#'Read from PubSub' >> beam.io.ReadFromPubSub(topic=args.topic))
def join_orderID_orderDetails(order, order_dict):
return order.update(order_dict[order['orderID']])
joined_dicts = orderDetails | beam.Map(join_orderID_orderDetails, order_dict=orders_si)
joined_dicts | beam.io.WriteToText('beam.output')
p.run()
#result.wait_until_finish()
if __name__ == '__main__':
run()
But my output now in beam.output just shows:
None
None
None
Can someone point out to me what I am doing wrong about this ?
The question that is different from the reported duplicate post is:
Why are my results "None"?
What am I doing wrong here?
I suspect these are the issues:
"order" variable - is that correctly referenced in "join_orderID_orderDetails"
List item "join_orderID_orderDetails" in "join_dicts? - is that correctly referneced too?
Try the below, Hope this will help you a little.
Here i have used an array of your order and combined, instead of using a file.
order = [{"orderID":"test1","orderPacked":"Yes","orderSubmitted":"Yes","orderVerified":"Yes","stage":1}]
combined = [
{"barcode":"95590","name":"Ash","quantity":6,"orderID":"test1"},
{"barcode":"95591","name":"Beat","quantity":6,"orderID":"test1"},
{"barcode":"95592","name":"Cat","quantity":6,"orderID":"test1"},
{"barcode":"95593","name":"Dog","quantity":6,"orderID":"test2"},
{"barcode":"95594","name":"Scar","quantity":6,"orderID":"test2"}
]
def joinjson(repl, tobeCombined):
newarr = []
for data in tobeCombined:
replData = getOrderData(repl,data['orderID'])
if replData is not None:
data.update(replData)
newarr.append(data)
return newarr
def getOrderData(order, orderID):
for data in order:
print("Data OrderID : ",data['orderID'])
if data['orderID'] == orderID:
return data
print(joinjson(order,combined))
I am trying to follow the design pattern for Slowly Changing Lookup Cache (https://cloud.google.com/blog/products/gcp/guide-to-common-cloud-dataflow-use-case-patterns-part-1) for a streaming pipeline using the Python SDK for Apache Beam on DataFlow.
Our reference table for the lookup cache sits in BigQuery, and we are able to read and pass it in as a Side Input to the ParDo operation but it does not refresh regardless of how we set up the trigger/windows.
class FilterAlertDoFn(beam.DoFn):
def process(self, element, alertlist):
print len(alertlist)
print alertlist
… # function logic
alert_input = (p | beam.io.Read(beam.io.BigQuerySource(query=ALERT_QUERY))
| ‘alert_side_input’ >> beam.WindowInto(
beam.window.GlobalWindows(),
trigger=trigger.RepeatedlyTrigger(trigger.AfterWatermark(
late=trigger.AfterCount(1)
)),
accumulation_mode=trigger.AccumulationMode.ACCUMULATING
)
| beam.Map(lambda elem: elem[‘SOMEKEY’])
)
...
main_input | ‘alerts’ >> beam.ParDo(FilterAlertDoFn(), beam.pvalue.AsList(alert_input))
Based on the I/O page here (https://beam.apache.org/documentation/io/built-in/) it says Python SDK supports streaming for the BigQuery Sink only, does that mean that BQ reads are a bounded source and therefore can’t be refreshed in this method?
Trying to set non-global windows on the source results in an empty PCollection in the Side Input.
UPDATE:
When trying to implement the strategy suggested by Pablo's answer, the ParDo operation that uses the side input wont run.
There is a single input source that goes to two output's, one of then using the Side Input. The Non-SideInput will still reach it's destination and the SideInput pipeline wont enter the FilterAlertDoFn().
By substituting the side input for a dummy value the pipeline will enter the function. Is it perhaps waiting for a suitable window that doesn't exist?
With the same FilterAlertDoFn() as above, my side_input and call now look like this:
def refresh_side_input(_):
query = 'select col from table'
client = bigquery.Client(project='gcp-project')
query_job = client.query(query)
return query_job.result()
trigger_input = ( p | 'alert_ref_trigger' >> beam.io.ReadFromPubSub(
subscription=known_args.trigger_subscription))
bigquery_side_input = beam.pvalue.AsSingleton((trigger_input
| beam.WindowInto(beam.window.GlobalWindows(),
trigger=trigger.Repeatedly(trigger.AfterCount(1)),
accumulation_mode=trigger.AccumulationMode.DISCARDING)
| beam.Map(refresh_side_input)
))
...
# Passing this as side input doesn't work
main_input | 'alerts' >> beam.ParDo(FilterAlertDoFn(), bigquery_side_input)
# Passing dummy variable as side input does work
main_input | 'alerts' >> beam.ParDo(FilterAlertDoFn(), [1])
I tried a few different versions of refresh_side_input(), They report the expect result when checking the return inside the function.
UPDATE 2:
I made some minor modifications to Pablo's code, and I get the same behaviour - the DoFn never executes.
In the below example I will see 'in_load_conversion_data' whenever I post to some_other_topic but will never see 'in_DoFn' when posting to some_topic
import apache_beam as beam
import apache_beam.transforms.window as window
from apache_beam.transforms import trigger
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.pipeline_options import SetupOptions
from apache_beam.options.pipeline_options import StandardOptions
def load_my_conversion_data():
return {'EURUSD': 1.1, 'USDMXN': 4.4}
def load_conversion_data(_):
# I will suppose that these are currency conversions. E.g.
# {'EURUSD': 1.1, 'USDMXN' 20,}
print 'in_load_conversion_data'
return load_my_conversion_data()
class ConvertTo(beam.DoFn):
def __init__(self, target_currency):
self.target_currency = target_currency
def process(self, elm, rates):
print 'in_DoFn'
elm = elm.attributes
if elm['currency'] == self.target_currency:
yield elm
elif ' % s % s' % (elm['currency'], self.target_currency) in rates:
rate = rates[' % s % s' % (elm['currency'], self.target_currency)]
result = {}.update(elm).update({'currency': self.target_currency,
'value': elm['value']*rate})
yield result
else:
return # We drop that value
pipeline_options = PipelineOptions()
pipeline_options.view_as(StandardOptions).streaming = True
p = beam.Pipeline(options=pipeline_options)
some_topic = 'projects/some_project/topics/some_topic'
some_other_topic = 'projects/some_project/topics/some_other_topic'
with beam.Pipeline(options=pipeline_options) as p:
table_pcv = beam.pvalue.AsSingleton((
p
| 'some_other_topic' >> beam.io.ReadFromPubSub(topic=some_other_topic, with_attributes=True)
| 'some_other_window' >> beam.WindowInto(window.GlobalWindows(),
trigger=trigger.Repeatedly(trigger.AfterCount(1)),
accumulation_mode=trigger.AccumulationMode.DISCARDING)
| beam.Map(load_conversion_data)))
_ = (p | 'some_topic' >> beam.io.ReadFromPubSub(topic=some_topic)
| 'some_window' >> beam.WindowInto(window.FixedWindows(1))
| beam.ParDo(ConvertTo('USD'), rates=table_pcv))
As you well point out, the Java SDK allows you to use more streaming utilites such as timers and state. These utilities help the implementation of pipelines like these.
The Python SDK lacks some of these utilities, and specifically timers. For that reason, we need to use a hack, where the reload of the side input can be triggered by inserting messages into our some_other_topic in PubSub.
This also means that you have to manually perform a lookup into BigQuery. You can probably use the apache_beam.io.gcp.bigquery_tools.BigQueryWrapper class to perform lookups directly into BigQuery.
Here is an example of a pipeline that refreshes some currency conversion data. I haven't tested it, but I'm 90% sure it'll work with only few adjustments. Let me know if this helps.
pipeline_options = PipelineOptions()
p = beam.Pipeline(options=pipeline_options)
def load_conversion_data(_):
# I will suppose that these are currency conversions. E.g.
# {‘EURUSD’: 1.1, ‘USDMXN’ 20, …}
return external_service.load_my_conversion_data()
table_pcv = beam.pvalue.AsSingleton((
p
| beam.io.gcp.ReadFromPubSub(topic=some_other_topic)
| WindowInto(window.GlobalWindow(),
trigger=trigger.Repeatedly(trigger.AfterCount(1),
accumulation_mode=trigger.AccumulationMode.DISCARDING)
| beam.Map(load_conversion_data)))
class ConvertTo(beam.DoFn):
def __init__(self, target_currency):
self.target_currenct = target_currency
def process(self, elm, rates):
if elm[‘currency’] == self.target_currency:
yield elm
elif ‘%s%s’ % (elm[‘currency’], self.target_currency) in rates:
rate = rates[‘%s%s’ % (elm[‘currency’], self.target_currency)]
result = {}.update(elm).update({‘currency’: self.target_currency,
‘value’: elm[‘value’]*rate})
yield result
else:
return # We drop that value
_ = (p
| beam.io.gcp.ReadFromPubSub(topic=some_topic)
| beam.WindowInto(window.FixedWindows(1))
| beam.ParDo(ConvertTo(‘USD’), rates=table_pcv))
Current situation
The porpouse of this pipeline is to read from pub/sub the payload with geodata, then this data are transformed and analyzed and finally return if a condition is true or false
with beam.Pipeline(options=pipeline_options) as p:
raw_data = (p
| 'Read from PubSub' >> beam.io.ReadFromPubSub(
subscription='projects/XXX/subscriptions/YYY'))
geo_data = (raw_data
| 'Geo data transform' >> beam.Map(lambda s: GeoDataIngestion(s)))
def GeoDataIngestion(string_input):
<...>
return True or False
Desirable situation 1
If the GeoDataIngestion result is true, then the raw_data will be stored in big query
geo_data = (raw_data
| 'Geo data transform' >> beam.Map(lambda s: GeoDataIngestion(s))
| 'Evaluate condition' >> beam.Map(lambda s: Condition(s))
)
def Condition(condition):
if condition:
<...WriteToBigQuery...>
#The class I used before to store raw_data without depending on evaluate condition:
class WriteToBigQuery(beam.PTransform):
def expand(self, pcoll):
return (
pcoll
| 'Format' >> beam.ParDo(FormatBigQueryFn())
| 'Write to BigQuery' >> beam.io.WriteToBigQuery(
'XXX',
schema=TABLE_SCHEMA,
create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))
Desirable situation 2
Instead of store the data in BigQuery, it would be also good to send to pub/sub
def Condition(condition):
if condition:
<...SendToPubSub(Topic1)...>
else:
<...SendToPubSub(Topic2)...>
Here, the problem is to set the Topic depending of the condition result, because i'm not able to pass the topic like parameter in the pipeline
| beam.io.WriteStringsToPubSub(TOPIC)
Neither in a function/class
Question
How can I do that?
How/where should I call WriteToBigQuery to store the PCollection raw_data if the result of Evaluate condition is true?
I think branching collections based on the evaluation condition result might be helpful for your scenario. Please see the documentation here.
To illustrate the branching, suppose I have a collection below, where you want to do different action based on the content of the string.
'this line is for BigQuery',
'this line for pubsub topic1',
'this line for pubsub topic2'
The code below will create tag the collection and you can get three different PCollections based on the tag. Then you can decide what further actions you want to perform on the individual collections.
import apache_beam as beam
from apache_beam import pvalue
import sys
class Split(beam.DoFn):
# These tags will be used to tag the outputs of this DoFn.
OUTPUT_TAG_BQ = 'BigQuery'
OUTPUT_TAG_PS1 = 'pubsub topic1'
OUTPUT_TAG_PS2 = 'pubsub topic2'
def process(self, element):
"""
tags the input as it processes the orignal PCollection
"""
print element
if "BigQuery" in element:
yield pvalue.TaggedOutput(self.OUTPUT_TAG_BQ, element)
print 'found bq'
elif "pubsub topic1" in element:
yield pvalue.TaggedOutput(self.OUTPUT_TAG_PS1, element)
elif "pubsub topic2" in element:
yield pvalue.TaggedOutput(self.OUTPUT_TAG_PS2, element)
if __name__ == '__main__':
output_prefix = 'C:\\pythonVirtual\\Mycodes\\output'
p = beam.Pipeline(argv=sys.argv)
lines = (p
| beam.Create([
'this line is for BigQuery',
'this line for pubsub topic1',
'this line for pubsub topic2']))
# with_outputs allows accessing the explicitly tagged outputs of a DoFn.
tagged_lines_result = (lines
| beam.ParDo(Split()).with_outputs(
Split.OUTPUT_TAG_BQ,
Split.OUTPUT_TAG_PS1,
Split.OUTPUT_TAG_PS2))
# tagged_lines_result is an object of type DoOutputsTuple. It supports
# accessing result in alternative ways.
bq_records = tagged_lines_result[Split.OUTPUT_TAG_BQ]| "write BQ" >> beam.io.WriteToText(output_prefix + 'bq')
ps1_records = tagged_lines_result[Split.OUTPUT_TAG_PS1] | "write PS1" >> beam.io.WriteToText(output_prefix + 'ps1')
ps2_records = tagged_lines_result[Split.OUTPUT_TAG_PS2] | "write PS2" >> beam.io.WriteToText(output_prefix + 'ps2')
p.run().wait_until_finish()
Please let me know if that helps.