I am trying to follow the design pattern for a Slowly Changing Lookup Cache (https://cloud.google.com/blog/products/gcp/guide-to-common-cloud-dataflow-use-case-patterns-part-1) for a streaming pipeline using the Python SDK for Apache Beam on Dataflow.
Our reference table for the lookup cache sits in BigQuery, and we are able to read it and pass it in as a side input to the ParDo operation, but it does not refresh regardless of how we set up the trigger/windows.
class FilterAlertDoFn(beam.DoFn):
    def process(self, element, alertlist):
        print len(alertlist)
        print alertlist
        ...  # function logic


alert_input = (p | beam.io.Read(beam.io.BigQuerySource(query=ALERT_QUERY))
                 | 'alert_side_input' >> beam.WindowInto(
                       beam.window.GlobalWindows(),
                       trigger=trigger.RepeatedlyTrigger(trigger.AfterWatermark(
                           late=trigger.AfterCount(1)
                       )),
                       accumulation_mode=trigger.AccumulationMode.ACCUMULATING
                   )
                 | beam.Map(lambda elem: elem['SOMEKEY'])
              )
...
main_input | 'alerts' >> beam.ParDo(FilterAlertDoFn(), beam.pvalue.AsList(alert_input))
Based on the I/O page (https://beam.apache.org/documentation/io/built-in/), the Python SDK supports streaming for the BigQuery sink only. Does that mean that BigQuery reads are a bounded source and therefore can't be refreshed with this method?
Trying to set non-global windows on the source results in an empty PCollection in the side input.
UPDATE:
When trying to implement the strategy suggested in Pablo's answer, the ParDo operation that uses the side input won't run.
There is a single input source that goes to two outputs, one of them using the side input. The non-side-input branch still reaches its destination, but the side-input branch never enters FilterAlertDoFn().
By substituting a dummy value for the side input, the pipeline does enter the function. Is it perhaps waiting for a suitable window that doesn't exist?
With the same FilterAlertDoFn() as above, my side input and call now look like this:
def refresh_side_input(_):
    query = 'select col from table'
    client = bigquery.Client(project='gcp-project')
    query_job = client.query(query)
    return query_job.result()


trigger_input = (p | 'alert_ref_trigger' >> beam.io.ReadFromPubSub(
    subscription=known_args.trigger_subscription))

bigquery_side_input = beam.pvalue.AsSingleton((trigger_input
    | beam.WindowInto(beam.window.GlobalWindows(),
                      trigger=trigger.Repeatedly(trigger.AfterCount(1)),
                      accumulation_mode=trigger.AccumulationMode.DISCARDING)
    | beam.Map(refresh_side_input)
))
...
# Passing this as side input doesn't work
main_input | 'alerts' >> beam.ParDo(FilterAlertDoFn(), bigquery_side_input)

# Passing a dummy variable as side input does work
main_input | 'alerts' >> beam.ParDo(FilterAlertDoFn(), [1])
I tried a few different versions of refresh_side_input(); they report the expected result when I check the return value inside the function.
UPDATE 2:
I made some minor modifications to Pablo's code, and I get the same behaviour - the DoFn never executes.
In the example below I see 'in_load_conversion_data' whenever I post to some_other_topic, but I never see 'in_DoFn' when posting to some_topic:
import apache_beam as beam
import apache_beam.transforms.window as window
from apache_beam.transforms import trigger
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.pipeline_options import SetupOptions
from apache_beam.options.pipeline_options import StandardOptions


def load_my_conversion_data():
    return {'EURUSD': 1.1, 'USDMXN': 4.4}


def load_conversion_data(_):
    # I will suppose that these are currency conversions. E.g.
    # {'EURUSD': 1.1, 'USDMXN': 20, ...}
    print 'in_load_conversion_data'
    return load_my_conversion_data()


class ConvertTo(beam.DoFn):
    def __init__(self, target_currency):
        self.target_currency = target_currency

    def process(self, elm, rates):
        print 'in_DoFn'
        elm = elm.attributes
        if elm['currency'] == self.target_currency:
            yield elm
        elif '%s%s' % (elm['currency'], self.target_currency) in rates:
            rate = rates['%s%s' % (elm['currency'], self.target_currency)]
            result = {}.update(elm).update({'currency': self.target_currency,
                                            'value': elm['value'] * rate})
            yield result
        else:
            return  # We drop that value


pipeline_options = PipelineOptions()
pipeline_options.view_as(StandardOptions).streaming = True
p = beam.Pipeline(options=pipeline_options)

some_topic = 'projects/some_project/topics/some_topic'
some_other_topic = 'projects/some_project/topics/some_other_topic'

with beam.Pipeline(options=pipeline_options) as p:
    table_pcv = beam.pvalue.AsSingleton((
        p
        | 'some_other_topic' >> beam.io.ReadFromPubSub(topic=some_other_topic, with_attributes=True)
        | 'some_other_window' >> beam.WindowInto(window.GlobalWindows(),
                                                 trigger=trigger.Repeatedly(trigger.AfterCount(1)),
                                                 accumulation_mode=trigger.AccumulationMode.DISCARDING)
        | beam.Map(load_conversion_data)))

    _ = (p | 'some_topic' >> beam.io.ReadFromPubSub(topic=some_topic)
           | 'some_window' >> beam.WindowInto(window.FixedWindows(1))
           | beam.ParDo(ConvertTo('USD'), rates=table_pcv))
As you point out, the Java SDK allows you to use more streaming utilities, such as timers and state. These utilities help with the implementation of pipelines like this one.
The Python SDK lacks some of these utilities, specifically timers. For that reason, we need to use a hack, where the reload of the side input can be triggered by inserting messages into our some_other_topic in Pub/Sub.
This also means that you have to manually perform a lookup into BigQuery. You can probably use the apache_beam.io.gcp.bigquery_tools.BigQueryWrapper class to perform lookups directly into BigQuery.
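For reference, a minimal sketch of such a manual lookup, using the google-cloud-bigquery client rather than BigQueryWrapper, and returning a plain dict that can be passed around as a side input (the project, table and column names here are made up):

from google.cloud import bigquery


def load_conversion_data(_):
    # Re-query the reference table on every trigger message and return a
    # plain dict (hypothetical table/columns; adjust to your schema).
    client = bigquery.Client(project='gcp-project')
    rows = client.query('SELECT pair, rate FROM refdata.conversion_rates').result()
    return {row['pair']: row['rate'] for row in rows}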
Here is an example of a pipeline that refreshes some currency conversion data. I haven't tested it, but I'm 90% sure it'll work with only a few adjustments. Let me know if this helps.
pipeline_options = PipelineOptions()
p = beam.Pipeline(options=pipeline_options)


def load_conversion_data(_):
    # I will suppose that these are currency conversions. E.g.
    # {'EURUSD': 1.1, 'USDMXN': 20, ...}
    return external_service.load_my_conversion_data()


table_pcv = beam.pvalue.AsSingleton((
    p
    | beam.io.ReadFromPubSub(topic=some_other_topic)
    | beam.WindowInto(window.GlobalWindows(),
                      trigger=trigger.Repeatedly(trigger.AfterCount(1)),
                      accumulation_mode=trigger.AccumulationMode.DISCARDING)
    | beam.Map(load_conversion_data)))


class ConvertTo(beam.DoFn):
    def __init__(self, target_currency):
        self.target_currency = target_currency

    def process(self, elm, rates):
        if elm['currency'] == self.target_currency:
            yield elm
        elif '%s%s' % (elm['currency'], self.target_currency) in rates:
            rate = rates['%s%s' % (elm['currency'], self.target_currency)]
            result = {}.update(elm).update({'currency': self.target_currency,
                                            'value': elm['value'] * rate})
            yield result
        else:
            return  # We drop that value


_ = (p
     | beam.io.ReadFromPubSub(topic=some_topic)
     | beam.WindowInto(window.FixedWindows(1))
     | beam.ParDo(ConvertTo('USD'), rates=table_pcv))
I am trying to set up a simple batch ETL process on Dataflow for learning purposes. This is the flow I have implemented:
Cloud Storage > PubSub > Cloud Function > DataFlow > Cloud Storage
A Pub/Sub topic publishes a message whenever a new file is uploaded to a bucket. A Cloud Function then listens to a subscription on that topic and starts a Dataflow job that reads the file, processes the data and saves the result to a new file in the same bucket.
I have been able to implement all of this logic; however, I am struggling with starting the Dataflow job from the Cloud Function. My function starts the job without any problem, but after a few minutes the worker shows the following error message:
Error message from worker: Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/dataflow_worker/batchworker.py", line 773, in run
    self._load_main_session(self.local_staging_directory)
  File "/usr/local/lib/python3.7/site-packages/dataflow_worker/batchworker.py", line 514, in _load_main_session
    pickler.load_session(session_file)
  File "/usr/local/lib/python3.7/site-packages/apache_beam/internal/pickler.py", line 311, in load_session
    return dill.load_session(file_path)
  File "/usr/local/lib/python3.7/site-packages/dill/_dill.py", line 368, in load_session
    module = unpickler.load()
  File "/usr/local/lib/python3.7/site-packages/dill/_dill.py", line 472, in load
    obj = StockUnpickler.load(self)
  File "/usr/local/lib/python3.7/site-packages/dill/_dill.py", line 827, in _import_module
    return getattr(__import__(module, None, None, [obj]), obj)
ModuleNotFoundError: No module named 'google.cloud.functions'
The important part of the error is:
ModuleNotFoundError: No module named 'google.cloud.functions'
My Cloud Function directory looks like this:
/
requirements.txt
main.py
pipeline.py
requirements.txt
# Function dependencies, for example:
# package>=version
apache-beam[gcp]
main.py
import base64
import json

from pipeline import run


def start_job(event, context):
    message = base64.b64decode(event['data']).decode('utf-8')
    message = json.loads(message)
    bucket = message['bucket']
    filename = message['name']

    if filename.startswith('raw/'):
        run(bucket, filename)
        print('Job sent to Dataflow')
    else:
        print('File uploaded to unknown directory: {}'.format(filename))
pipeline.py
import apache_beam as beam
from datetime import datetime
import argparse

from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.pipeline_options import GoogleCloudOptions
from apache_beam.options.pipeline_options import StandardOptions
from apache_beam.options.pipeline_options import SetupOptions

options = PipelineOptions()

google_cloud_options = options.view_as(GoogleCloudOptions)
google_cloud_options.project = "dpto-bigdata"
google_cloud_options.region = "europe-west1"
google_cloud_options.job_name = "pipeline-test"
google_cloud_options.staging_location = "gs://services-files/staging/"
google_cloud_options.temp_location = "gs://services-files/temp/"

# options.view_as(StandardOptions).runner = "DirectRunner"  # use this for debugging
options.view_as(StandardOptions).runner = "DataFlowRunner"
options.view_as(SetupOptions).save_main_session = True

output_suffix = '.csv'
output_header = 'Name,Total,HP,Attack,Defence,Sp_attack,Sp_defence,Speed,Average'


def run(bucket, filename):
    source_file = 'gs://{}/{}'.format(bucket, filename)
    now = datetime.now().strftime('%Y%m%d-%H%M%S')
    output_prefix = 'gs://{}/processed/{}'.format(bucket, now)

    with beam.Pipeline(options=options) as p:
        raw_values = (
            p
            | "Read from Cloud Storage" >> beam.io.ReadFromText(source_file, skip_header_lines=1)
            | "Split columns" >> beam.Map(lambda x: x.split(','))
            | "Cleanup entries" >> beam.ParDo(ElementCleanup())
            | "Calculate average stats" >> beam.Map(calculate_average)
            | "Format output" >> beam.Map(format_output)
            | "Write to Cloud Storage" >> beam.io.WriteToText(file_path_prefix=output_prefix,
                                                              file_name_suffix=output_suffix,
                                                              header=output_header)
        )


class ElementCleanup(beam.DoFn):
    def __init__(self):
        self.transforms = self.map_transforms()

    def map_transforms(self):
        return [
            [self.trim, self.to_lowercase],  # Name
            [self.trim, self.to_float],      # Total
            [self.trim, self.to_float],      # HP
            [self.trim, self.to_float],      # Attack
            [self.trim, self.to_float],      # Defence
            [self.trim, self.to_float],      # Sp_attack
            [self.trim, self.to_float],      # Sp_defence
            [self.trim, self.to_float]       # Speed
        ]

    def process(self, row):
        return [self.clean_row(row, self.transforms)]

    def clean_row(self, row, transforms):
        cleaned = []
        for idx, col in enumerate(row):
            for func in transforms[idx]:
                col = func(col)
            cleaned.append(col)
        return cleaned

    def to_lowercase(self, col: str):
        return col.lower()

    def trim(self, col: str):
        return col.strip()

    def to_float(self, col: str):
        return (float(col) if col != None else None)


def calculate_average(row):
    average = round(sum(row[2:]) / len(row[2:]), 2)
    row.append(average)
    return row


def format_output(row):
    row = [str(col) for col in row]
    return ','.join(row)


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--bucket",
        help="Bucket to read from."
    )
    parser.add_argument(
        "--filename",
        help="File to read from."
    )
    args = parser.parse_args()

    run(args.bucket, args.filename)
I have been reading about this topic for a while. Before this error, I had a similar one showing ModuleNotFoundError: No module named 'main'. I was able to fix that by adding the pipeline option options.view_as(SetupOptions).save_main_session = True, but I have not found any solution to the error I am currently facing.
I expected the Dataflow workers not to depend on the Cloud Function once the pipeline job had started, but it seems they are still trying to communicate with it somehow.
I think the best approach here would be to use templates, since you are not changing the code, only the path. Once you have the template, you can just make an API call to launch it. It will surely be less hassle to set up and probably more resilient, since you would not depend as much on Cloud Functions.
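For illustration, a hedged sketch of launching a pre-staged (classic) template from the Cloud Function via the Dataflow REST API; the template path, region, job name and parameter names below are placeholders:

from googleapiclient.discovery import build


def start_job(event, context):
    # Launch a pre-staged Dataflow template instead of constructing the
    # pipeline inside the Cloud Function (all names are placeholders).
    dataflow = build('dataflow', 'v1b3')
    request = dataflow.projects().locations().templates().launch(
        projectId='dpto-bigdata',
        location='europe-west1',
        gcsPath='gs://services-files/templates/pipeline-template',
        body={
            'jobName': 'pipeline-test',
            'parameters': {'input': 'gs://some-bucket/raw/some-file.csv'},
        },
    )
    response = request.execute()
    print(response)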
There's another approach that I think would be even better, which doesn't require Cloud Functions. You could use something like MatchAll.continuously from Java. There is no Python counterpart for it yet, but I took the liberty of creating a version that does the same thing and sending a pull request for a new PTransform.
The idea is that every X seconds you check for new files and process them in your pipeline.
If you don't want to wait for the pull request to be merged, you can just copy the transform:
import logging

import apache_beam as beam
from apache_beam.coders import StrUtf8Coder
from apache_beam.io.fileio import MatchAll
from apache_beam.transforms.periodicsequence import PeriodicImpulse
from apache_beam.transforms.userstate import BagStateSpec
from apache_beam.utils.timestamp import MAX_TIMESTAMP, Timestamp

_LOGGER = logging.getLogger(__name__)


class MatchContinuously(beam.PTransform):
    def __init__(
        self,
        file_pattern,
        interval=360.0,
        has_deduplication=True,
        start_timestamp=Timestamp.now(),
        stop_timestamp=MAX_TIMESTAMP):
        self.file_pattern = file_pattern
        self.interval = interval
        self.has_deduplication = has_deduplication
        self.start_ts = start_timestamp
        self.stop_ts = stop_timestamp

    def expand(self, pcol):
        impulse = pcol | PeriodicImpulse(
            start_timestamp=self.start_ts,
            stop_timestamp=self.stop_ts,
            fire_interval=self.interval)

        match_files = (
            impulse
            | beam.Map(lambda x: self.file_pattern)
            | MatchAll())

        if self.has_deduplication:
            match_files = (
                match_files
                # Making a Key Value so each file has its own state.
                | "To KV" >> beam.Map(lambda x: (x.path, x))
                | "Remove Already Read" >> beam.ParDo(_RemoveDuplicates()))

        return match_files


class _RemoveDuplicates(beam.DoFn):
    FILES_STATE = BagStateSpec('files', StrUtf8Coder())

    def process(self, element, file_state=beam.DoFn.StateParam(FILES_STATE)):
        path = element[0]
        file_metadata = element[1]
        bag_content = [x for x in file_state.read()]

        if not bag_content:
            file_state.add(path)
            _LOGGER.info("Generated entry for file %s", path)
            yield file_metadata
        else:
            _LOGGER.info("File %s was already read", path)
An example pipeline:
(p | MatchContinuously("gs://apache-beam-samples/shakespeare/*", 180)
   | Map(lambda x: x.path)
   | ReadAllFromText()
   | Map(lambda x: logging.info(x))
)
A third approach could be to keep using the GCS notifications and use Pub/Sub + MatchAll. The pipeline would look like:
(p | ReadFromPubSub(topic)
   | MatchAll())
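Fleshing that out a little (untested sketch; it assumes the Pub/Sub messages are GCS notifications whose JSON payload carries the bucket and name fields):

import json

import apache_beam as beam
from apache_beam.io import ReadFromPubSub, ReadAllFromText
from apache_beam.io.fileio import MatchAll


def to_gcs_path(message):
    # GCS notifications carry the bucket and object name in the JSON payload.
    payload = json.loads(message.decode('utf-8'))
    return 'gs://%s/%s' % (payload['bucket'], payload['name'])


files = (p
         | ReadFromPubSub(topic=topic)
         | beam.Map(to_gcs_path)
         | MatchAll()
         | beam.Map(lambda metadata: metadata.path)
         | ReadAllFromText())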
Depending on the frequency of the new files and on whether you want to use notifications or not, you can decide between the three approaches.
Don't do it that way.
The rationale is:
Cloud Functions (scale infinitely) - good for processing events that happen infrequently.
Cloud Run - like Cloud Functions, but you can process more than one event concurrently - good for events that come in bursts (in that case it is cheaper than Cloud Functions).
Dataflow reading from Pub/Sub - for processing batches (files) or streams (Pub/Sub/Kafka) of data that arrive frequently / in a steady way.
Triggering a dataflow job for every file is really inefficient both in time and cost (minutes and $).
If you need to continuously respond to file notifications (finalize, delete, etc.) using Dataflow, you should send the storage notifications to a Pub/Sub topic and read them from a Dataflow subscription. Note that this works only for streaming.
If that's your use case, use ReadFromPubSub to read the storage notifications:
with beam.Pipeline(options=pipeline_options) as pipeline:
    pubsub_msgs = pipeline | (
        'Read PubSub Messages' >> beam.io.gcp.pubsub.ReadFromPubSub(
            subscription=global_vars.input_subscription))
I have a Dataflow pipeline written in Python, and this is what it does:
Read messages from Pub/Sub. The messages are zipped protocol buffers, and one message received on Pub/Sub contains multiple types of messages. See the parent message's protocol specification below:
message BatchEntryPoint {
  /**
   * EntryPoint
   *
   * Description: Encapsulation message
   */
  message EntryPoint {
    // Proto Message
    google.protobuf.Any proto = 1;

    // Timestamp
    google.protobuf.Timestamp timestamp = 4;
  }

  // Array of EntryPoint messages
  repeated EntryPoint entrypoints = 1;
}
So, to explain a bit better: I have several protobuf messages, and each one must be packed into the proto field of an EntryPoint message. We send several messages at once because of MQTT limitations, which is why we use a repeated field of EntryPoint messages in BatchEntryPoint.
Parsing the received messages.
Nothing fancy here, just unzipping and deserializing the message we just read from Pub/Sub to get human-readable data (a rough sketch of this step follows the list below).
Looping over BatchEntryPoint to evaluate each EntryPoint message.
As each message in a BatchEntryPoint can have a different type, we need to process them differently.
Parsing the message data.
Doing different processing to get all the information I need and formatting it into a BigQuery-readable form.
Writing the data to BigQuery.
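For context, a rough (untested) sketch of the parsing and dispatch steps, assuming zlib compression and generated protobuf modules named batch_pb2 and system_pb2 (both module names are made up):

import zlib

from myproject.protos import batch_pb2, system_pb2  # hypothetical generated modules


def parse_batch(payload):
    batch = batch_pb2.BatchEntryPoint()
    batch.ParseFromString(zlib.decompress(payload))
    for entrypoint in batch.entrypoints:
        # Dispatch on the concrete type packed inside the Any field.
        if entrypoint.proto.Is(system_pb2.System.DESCRIPTOR):
            msg = system_pb2.System()
            entrypoint.proto.Unpack(msg)
            yield 'System', msg, entrypoint.timestamp.ToDatetime()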
This is where my 'trouble' begins: my code works, but in my opinion it is very dirty and hard to maintain.
There are two things to be aware of.
Each message type can be sent to 3 different datasets: an r&d dataset, a dev dataset and a production dataset.
Let's say I have a message named System.
It could go to:
my-project:rd_dataset.system
my-project:dev_dataset.system
my-project:prod_dataset.system
So this is what I am doing now:
console_records | 'Write to Console BQ' >> beam.io.WriteToBigQuery(
    lambda e: 'my-project:rd_dataset.table1' if dataset_is_rd_table1(e) else (
        'my-project:dev_dataset.table1' if dataset_is_dev_table1(e) else (
            'my-project:prod_dataset.table1' if dataset_is_prod_table1(e) else (
                'my-project:rd_dataset.table2' if dataset_is_rd_table2(e) else (
                    'my-project:dev_dataset.table2' if dataset_is_dev_table2(e) else (
                        ...) else 0
I have more than 30 different types of messages, which makes more than 90 lines just for inserting data into BigQuery.
Here is what a dataset_is_..._tableX method looks like:
def dataset_is_rd_messagestype(element) -> bool:
    """ check if env is rd for message's type message """
    valid: bool = False
    is_type = check_element_type(element, 'MessagesType')
    if is_type:
        valid = dataset_is_rd(element)
    return valid
check_element_type checks that the message has the right type (e.g. System).
dataset_is_rd looks like this:
def dataset_is_rd(element) -> bool:
    """ Check if dataset should be RD from registry id """
    if element['device_registry_id'] == 'rd':
        del element['device_registry_id']
        del element['bq_type']
        return True
    return False
The element has a key telling us to which dataset the message must be sent.
So this works as expected, but I wish I could write cleaner code and maybe reduce the amount of code to change when adding or deleting a message type.
Any ideas?
How about using TaggedOutput?
Or can you write something like this instead:
def dataset_type(element) -> str:
    """ Check if dataset should be RD from registry id """
    dev_registry = element['device_registry_id']
    del element['device_registry_id']
    del element['bq_type']
    table_type = get_element_type(element, 'MessagesType')
    return 'my-project:%s_dataset.table%d' % (dev_registry, table_type)
And use that as the table lambda that you pass to BQ?
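For what it's worth, WriteToBigQuery accepts a callable as its table argument, so the idea above could be wired up roughly like this (untested sketch; the write disposition is a placeholder):

console_records | 'Write to BQ' >> beam.io.WriteToBigQuery(
    table=dataset_type,  # called per element to compute the destination table
    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)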
So I managed to write code that inserts data into dynamically named tables by crafting the table name on the fly.
This is not perfect, because I have to modify the element I pass to the method; however, I am still very happy with the result. It has cleaned up my code by hundreds of lines. If I have a new table, adding it takes one line in an array, compared to 6 lines in the pipeline before.
Here is my solution:
def batch_pipeline(pipeline):
    console_message = (
        pipeline
        | 'Get console\'s message from pub/sub' >> beam.io.ReadFromPubSub(
            subscription='sub1',
            with_attributes=True)
    )
    common_message = (
        pipeline
        | 'Get common\'s message from pub/sub' >> beam.io.ReadFromPubSub(
            subscription='sub2',
            with_attributes=True)
    )
    jetson_message = (
        pipeline
        | 'Get jetson\'s message from pub/sub' >> beam.io.ReadFromPubSub(
            subscription='sub3',
            with_attributes=True)
    )

    message = (console_message, common_message, jetson_message) | beam.Flatten()
    clear_message = message | beam.ParDo(GetClearMessage())
    console_bytes = clear_message | beam.ParDo(SetBytesData())

    console_bytes | 'Write to big query back up table' >> beam.io.WriteToBigQuery(
        lambda e: write_to_backup(e)
    )

    records = clear_message | beam.ParDo(GetProtoData())
    gps_records = clear_message | 'Get GPS Data' >> beam.ParDo(GetProtoData())
    parsed_gps = gps_records | 'Parse GPS Data' >> beam.ParDo(ParseGps())

    if parsed_gps:
        parsed_gps | 'Write to big query gps table' >> beam.io.WriteToBigQuery(
            lambda e: write_gps(e)
        )

    records | 'Write to big query table' >> beam.io.WriteToBigQuery(
        lambda e: write_to_bq(e)
    )
So the pipeline reads from 3 different Pub/Sub subscriptions, extracts the data and writes it to BigQuery.
The structure of an element used by WriteToBigQuery looks like this:
obj = {
    'data': data_to_write_on_bq,
    'registry_id': data_needed_to_craft_table_name,
    'gcloud_id': data_to_write_on_bq,
    'proto_type': data_needed_to_craft_table_name
}
and then one of the methods used in the lambdas passed to WriteToBigQuery looks like this:
def write_to_bq(e):
    logging.info(e)
    element = copy(e)
    registry = element['registry_id']
    logging.info(registry)
    dataset = set_dataset(registry)  # set dataset name, knowing the registry, this is to set the environment (dev/prod/rd/...)
    proto_type = element['proto_type']
    logging.info('Proto Type %s', proto_type)
    table_name = reduce(lambda x, y: x + ('_' if y.isupper() else '') + y, proto_type).lower()
    full_table_name = f'my_project:{dataset}.{table_name}'
    logging.info(full_table_name)
    del e['registry_id']
    del e['proto_type']
    return full_table_name
And that's it, after 3 days of trouble !!
I am trying to figure out the performance difference between Map and ParDo, but I cannot get the ParDo method to run.
I have already tried to find resources that address this problem, but I did not find any.
ParDo Method (This does not work):
class ci(beam.DoFn):
    def compute_interest(self, data_item):
        cust_id, cust_data = data_item
        if(cust_data['basic'][0]['acc_opened_date']=='2010-10-10'):
            new_data = {}
            new_data['new_balance'] = (cust_data['account'][0]['cur_bal'] * cust_data['account'][0]['roi']) / 100
            new_data.update(cust_data['account'][0])
            new_data.update(cust_data['basic'][0])
            del new_data['cur_bal']
            return new_data
Map Method (This works):
def compute_interest(data_item):
    cust_id, cust_data = data_item
    if(cust_data['basic'][0]['acc_opened_date']=='2010-10-10'):
        new_data = {}
        new_data['new_balance'] = (cust_data['account'][0]['cur_bal'] * cust_data['account'][0]['roi']) / 100
        new_data.update(cust_data['account'][0])
        new_data.update(cust_data['basic'][0])
        del new_data['cur_bal']
        return new_data
ERROR:
raise NotImplementedError
RuntimeError: NotImplementedError [while running 'PIPELINE NAME']
beam.DoFn expects a process method instead:
def process(self, element):
As explained in section 4.2.1.2 of the Beam programming guide:
Inside your DoFn subclass, you’ll write a method process where you provide the actual processing logic. You don’t need to manually extract the elements from the input collection; the Beam SDKs handle that for you. Your process method should accept an object of type element. This is the input element and output is emitted by using yield or return statement inside process method.
As an example we'll define both Map and ParDo functions:
def compute_interest_map(data_item):
    return data_item + 1


class compute_interest_pardo(beam.DoFn):
    def process(self, element):
        yield element + 2
If you give the method any name other than process, you'll get the NotImplementedError.
And the main pipeline will be:
events = (p
          | 'Create' >> beam.Create([1, 2, 3])
          | 'Add 1' >> beam.Map(lambda x: compute_interest_map(x))
          | 'Add 2' >> beam.ParDo(compute_interest_pardo())
          | 'Print' >> beam.ParDo(log_results()))
Output:
INFO:root:>> Interest: 4
INFO:root:>> Interest: 5
INFO:root:>> Interest: 6
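log_results isn't defined in the snippet above; a minimal version matching that output (the name is assumed from the pipeline) could be:

import logging


class log_results(beam.DoFn):
    def process(self, element):
        # Log each element in the '>> Interest: N' format shown above.
        logging.info(">> Interest: %s", element)
        yield element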
Current situation
The purpose of this pipeline is to read a payload with geodata from Pub/Sub; this data is then transformed and analyzed, and finally the pipeline returns whether a condition is true or false.
with beam.Pipeline(options=pipeline_options) as p:
    raw_data = (p
                | 'Read from PubSub' >> beam.io.ReadFromPubSub(
                    subscription='projects/XXX/subscriptions/YYY'))

    geo_data = (raw_data
                | 'Geo data transform' >> beam.Map(lambda s: GeoDataIngestion(s)))


def GeoDataIngestion(string_input):
    <...>
    return True or False
Desirable situation 1
If the GeoDataIngestion result is true, then raw_data will be stored in BigQuery.
geo_data = (raw_data
            | 'Geo data transform' >> beam.Map(lambda s: GeoDataIngestion(s))
            | 'Evaluate condition' >> beam.Map(lambda s: Condition(s))
            )


def Condition(condition):
    if condition:
        <...WriteToBigQuery...>


# The class I used before to store raw_data without depending on evaluate condition:
class WriteToBigQuery(beam.PTransform):
    def expand(self, pcoll):
        return (
            pcoll
            | 'Format' >> beam.ParDo(FormatBigQueryFn())
            | 'Write to BigQuery' >> beam.io.WriteToBigQuery(
                'XXX',
                schema=TABLE_SCHEMA,
                create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))
Desirable situation 2
Instead of storing the data in BigQuery, it would also be good to send it to Pub/Sub.
def Condition(condition):
    if condition:
        <...SendToPubSub(Topic1)...>
    else:
        <...SendToPubSub(Topic2)...>
Here, the problem is setting the topic depending on the condition result, because I'm not able to pass the topic as a parameter in the pipeline:
| beam.io.WriteStringsToPubSub(TOPIC)
Nor in a function/class.
Question
How can I do that?
How/where should I call WriteToBigQuery to store the PCollection raw_data if the result of Evaluate condition is true?
I think branching collections based on the evaluation condition result might be helpful for your scenario. Please see the documentation here.
To illustrate the branching, suppose I have the collection below, where a different action should be taken based on the content of each string:
'this line is for BigQuery',
'this line for pubsub topic1',
'this line for pubsub topic2'
The code below tags the collection, and you get three different PCollections based on the tags. You can then decide what further actions to perform on each individual collection.
import apache_beam as beam
from apache_beam import pvalue
import sys


class Split(beam.DoFn):
    # These tags will be used to tag the outputs of this DoFn.
    OUTPUT_TAG_BQ = 'BigQuery'
    OUTPUT_TAG_PS1 = 'pubsub topic1'
    OUTPUT_TAG_PS2 = 'pubsub topic2'

    def process(self, element):
        """
        tags the input as it processes the original PCollection
        """
        print element
        if "BigQuery" in element:
            yield pvalue.TaggedOutput(self.OUTPUT_TAG_BQ, element)
            print 'found bq'
        elif "pubsub topic1" in element:
            yield pvalue.TaggedOutput(self.OUTPUT_TAG_PS1, element)
        elif "pubsub topic2" in element:
            yield pvalue.TaggedOutput(self.OUTPUT_TAG_PS2, element)


if __name__ == '__main__':
    output_prefix = 'C:\\pythonVirtual\\Mycodes\\output'

    p = beam.Pipeline(argv=sys.argv)

    lines = (p
             | beam.Create([
                 'this line is for BigQuery',
                 'this line for pubsub topic1',
                 'this line for pubsub topic2']))

    # with_outputs allows accessing the explicitly tagged outputs of a DoFn.
    tagged_lines_result = (lines
                           | beam.ParDo(Split()).with_outputs(
                               Split.OUTPUT_TAG_BQ,
                               Split.OUTPUT_TAG_PS1,
                               Split.OUTPUT_TAG_PS2))

    # tagged_lines_result is an object of type DoOutputsTuple. It supports
    # accessing results in alternative ways.
    bq_records = tagged_lines_result[Split.OUTPUT_TAG_BQ] | "write BQ" >> beam.io.WriteToText(output_prefix + 'bq')
    ps1_records = tagged_lines_result[Split.OUTPUT_TAG_PS1] | "write PS1" >> beam.io.WriteToText(output_prefix + 'ps1')
    ps2_records = tagged_lines_result[Split.OUTPUT_TAG_PS2] | "write PS2" >> beam.io.WriteToText(output_prefix + 'ps2')

    p.run().wait_until_finish()
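To adapt this to the question, the three WriteToText placeholders could be swapped for the actual sinks in a streaming pipeline, roughly like this (untested sketch; the table, schema and topic names are placeholders, and WriteToPubSub expects bytes):

tagged_lines_result[Split.OUTPUT_TAG_BQ] | "to BQ" >> beam.io.WriteToBigQuery(
    'my-project:some_dataset.some_table',
    schema=TABLE_SCHEMA,
    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)

tagged_lines_result[Split.OUTPUT_TAG_PS1] | "encode 1" >> beam.Map(
    lambda s: s.encode('utf-8')) | "to PS1" >> beam.io.WriteToPubSub(
        topic='projects/my-project/topics/topic1')

tagged_lines_result[Split.OUTPUT_TAG_PS2] | "encode 2" >> beam.Map(
    lambda s: s.encode('utf-8')) | "to PS2" >> beam.io.WriteToPubSub(
        topic='projects/my-project/topics/topic2')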
Please let me know if that helps.
I was using FileSystems inside a ParDo to be able to write to dynamic destinations in data storage.
However, I was not able to get automatic sharding the way Text.IO does with a wildcard in the file name.
Is there a way to do automatic sharding with FileSystems.create?
Edited:
This is the pipeline I ran; the part of the code in question is WriteToStorage, where I want to write the results to date{week}at{year}/results*.json
with beam.Pipeline(options=pipeline_options) as p:
    pcoll = (p | ReadFromText(known_args.input)
               | beam.ParDo(WriteDecompressedFile())
               | beam.Map(lambda x: ('{week}at{year}'.format(week=x['week'], year=x['year']), x))
               | beam.GroupByKey()
               | beam.ParDo(WriteToStorage()))
Here's the current version of WriteToStorage()
class WriteToStorage(beam.DoFn):
    def __init__(self):
        self.writer = None

    def process(self, element):
        (key, val) = element
        week, year = [int(x) for x in key.split('at')]
        if self.writer == None:
            path = known_args.output + 'date-{week}at{year}/revisions-from-{rev}.json'.format(week=week, year=year, rev=element['rev_id'])
            self.writer = filesystems.FileSystems.create(path)
            logging.info('USERLOG: Write to path %s.' % path)
        logging.info('TESTLOG: %s.' % type(val))
        for output in val:
            self.writer.write(json.dumps(output) + '\n')

    def finish_bundle(self):
        if not(self.writer == None):
            self.writer.close()
Thank you.
You can use the start_bundle() method of your DoFn to open a connection to a new file for each bundle.
However, you have to figure out a way to independently name your files. Text.IO seems to do this with _RoundRobinKeyFn.
A simpler way would be to use a timestamp for name generation, but I don't know how foolproof this would be.
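A rough sketch of that idea (untested; the timestamp-plus-uuid naming scheme is just one possibility, and known_args comes from the question above):

import json
import time
import uuid

import apache_beam as beam
from apache_beam.io import filesystems


class WriteToStorage(beam.DoFn):
    def start_bundle(self):
        self.writers = {}

    def process(self, element):
        (key, val) = element
        if key not in self.writers:
            # One uniquely named file per key and per bundle.
            path = '{output}date-{key}/results-{ts}-{uid}.json'.format(
                output=known_args.output, key=key,
                ts=int(time.time()), uid=uuid.uuid4().hex)
            self.writers[key] = filesystems.FileSystems.create(path)
        for output in val:
            self.writers[key].write(json.dumps(output) + '\n')

    def finish_bundle(self):
        for writer in self.writers.values():
            writer.close()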