How to run multiple WriteToBigQuery in parallel in Google Cloud Dataflow / Apache Beam? - python

I want to separate events out of a batch of mixed events. Given this data:
{"type": "A", "k1": "v1"}
{"type": "B", "k2": "v2"}
{"type": "C", "k3": "v3"}
I want to route type A events to table A in BigQuery, type B events to table B, and type C events to table C.
Here is my code, implemented with the Apache Beam Python SDK, that writes the data to BigQuery:
import argparse
import json

import apache_beam as beam
from apache_beam import pvalue
from apache_beam.io import ReadFromText
from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions

A_schema = 'type:string, k1:string'
B_schema = 'type:string, k2:string'
C_schema = 'type:string, k2:string'

class ParseJsonDoFn(beam.DoFn):
    A_TYPE = 'tag_A'
    B_TYPE = 'tag_B'
    C_TYPE = 'tag_C'

    def process(self, element):
        text_line = element.strip()
        data = json.loads(text_line)
        if data['type'] == 'A':
            yield pvalue.TaggedOutput(self.A_TYPE, data)
        elif data['type'] == 'B':
            yield pvalue.TaggedOutput(self.B_TYPE, data)
        elif data['type'] == 'C':
            yield pvalue.TaggedOutput(self.C_TYPE, data)

def run(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument('--input',
                        dest='input',
                        default='data/path/data',
                        help='Input file to process.')
    known_args, pipeline_args = parser.parse_known_args(argv)
    pipeline_args.extend([
        '--runner=DirectRunner',
        '--project=project-id',
        '--job_name=seperate-bi-events-job',
    ])
    pipeline_options = PipelineOptions(pipeline_args)
    pipeline_options.view_as(SetupOptions).save_main_session = True
    with beam.Pipeline(options=pipeline_options) as p:
        lines = p | ReadFromText(known_args.input)

        multiple_lines = (
            lines
            | 'ParseJSON' >> (beam.ParDo(ParseJsonDoFn()).with_outputs(
                  ParseJsonDoFn.A_TYPE,
                  ParseJsonDoFn.B_TYPE,
                  ParseJsonDoFn.C_TYPE)))

        a_line = multiple_lines.tag_A
        b_line = multiple_lines.tag_B
        c_line = multiple_lines.tag_C

        (a_line
         | "output_a" >> beam.io.WriteToBigQuery(
               'temp.a',
               schema=A_schema,
               write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
               create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED
           ))

        (b_line
         | "output_b" >> beam.io.WriteToBigQuery(
               'temp.b',
               schema=B_schema,
               write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
               create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED
           ))

        (c_line
         | "output_c" >> beam.io.WriteToBigQuery(
               'temp.c',
               schema=C_schema,
               write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
               create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED
           ))

    p.run().wait_until_finish()
The output:
INFO:root:start <DoOperation output_banner/WriteToBigQuery output_tags=['out']>
INFO:oauth2client.transport:Attempting refresh to obtain initial access_token
INFO:oauth2client.client:Refreshing access_token
WARNING:root:Sleeping for 150 seconds before the write as BigQuery inserts can be routed to deleted table for 2 mins after the delete and create.
INFO:root:start <DoOperation output_banner/WriteToBigQuery output_tags=['out']>
INFO:oauth2client.transport:Attempting refresh to obtain initial access_token
INFO:oauth2client.client:Refreshing access_token
WARNING:root:Sleeping for 150 seconds before the write as BigQuery inserts can be routed to deleted table for 2 mins after the delete and create.
INFO:root:start <DoOperation output_banner/WriteToBigQuery output_tags=['out']>
INFO:oauth2client.transport:Attempting refresh to obtain initial access_token
INFO:oauth2client.client:Refreshing access_token
WARNING:root:Sleeping for 150 seconds before the write as BigQuery inserts can be routed to deleted table for 2 mins after the delete and create.
However, there are two issues here:
There is no data in BigQuery?
From the logs it seems the code does NOT run in parallel, but rather runs three times in sequence?
Is there something wrong in my code, or am I missing something?

There is no data in BigQuery?
Your code seems to be perfectly fine, and data is written to BigQuery (note that C_schema should use k3 instead of k2). Keep in mind that you are streaming data, so you won't see it if you click on the Preview table button until the data in the streaming buffer is committed. Running a SELECT * query will display the expected results.
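For example, you can check from Python with the BigQuery client library (a minimal sketch using google-cloud-bigquery; the temp.a table is the one from the question and project-id is a placeholder):
from google.cloud import bigquery

# Rows in the streaming buffer show up in query results even before they
# appear in the table preview.
client = bigquery.Client(project='project-id')
for row in client.query('SELECT * FROM `project-id.temp.a`').result():
    print(dict(row))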
From the logs it seems the code does NOT run in parallel, but rather runs three times in sequence?
Ok, this is interesting. By tracing the WARNING message in the Beam source code we can read the following:
# if write_disposition == BigQueryDisposition.WRITE_TRUNCATE we delete
# the table before this point.
if write_disposition == BigQueryDisposition.WRITE_TRUNCATE:
    # BigQuery can route data to the old table for 2 mins max so wait
    # that much time before creating the table and writing it
    logging.warning('Sleeping for 150 seconds before the write as ' +
                    'BigQuery inserts can be routed to deleted table ' +
                    'for 2 mins after the delete and create.')
    # TODO(BEAM-2673): Remove this sleep by migrating to load api
    time.sleep(150)
    return created_table
else:
    return created_table
After reading BEAM-2673 and BEAM-2801, it seems this is due to an issue with the BigQuery sink using the Streaming API with the DirectRunner. This causes the process to sleep for 150 s when re-creating each table, and it is not done in parallel.
If, instead, we run it on Dataflow (using the DataflowRunner, providing a staging and temp bucket path, and loading the input data from GCS too), it runs the three import jobs in parallel: in my test run all three started at 22:19:45 and finished at 22:19:56.
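As a rough sketch, switching to the DataflowRunner mainly means changing the pipeline arguments and reading the input from GCS (the region, bucket and paths below are placeholders, not taken from the question):
pipeline_args.extend([
    '--runner=DataflowRunner',
    '--project=project-id',
    '--region=us-central1',                         # placeholder region
    '--staging_location=gs://your-bucket/staging',  # placeholder bucket
    '--temp_location=gs://your-bucket/temp',        # placeholder bucket
    '--job_name=seperate-bi-events-job',
])
# and point --input at a GCS path, e.g. gs://your-bucket/data/path/data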

Related

Multiprocessing In Django Function

Is it possible to use multiprocessing in Django on a request?
#so if I send a request to http://127.0.0.1:8000/wallet_verify
def wallet_verify(request):
    walelts = botactive.objects.all()
    #here I check if the user wants to be included in the process or not, so if they set it to True then I'll include them, else ignore.
    for active in walelts:
        check_active = active.active
        if check_active == True:
            user_is_active = active.user
            #for the ones that want to be included I then go to get their key data.
            #I need to get both api key and secret, so I loop through to get the data for the active users.
            database = Bybitapidatas.objects.filter(user=user_is_active)
            for apikey in database:
                apikey = apikey.apikey
            for apisecret in database:
                apisecret = apisecret.apisecret
            #since I am making a request to an exchange endpoint I can only include one API key and secret at a time. So one person at a time, and this is why I want to run in parallel.
            for a, b in zip(list(Bybitapidatas.objects.filter(user=user_is_active).values("apikey")), list(Bybitapidatas.objects.filter(user=user_is_active).values("apisecret"))):
                session = spot.HTTP(endpoint='https://api-testnet.bybit.com/', api_key=a['apikey'], api_secret=b['apisecret'])
                #here I check to see if they have balance to open trades if they have selected to be included.
                GET_USDT_BALANCE = session.get_wallet_balance()['result']['balances']
                for i in GET_USDT_BALANCE:
                    if 'USDT' in i.values():
                        GET_USDT_BALANCE = session.get_wallet_balance()['result']['balances']
                        idx_USDT = GET_USDT_BALANCE.index(i)
                        GET_USDTBALANCE = session.get_wallet_balance()['result']['balances'][idx_USDT]['free']
                        print(round(float(GET_USDTBALANCE), 2))
                        #if they don't have enough balance I skip the user.
                        if round(float(GET_USDTBALANCE), 2) < 11:
                            pass
                        else:
                            session.place_active_order(
                                symbol="BTCUSDT",
                                side="Buy",
                                type="MARKET",
                                qty=10,
                                timeInForce="GTC"
                            )
How can I run this process in parallel while looping through the database to also get the data for each individual user?
I am still new to coding, so I hope I explained it in a way that makes sense.
I have tried multiprocessing and pools, but then I get an error that the app has not started yet and that I have to run it outside of wallet_verify. Is there a way to do it in wallet_verify
when I send the POST request?
Any help appreciated.
Filtering the Database to get Users who have set it to True
listi = [1, 3] (these are the user IDs returned)
processess = botactive.objects.filter(active=True).values_list('user')
listi = [row[0] for row in processess]
Get the Users from the listi and perform the action.
def wallet_verify(listi):
    # print(listi)
    database = Bybitapidatas.objects.filter(user=listi)
    print("---------------------------------------------------- START")
    for apikey in database:
        apikey = apikey.apikey
        print(apikey)
    for apisecret in database:
        apisecret = apisecret.apisecret
        print(apisecret)
    start_time = time.time()
    session = spot.HTTP(endpoint='https://api-testnet.bybit.com/', api_key=apikey, api_secret=apisecret)
    GET_USDT_BALANCE = session.get_wallet_balance()['result']['balances']
    for i in GET_USDT_BALANCE:
        if 'USDT' in i.values():
            GET_USDT_BALANCE = session.get_wallet_balance()['result']['balances']
            idx_USDT = GET_USDT_BALANCE.index(i)
            GET_USDTBALANCE = session.get_wallet_balance()['result']['balances'][idx_USDT]['free']
            print(round(float(GET_USDTBALANCE), 2))
            if round(float(GET_USDTBALANCE), 2) < 11:
                pass
            else:
                session.place_active_order(
                    symbol="BTCUSDT",
                    side="Buy",
                    type="MARKET",
                    qty=10,
                    timeInForce="GTC"
                )
    print("My program took", time.time() - start_time, "to run")
    print("---------------------------------------------------- END")
    return HttpResponse("Wallets verified")
verifyt is what I use for the multiprocessing, since I don't want it to run without being requested to. The initializer also sets up the Django apps for each worker process.
def verifyt(request):
    with ProcessPoolExecutor(max_workers=4, initializer=django.setup) as executor:
        results = executor.map(wallet_verify, listi)
    return HttpResponse("done")

Dynamically set bigquery table id in dataflow pipeline

I have a Dataflow pipeline written in Python, and this is what it does:
Read messages from Pub/Sub. The messages are zipped protocol buffers. One message received on Pub/Sub contains multiple types of messages. See the parent protocol message specification below:
message BatchEntryPoint {
  /**
   * EntryPoint
   *
   * Description: Encapsulation message
   */
  message EntryPoint {
    // Proto Message
    google.protobuf.Any proto = 1;
    // Timestamp
    google.protobuf.Timestamp timestamp = 4;
  }
  // Array of EntryPoint messages
  repeated EntryPoint entrypoints = 1;
}
So, to explain a bit better, I have several protobuf messages. Each message must be packed into the proto field of the EntryPoint message. We are sending several messages at once because of MQTT limitations, which is why we use a repeated field of EntryPoint messages in BatchEntryPoint.
Parse the received messages.
Nothing fancy here, just unzipping and deserializing the message we just read from Pub/Sub to get 'human readable' data (a sketch of this step follows after this list).
Loop over BatchEntryPoint to evaluate each EntryPoint message.
As each message in a BatchEntryPoint can have a different type, we need to process them differently.
Parse the message data.
Apply different processing to get all the information I need and format it into a BigQuery-readable format.
Write the data to BigQuery.
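A minimal sketch of that unpack step, assuming the payload is zlib-compressed and that batch_pb2 / system_pb2 are the generated modules for BatchEntryPoint and one concrete message type (these module and message names are assumptions, not taken from the question):
import zlib
from my_protos import batch_pb2, system_pb2  # hypothetical generated modules

def parse_batch(payload):
    batch = batch_pb2.BatchEntryPoint()
    batch.ParseFromString(zlib.decompress(payload))  # unzip + deserialize
    for entry in batch.entrypoints:
        # Any.Is / Any.Unpack let us dispatch on the concrete message type
        if entry.proto.Is(system_pb2.System.DESCRIPTOR):
            msg = system_pb2.System()
            entry.proto.Unpack(msg)
            yield msg, entry.timestamp.ToDatetime()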
This is where my 'trouble' begins: my code works, but in my opinion it is very dirty and hard to maintain.
There are two things to be aware of.
Each message type can be sent to 3 different datasets: an R&D dataset, a dev dataset, and a production dataset.
Let's say I have a message named System.
It could go to:
my-project:rd_dataset.system
my-project:dev_dataset.system
my-project:prod_dataset.system
So this is what I am doing now:
console_records | 'Write to Console BQ' >> beam.io.WriteToBigQuery(
    lambda e: 'my-project:rd_dataset.table1' if dataset_is_rd_table1(e) else (
        'my-project:dev_dataset.table1' if dataset_is_dev_table1(e) else (
            'my-project:prod_dataset.table1' if dataset_is_prod_table1(e) else (
                'my-project:rd_dataset.table2' if dataset_is_rd_table2(e) else (
                    'my-project:dev_dataset.table2' if dataset_is_dev_table2(e) else (
                        ...) else 0
I have more than 30 different types of messages, which makes for more than 90 lines just for inserting data into BigQuery.
Here is what a dataset_is_..._tableX method looks like:
def dataset_is_rd_messagestype(element) -> bool:
    """ check if env is rd for message's type message """
    valid: bool = False
    is_type = check_element_type(element, 'MessagesType')
    if is_type:
        valid = dataset_is_rd(element)
    return valid
check_element_type checks that the message has the right type (e.g. System).
dataset_is_rd looks like this:
def dataset_is_rd(element) -> bool:
    """ Check if dataset should be RD from registry id """
    if element['device_registry_id'] == 'rd':
        del element['device_registry_id']
        del element['bq_type']
        return True
    return False
The element has a key indicating which dataset we must send the message to.
So this is working as expected, but I wish I could write cleaner code and maybe reduce the amount of code to change when adding or deleting a type of message.
Any ideas?
How about using TaggedOutput?
Can you write something like this instead:
def dataset_type(element) -> str:
    """ Check if dataset should be RD from registry id """
    dev_registry = element['device_registry_id']
    del element['device_registry_id']
    del element['bq_type']
    table_type = get_element_type(element, 'MessagesType')
    return 'my-project:%s_dataset.table%d' % (dev_registry, table_type)
And use that as the table lambda that you pass to BQ?
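For reference, wiring that function into the sink would look roughly like this (a sketch only: records stands for the PCollection being written, and the dispositions shown are examples; WriteToBigQuery accepts a callable that is invoked per element to compute the destination table):
records | 'Write to BQ' >> beam.io.WriteToBigQuery(
    table=dataset_type,  # called for each element to pick the table
    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
    create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER)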
So I managed to create code that inserts data into dynamic tables by crafting the table name dynamically.
This is not perfect, because I have to modify the element I pass to the method; however, I am still very happy with the result, as it has cleaned up hundreds of lines of my code. If I add a new table, it takes one line in an array compared to 6 lines in the pipeline before.
Here is my solution:
def batch_pipeline(pipeline):
    console_message = (
        pipeline
        | 'Get console\'s message from pub/sub' >> beam.io.ReadFromPubSub(
            subscription='sub1',
            with_attributes=True)
    )
    common_message = (
        pipeline
        | 'Get common\'s message from pub/sub' >> beam.io.ReadFromPubSub(
            subscription='sub2',
            with_attributes=True)
    )
    jetson_message = (
        pipeline
        | 'Get jetson\'s message from pub/sub' >> beam.io.ReadFromPubSub(
            subscription='sub3',
            with_attributes=True)
    )

    message = (console_message, common_message, jetson_message) | beam.Flatten()
    clear_message = message | beam.ParDo(GetClearMessage())
    console_bytes = clear_message | beam.ParDo(SetBytesData())
    console_bytes | 'Write to big query back up table' >> beam.io.WriteToBigQuery(
        lambda e: write_to_backup(e)
    )
    records = clear_message | beam.ParDo(GetProtoData())
    gps_records = clear_message | 'Get GPS Data' >> beam.ParDo(GetProtoData())
    parsed_gps = gps_records | 'Parse GPS Data' >> beam.ParDo(ParseGps())
    if parsed_gps:
        parsed_gps | 'Write to big query gps table' >> beam.io.WriteToBigQuery(
            lambda e: write_gps(e)
        )
    records | 'Write to big query table' >> beam.io.WriteToBigQuery(
        lambda e: write_to_bq(e)
    )
So the pipeline is reading from 3 different Pub/Sub subscriptions, extracting the data, and writing to BigQuery.
The structure of an element used by WriteToBigQuery looks like this:
obj = {
    'data': data_to_write_on_bq,
    'registry_id': data_needed_to_craft_table_name,
    'gcloud_id': data_to_write_on_bq,
    'proto_type': data_needed_to_craft_table_name
}
and then one of the methods used in the lambda passed to WriteToBigQuery looks like this:
def write_to_bq(e):
    logging.info(e)
    element = copy(e)
    registry = element['registry_id']
    logging.info(registry)
    dataset = set_dataset(registry)  # set dataset name, knowing the registry; this is to set the environment (dev/prod/rd/...)
    proto_type = element['proto_type']
    logging.info('Proto Type %s', proto_type)
    # convert the CamelCase proto type to a snake_case table name
    table_name = reduce(lambda x, y: x + ('_' if y.isupper() else '') + y, proto_type).lower()
    full_table_name = f'my_project:{dataset}.{table_name}'
    logging.info(full_table_name)
    del e['registry_id']
    del e['proto_type']
    return full_table_name
And that's it, after 3 days of trouble!

Fetching data in realtime from database in python

I have a class for multiprocessing in Python which creates 3 different processes. The first process checks if there is any signal from my hardware and pushes it into a Queue, the second process gets the data out of the Queue and pushes it into a database, and the third process gets the data out of the database and pushes it to a server.
obj = QE()
stdFunct = standardFunctions()
watchDogProcess = multiprocessing.Process(target=obj.watchDog)
watchDogProcess.start()
pushToDBSProcess = multiprocessing.Process(target=obj.pushToDBS)
pushToDBSProcess.start()
pushToCloud = multiprocessing.Process(target=stdFunct.uploadCycleTime)
pushToCloud.start()
watchDogProcess.join()
pushToDBSProcess.join()
pushToCloud.join()
My first two processes are running perfectly as desired; however, I am struggling with the third process. The following is the code of my third process:
def uploadCycleTime(self):
    while True:
        uploadCycles = []
        lastUpPointer = "SELECT id FROM lastUploaded"
        lastUpPointer = self.dbFetchone(lastUpPointer)
        lastUpPointer = lastUpPointer[0]
        # print("lastUploaded :"+str(lastUpPointer))
        cyclesToUploadSQL = "SELECT id,machineId,startDateTime,endDateTime,type FROM cycletimes WHERE id > "+str(lastUpPointer)
        cyclesToUpload = self.dbfetchMany(cyclesToUploadSQL,15)
        cyclesUploadLength = len(cyclesToUpload)
        if(cyclesUploadLength>0):
            for cycles in cyclesToUpload:
                uploadCycles.append({"dataId":cycles[0],"machineId":cycles[1],"startDateTime":cycles[2].strftime('%Y-%m-%d %H:%M:%S.%f'),"endDateTime":cycles[3].strftime('%Y-%m-%d %H:%M:%S.%f'),"type":cycles[4]})
            # print("length : "+str(cyclesUploadLength))
            lastUpPointer = uploadCycles[cyclesUploadLength-1]["dataId"]
            uploadCycles = json.dumps(uploadCycles)
            api = self.dalUrl+"/cycle-times"
            uploadResponse = self.callPostAPI(api,str(uploadCycles))
            print(lastUpPointer)
            changePointerSQL = "UPDATE lastUploaded SET id="+str(lastUpPointer)
            try:
                changePointerSQL = self.dbAbstraction(changePointerSQL)
            except Exception as errorPointer:
                print("Pointer change Error : "+str(errorPointer))
        time.sleep(2)
Now I am saving a pointer to remember the last id uploaded, and from there on I keep uploading 15 packets at a time. When there is already data in the DB the code works well; however, if there is no data when the process is started and data is inserted afterwards, it fails to fetch the new data from the DB.
I tried printing the length in real time, and it keeps giving me 0, in spite of data being continuously pushed into the DB in real time.
In my upload process, I had missed out on a commit(). (Most likely the reading connection kept an open transaction and, under the default REPEATABLE READ isolation of e.g. MySQL/InnoDB, kept seeing the snapshot from its first read; committing after each fetch ends that transaction so new rows become visible.)
def dbFetchAll(self,dataString):
    # dataToPush = self.cycletimeQueue.get()
    # print(dataToPush)
    dbTry = 1
    try:
        while(dbTry == 1): # This while is to ensure the data has been pushed
            sql = dataString
            self.conn.execute(sql)
            response = self.conn.fetchall()
            dbTry = 0
            return response
            # print(self.conn.rowcount, "record inserted.")
    except Exception as error:
        print ("Error : "+str(error))
        return dbTry
    finally:
        self.mydb.commit()  # <-- this was the missing commit

Slowly Changing Lookup Cache from BigQuery - Dataflow Python Streaming SDK

I am trying to follow the design pattern for Slowly Changing Lookup Cache (https://cloud.google.com/blog/products/gcp/guide-to-common-cloud-dataflow-use-case-patterns-part-1) for a streaming pipeline using the Python SDK for Apache Beam on DataFlow.
Our reference table for the lookup cache sits in BigQuery, and we are able to read and pass it in as a Side Input to the ParDo operation but it does not refresh regardless of how we set up the trigger/windows.
class FilterAlertDoFn(beam.DoFn):
    def process(self, element, alertlist):
        print len(alertlist)
        print alertlist
        ...  # function logic

alert_input = (p | beam.io.Read(beam.io.BigQuerySource(query=ALERT_QUERY))
                 | 'alert_side_input' >> beam.WindowInto(
                       beam.window.GlobalWindows(),
                       trigger=trigger.RepeatedlyTrigger(trigger.AfterWatermark(
                           late=trigger.AfterCount(1)
                       )),
                       accumulation_mode=trigger.AccumulationMode.ACCUMULATING
                   )
                 | beam.Map(lambda elem: elem['SOMEKEY'])
               )
...
main_input | 'alerts' >> beam.ParDo(FilterAlertDoFn(), beam.pvalue.AsList(alert_input))
Based on the I/O page here (https://beam.apache.org/documentation/io/built-in/) it says the Python SDK supports streaming for the BigQuery sink only. Does that mean that BQ reads are a bounded source and therefore can't be refreshed in this method?
Trying to set non-global windows on the source results in an empty PCollection in the Side Input.
UPDATE:
When trying to implement the strategy suggested by Pablo's answer, the ParDo operation that uses the side input won't run.
There is a single input source that goes to two outputs, one of them using the side input. The non-side-input path still reaches its destination, but the side-input pipeline won't enter FilterAlertDoFn().
By substituting the side input for a dummy value the pipeline will enter the function. Is it perhaps waiting for a suitable window that doesn't exist?
With the same FilterAlertDoFn() as above, my side_input and call now look like this:
def refresh_side_input(_):
    query = 'select col from table'
    client = bigquery.Client(project='gcp-project')
    query_job = client.query(query)
    return query_job.result()

trigger_input = (p | 'alert_ref_trigger' >> beam.io.ReadFromPubSub(
    subscription=known_args.trigger_subscription))

bigquery_side_input = beam.pvalue.AsSingleton((trigger_input
    | beam.WindowInto(beam.window.GlobalWindows(),
                      trigger=trigger.Repeatedly(trigger.AfterCount(1)),
                      accumulation_mode=trigger.AccumulationMode.DISCARDING)
    | beam.Map(refresh_side_input)
))
...
# Passing this as side input doesn't work
main_input | 'alerts' >> beam.ParDo(FilterAlertDoFn(), bigquery_side_input)
# Passing dummy variable as side input does work
main_input | 'alerts' >> beam.ParDo(FilterAlertDoFn(), [1])
I tried a few different versions of refresh_side_input(); they report the expected result when checking the return value inside the function.
UPDATE 2:
I made some minor modifications to Pablo's code, and I get the same behaviour - the DoFn never executes.
In the example below I will see 'in_load_conversion_data' whenever I post to some_other_topic, but I will never see 'in_DoFn' when posting to some_topic.
import apache_beam as beam
import apache_beam.transforms.window as window
from apache_beam.transforms import trigger
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.pipeline_options import SetupOptions
from apache_beam.options.pipeline_options import StandardOptions

def load_my_conversion_data():
    return {'EURUSD': 1.1, 'USDMXN': 4.4}

def load_conversion_data(_):
    # I will suppose that these are currency conversions. E.g.
    # {'EURUSD': 1.1, 'USDMXN': 20, ...}
    print 'in_load_conversion_data'
    return load_my_conversion_data()

class ConvertTo(beam.DoFn):
    def __init__(self, target_currency):
        self.target_currency = target_currency

    def process(self, elm, rates):
        print 'in_DoFn'
        elm = elm.attributes
        if elm['currency'] == self.target_currency:
            yield elm
        elif '%s%s' % (elm['currency'], self.target_currency) in rates:
            rate = rates['%s%s' % (elm['currency'], self.target_currency)]
            result = dict(elm)
            result.update({'currency': self.target_currency,
                           'value': elm['value'] * rate})
            yield result
        else:
            return  # We drop that value

pipeline_options = PipelineOptions()
pipeline_options.view_as(StandardOptions).streaming = True
p = beam.Pipeline(options=pipeline_options)

some_topic = 'projects/some_project/topics/some_topic'
some_other_topic = 'projects/some_project/topics/some_other_topic'

with beam.Pipeline(options=pipeline_options) as p:
    table_pcv = beam.pvalue.AsSingleton((
        p
        | 'some_other_topic' >> beam.io.ReadFromPubSub(topic=some_other_topic, with_attributes=True)
        | 'some_other_window' >> beam.WindowInto(window.GlobalWindows(),
                                                 trigger=trigger.Repeatedly(trigger.AfterCount(1)),
                                                 accumulation_mode=trigger.AccumulationMode.DISCARDING)
        | beam.Map(load_conversion_data)))

    _ = (p | 'some_topic' >> beam.io.ReadFromPubSub(topic=some_topic)
           | 'some_window' >> beam.WindowInto(window.FixedWindows(1))
           | beam.ParDo(ConvertTo('USD'), rates=table_pcv))
As you well point out, the Java SDK allows you to use more streaming utilities such as timers and state. These utilities help the implementation of pipelines like these.
The Python SDK lacks some of these utilities, and specifically timers. For that reason, we need to use a hack, where the reload of the side input can be triggered by inserting messages into our some_other_topic in PubSub.
This also means that you have to manually perform a lookup into BigQuery. You can probably use the apache_beam.io.gcp.bigquery_tools.BigQueryWrapper class to perform lookups directly into BigQuery.
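For completeness, the reload trigger itself can be published from anywhere (Cloud Scheduler, a cron job, an ops script); a minimal sketch with the google-cloud-pubsub client, reusing the placeholder project and topic names from this thread:
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path('some_project', 'some_other_topic')
# The message body is irrelevant; its arrival is what triggers the reload.
publisher.publish(topic_path, b'refresh').result()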
Here is an example of a pipeline that refreshes some currency conversion data. I haven't tested it, but I'm 90% sure it'll work with only a few adjustments. Let me know if this helps.
pipeline_options = PipelineOptions()
p = beam.Pipeline(options=pipeline_options)

def load_conversion_data(_):
    # I will suppose that these are currency conversions. E.g.
    # {'EURUSD': 1.1, 'USDMXN': 20, ...}
    return external_service.load_my_conversion_data()

table_pcv = beam.pvalue.AsSingleton((
    p
    | beam.io.ReadFromPubSub(topic=some_other_topic)
    | beam.WindowInto(window.GlobalWindows(),
                      trigger=trigger.Repeatedly(trigger.AfterCount(1)),
                      accumulation_mode=trigger.AccumulationMode.DISCARDING)
    | beam.Map(load_conversion_data)))

class ConvertTo(beam.DoFn):
    def __init__(self, target_currency):
        self.target_currency = target_currency

    def process(self, elm, rates):
        if elm['currency'] == self.target_currency:
            yield elm
        elif '%s%s' % (elm['currency'], self.target_currency) in rates:
            rate = rates['%s%s' % (elm['currency'], self.target_currency)]
            result = dict(elm)
            result.update({'currency': self.target_currency,
                           'value': elm['value'] * rate})
            yield result
        else:
            return  # We drop that value

_ = (p
     | beam.io.ReadFromPubSub(topic=some_topic)
     | beam.WindowInto(window.FixedWindows(1))
     | beam.ParDo(ConvertTo('USD'), rates=table_pcv))

Is it possible to have both Pub/Sub and BigQuery as inputs in Google Dataflow?

In my project, I am looking to use a streaming pipeline in Google Dataflow in order to process Pub/Sub messages. In cleaning the input data, I am looking to also have a side input from BigQuery. This has presented a problem that will cause one of the two inputs to not work.
I have set streaming=True in my pipeline options, which allows the Pub/Sub input to process properly. But BigQuery is not compatible with streaming pipelines (see the link below):
https://cloud.google.com/dataflow/docs/resources/faq#what_are_the_current_limitations_of_streaming_mode
I received this error: "ValueError: Cloud Pub/Sub is currently available for use only in streaming pipelines." This is understandable based on the limitations.
But I am only looking to use BigQuery as a side input in order to map data to the incoming Pub/Sub data stream. It works fine locally, but once I try to run it on Dataflow, it returns the error.
Has anyone found a good workaround for this?
EDIT: adding the framework of my pipeline below for reference:
# Set all options needed to properly run the pipeline
options = PipelineOptions(streaming=True,
                          runner='DataflowRunner',
                          project=project_id)
p = beam.Pipeline(options=options)

n_tbl_src = (p
             | 'Nickname Table Read' >> beam.io.Read(beam.io.BigQuerySource(
                 table=nickname_spec
             )))

# This is the main Dataflow pipeline. This will clean the incoming dataset for importing into BQ.
clean_vote = (p
              | beam.io.gcp.pubsub.ReadFromPubSub(topic=None,
                                                  subscription='projects/{0}/subscriptions/{1}'
                                                  .format(project_id, subscription_name),
                                                  with_attributes=True)
              | 'Isolate Attributes' >> beam.ParDo(IsolateAttrFn())
              | 'Fix Value Types' >> beam.ParDo(FixTypesFn())
              | 'Scrub First Name' >> beam.ParDo(ScrubFnameFn())
              | 'Fix Nicknames' >> beam.ParDo(FixNicknameFn(), n_tbl=AsList(n_tbl_src))
              | 'Scrub Last Name' >> beam.ParDo(ScrubLnameFn()))

# The final dictionary will then be written to BigQuery for storage
(clean_vote | 'Write to BQ' >> beam.io.WriteToBigQuery(
    table=bq_spec,
    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
    create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER
))

# Run the pipeline
p.run()
@Pablo's comment above was the correct answer. For anyone working through the same situation, below is the change in my script that worked.
# This opens the Beam pipeline to run Dataflow
p = beam.Pipeline(options=options)
logging.info('Created Dataflow pipeline.')

# This will pull in all of the recorded nicknames to compare to the incoming PubSubMessages.
client = bigquery.Client()
query_job = client.query("""
    select * from `{0}.{1}.{2}`""".format(project_id, dataset_id, nickname_table_id))
nickname_tbl = query_job.result()
nickname_tbl = [dict(row.items()) for row in nickname_tbl]

# This is the main Dataflow pipeline. This will clean the incoming dataset for importing into BQ.
clean_vote = (p
              | beam.io.gcp.pubsub.ReadFromPubSub(topic=None,
                                                  subscription='projects/{0}/subscriptions/{1}'
                                                  .format(project_id, subscription_name),
                                                  with_attributes=True)
              | 'Isolate Attributes' >> beam.ParDo(IsolateAttrFn())
              | 'Fix Value Types' >> beam.ParDo(FixTypesFn())
              | 'Scrub First Name' >> beam.ParDo(ScrubFnameFn())
              | 'Fix Nicknames' >> beam.ParDo(FixNicknameFn(), n_tbl=nickname_tbl)
              | 'Scrub Last Name' >> beam.ParDo(ScrubLnameFn()))

# The final dictionary will then be written to BigQuery for storage
(clean_vote | 'Write to BQ' >> beam.io.WriteToBigQuery(
    table=bq_spec,
    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
    create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER
))

# Run the pipeline
p.run()
