Python Apache Beam: write results with given file names

I have a custom function that gets input from files, and I want to write the function results to files corresponding to source files.
The code is like:
the custom function:
def func(_file):
    result = ...
    return (_file.metadata.path, result)
the pipeline:
pcoll_of_file_pattern
    | "Search files" >> MatchAll()
    | "Read files" >> ReadMatches()
    | 'Apply function to files' >> beam.Map(func)
    | beam.Map( ??? )
The names of the source files are needed to create the result files, but WriteToText does not accept the output of func, which is (source_file_name, function_result).
Say I have files test-1.txt, test-2.txt, ..., test-10.txt; the search pattern would be test-*.txt. What I want is to save the results as result-test-1.txt, result-test-2.txt, ..., result-test-10.txt.
How can I achieve that?

Not 100% sure this is what you want, but this may be helpful as an example:
import apache_beam as beam
from datetime import datetime
from apache_beam.io import filesystems

def fake_processing(element):
    now_str = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
    return (element, now_str)

class WriteToSeparateFiles(beam.DoFn):
    def process(self, element):
        path = element[0]
        writer = filesystems.FileSystems.create(f'Output/{path}.txt')
        writer.write(bytes(element[1], 'utf-8'))
        writer.close()

with beam.Pipeline() as p:
    (p | "Create" >> beam.Create(['file1', 'file2'])
       | beam.Map(fake_processing)
       | "WriteToFileDynamic" >> beam.ParDo(WriteToSeparateFiles()))

Related

PubSub to Cloud Storage using DataFlow is slow

I am using a slightly adjusted version of the following example snippet to write PubSub messages to GCS:
class WriteToGCS(beam.DoFn):
    def __init__(self, output_path, prefix):
        self.output_path = output_path
        self.prefix = prefix

    def process(self, key_value, window=beam.DoFn.WindowParam):
        """Write messages in a batch to Google Cloud Storage."""
        start_date = window.start.to_utc_datetime().date().isoformat()
        start = window.start.to_utc_datetime().isoformat()
        end = window.end.to_utc_datetime().isoformat()
        shard_id, batch = key_value
        filename = f'{self.output_path}/{start_date}/{start}-{end}/{self.prefix}-{start}-{end}-{shard_id:03d}'
        with beam.io.gcsio.GcsIO().open(filename=filename, mode="w") as f:
            for message_body in batch:
                f.write(f"{message_body},".encode("utf-8"))
However, it's terribly slow: in the job graph this step processes only 3-10 elements per second, while the subscription receives about 500 elements per second, so it is not keeping up. Is there a way to speed up this step?
The pipeline looks like this:
class JsonWriter(beam.PTransform):
    def __init__(self, window_size, path, prefix, num_shards=20):
        self.window_size = int(window_size)
        self.path = path
        self.prefix = prefix
        self.num_shards = num_shards

    def expand(self, pcoll):
        return (
            pcoll
            | "Group into fixed windows" >> beam.WindowInto(window.FixedWindows(self.window_size, 0))
            | "Decode windowed elements" >> beam.ParDo(Decode())
            | "Group into batches" >> beam.BatchElements()
            | "Add key" >> beam.WithKeys(lambda _: random.randint(0, int(self.num_shards) - 1))
            | "Write to GCS" >> beam.ParDo(WriteToGCS(self.path, self.prefix))
        )
Solved this with these steps:
| "Add Path" >> beam.ParDo(AddFilename(self.path, self.prefix, self.num_shards))
| "Group Paths" >> beam.GroupByKey()
| "Join and encode" >> beam.ParDo(ToOneString())
| "Write to GCS" >> beam.ParDo(WriteToGCS())
First I add the path per element, then I group by path (to prevent simultaneous writes to the same file/shard). Then I concatenate the JSON strings per group (path), newline-delimited, and write the whole string to the file in one go. This reduces the calls to GCS tremendously, since I don't write every element to a file one by one, but instead join them first and write once. Hope it helps someone.
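As a rough, untested sketch of the approach described above (only the step names come from the answer; the bodies of AddFilename, ToOneString and the rewritten WriteToGCS are assumptions):

import json
import random
import apache_beam as beam
from apache_beam.io import filesystems

class AddFilename(beam.DoFn):
    """Key every element by its target path, so all records for one file end up in one group."""
    def __init__(self, path, prefix, num_shards):
        self.path = path
        self.prefix = prefix
        self.num_shards = num_shards

    def process(self, element, window=beam.DoFn.WindowParam):
        start = window.start.to_utc_datetime().isoformat()
        end = window.end.to_utc_datetime().isoformat()
        shard_id = random.randint(0, self.num_shards - 1)
        filename = f'{self.path}/{self.prefix}-{start}-{end}-{shard_id:03d}'
        yield (filename, element)

class ToOneString(beam.DoFn):
    """After GroupByKey: join all JSON strings for one path into a single newline-delimited blob."""
    def process(self, keyed):
        path, records = keyed
        yield (path, '\n'.join(records))

class WriteToGCS(beam.DoFn):
    """One write call per path instead of one per element."""
    def process(self, keyed):
        path, blob = keyed
        writer = filesystems.FileSystems.create(path)
        writer.write(blob.encode('utf-8'))
        writer.close()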

Slowly Changing Lookup Cache from BigQuery - Dataflow Python Streaming SDK

I am trying to follow the design pattern for Slowly Changing Lookup Cache (https://cloud.google.com/blog/products/gcp/guide-to-common-cloud-dataflow-use-case-patterns-part-1) for a streaming pipeline using the Python SDK for Apache Beam on DataFlow.
Our reference table for the lookup cache sits in BigQuery, and we are able to read and pass it in as a Side Input to the ParDo operation but it does not refresh regardless of how we set up the trigger/windows.
class FilterAlertDoFn(beam.DoFn):
    def process(self, element, alertlist):
        print(len(alertlist))
        print(alertlist)
        ...  # function logic
alert_input = (p | beam.io.Read(beam.io.BigQuerySource(query=ALERT_QUERY))
                 | 'alert_side_input' >> beam.WindowInto(
                       beam.window.GlobalWindows(),
                       trigger=trigger.RepeatedlyTrigger(trigger.AfterWatermark(
                           late=trigger.AfterCount(1)
                       )),
                       accumulation_mode=trigger.AccumulationMode.ACCUMULATING
                   )
                 | beam.Map(lambda elem: elem['SOMEKEY'])
               )
...
main_input | 'alerts' >> beam.ParDo(FilterAlertDoFn(), beam.pvalue.AsList(alert_input))
Based on the I/O page here (https://beam.apache.org/documentation/io/built-in/), it says the Python SDK supports streaming for the BigQuery sink only. Does that mean that BQ reads are a bounded source and therefore can't be refreshed in this method?
Trying to set non-global windows on the source results in an empty PCollection in the Side Input.
UPDATE:
When trying to implement the strategy suggested by Pablo's answer, the ParDo operation that uses the side input won't run.
There is a single input source that goes to two outputs, one of them using the side input. The non-side-input branch still reaches its destination, but the side-input pipeline never enters FilterAlertDoFn().
By substituting the side input for a dummy value the pipeline will enter the function. Is it perhaps waiting for a suitable window that doesn't exist?
With the same FilterAlertDoFn() as above, my side_input and call now look like this:
def refresh_side_input(_):
    query = 'select col from table'
    client = bigquery.Client(project='gcp-project')
    query_job = client.query(query)
    return query_job.result()

trigger_input = (p | 'alert_ref_trigger' >> beam.io.ReadFromPubSub(
    subscription=known_args.trigger_subscription))

bigquery_side_input = beam.pvalue.AsSingleton((trigger_input
    | beam.WindowInto(beam.window.GlobalWindows(),
                      trigger=trigger.Repeatedly(trigger.AfterCount(1)),
                      accumulation_mode=trigger.AccumulationMode.DISCARDING)
    | beam.Map(refresh_side_input)
))
...
# Passing this as side input doesn't work
main_input | 'alerts' >> beam.ParDo(FilterAlertDoFn(), bigquery_side_input)
# Passing dummy variable as side input does work
main_input | 'alerts' >> beam.ParDo(FilterAlertDoFn(), [1])
I tried a few different versions of refresh_side_input(); they report the expected result when checking the return value inside the function.
UPDATE 2:
I made some minor modifications to Pablo's code, and I get the same behaviour - the DoFn never executes.
In the example below I see 'in_load_conversion_data' whenever I post to some_other_topic, but I never see 'in_DoFn' when posting to some_topic.
import apache_beam as beam
import apache_beam.transforms.window as window
from apache_beam.transforms import trigger
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.pipeline_options import SetupOptions
from apache_beam.options.pipeline_options import StandardOptions

def load_my_conversion_data():
    return {'EURUSD': 1.1, 'USDMXN': 4.4}

def load_conversion_data(_):
    # I will suppose that these are currency conversions. E.g.
    # {'EURUSD': 1.1, 'USDMXN': 20, ...}
    print('in_load_conversion_data')
    return load_my_conversion_data()

class ConvertTo(beam.DoFn):
    def __init__(self, target_currency):
        self.target_currency = target_currency

    def process(self, elm, rates):
        print('in_DoFn')
        elm = elm.attributes
        if elm['currency'] == self.target_currency:
            yield elm
        elif '%s%s' % (elm['currency'], self.target_currency) in rates:
            rate = rates['%s%s' % (elm['currency'], self.target_currency)]
            result = dict(elm)
            result.update({'currency': self.target_currency,
                           'value': elm['value'] * rate})
            yield result
        else:
            return  # We drop that value

pipeline_options = PipelineOptions()
pipeline_options.view_as(StandardOptions).streaming = True

some_topic = 'projects/some_project/topics/some_topic'
some_other_topic = 'projects/some_project/topics/some_other_topic'

with beam.Pipeline(options=pipeline_options) as p:
    table_pcv = beam.pvalue.AsSingleton((
        p
        | 'some_other_topic' >> beam.io.ReadFromPubSub(topic=some_other_topic, with_attributes=True)
        | 'some_other_window' >> beam.WindowInto(window.GlobalWindows(),
                                                 trigger=trigger.Repeatedly(trigger.AfterCount(1)),
                                                 accumulation_mode=trigger.AccumulationMode.DISCARDING)
        | beam.Map(load_conversion_data)))

    _ = (p | 'some_topic' >> beam.io.ReadFromPubSub(topic=some_topic)
           | 'some_window' >> beam.WindowInto(window.FixedWindows(1))
           | beam.ParDo(ConvertTo('USD'), rates=table_pcv))
As you well point out, the Java SDK allows you to use more streaming utilities such as timers and state. These utilities help with the implementation of pipelines like these.
The Python SDK lacks some of these utilities, specifically timers. For that reason, we need to use a hack, where the reload of the side input can be triggered by inserting messages into our some_other_topic in PubSub.
This also means that you have to manually perform a lookup into BigQuery. You can probably use the apache_beam.io.gcp.bigquery_tools.BigQueryWrapper class to perform lookups directly into BigQuery.
Here is an example of a pipeline that refreshes some currency conversion data. I haven't tested it, but I'm 90% sure it'll work with only a few adjustments. Let me know if this helps.
pipeline_options = PipelineOptions()
p = beam.Pipeline(options=pipeline_options)

def load_conversion_data(_):
    # I will suppose that these are currency conversions. E.g.
    # {'EURUSD': 1.1, 'USDMXN': 20, ...}
    return external_service.load_my_conversion_data()

table_pcv = beam.pvalue.AsSingleton((
    p
    | beam.io.ReadFromPubSub(topic=some_other_topic)
    | beam.WindowInto(window.GlobalWindows(),
                      trigger=trigger.Repeatedly(trigger.AfterCount(1)),
                      accumulation_mode=trigger.AccumulationMode.DISCARDING)
    | beam.Map(load_conversion_data)))

class ConvertTo(beam.DoFn):
    def __init__(self, target_currency):
        self.target_currency = target_currency

    def process(self, elm, rates):
        if elm['currency'] == self.target_currency:
            yield elm
        elif '%s%s' % (elm['currency'], self.target_currency) in rates:
            rate = rates['%s%s' % (elm['currency'], self.target_currency)]
            result = dict(elm)
            result.update({'currency': self.target_currency,
                           'value': elm['value'] * rate})
            yield result
        else:
            return  # We drop that value

_ = (p
     | beam.io.ReadFromPubSub(topic=some_topic)
     | beam.WindowInto(window.FixedWindows(1))
     | beam.ParDo(ConvertTo('USD'), rates=table_pcv))
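As a hedged sketch of what the manual BigQuery lookup behind load_my_conversion_data() could look like, using the google-cloud-bigquery client the question already uses in refresh_side_input (the project, dataset, table, and column names are placeholders, not from the answer):

from google.cloud import bigquery

def load_my_conversion_data():
    # Query the reference table and build the {'EURUSD': 1.1, ...} dict
    # that ConvertTo expects as its side input.
    client = bigquery.Client(project='gcp-project')
    rows = client.query('SELECT pair, rate FROM my_dataset.conversion_rates').result()
    return {row.pair: row.rate for row in rows}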

Dataflow/apache beam - how to access current filename when passing in pattern?

I have seen this question answered before on stack overflow (https://stackoverflow.com/questions/29983621/how-to-get-filename-when-using-file-pattern-match-in-google-cloud-dataflow), but not since apache beam has added splittable dofn functionality for python. How would I access the filename of the current file being processed when passing in a file pattern to a gcs bucket?
I want to pass the filename into my transform function:
with beam.Pipeline(options=pipeline_options) as p:
    lines = p | ReadFromText('gs://url to file')
    data = (
        lines
        | 'Jsonify' >> beam.Map(jsonify)
        | 'Unnest' >> beam.FlatMap(unnest)
        | 'Write to BQ' >> beam.io.Write(beam.io.BigQuerySink(
            'project_id:dataset_id.table_name', schema=schema,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))
    )
Ultimately, what I want to do is pass the filename into my transform function when I transform each row of the JSON, and then use the filename to do a lookup in a different BQ table to get a value. I think once I know how to get the filename I will be able to figure out the side input part in order to do the lookup in the BQ table and get the unique value.
I tried to implement a solution based on the previously cited case. There, as well as in other approaches such as this one, they also get a list of file names, but they load the whole file into a single element, which might not scale well with large files. Therefore, I looked into adding the filename to each record.
As input I used two csv files:
$ gsutil cat gs://$BUCKET/countries1.csv
id,country
1,sweden
2,spain
$ gsutil cat gs://$BUCKET/countries2.csv
id,country
3,italy
4,france
Using GCSFileSystem.match we can access metadata_list to retrieve FileMetadata containing the file path and size in bytes. In my example:
[FileMetadata(gs://BUCKET_NAME/countries1.csv, 29),
FileMetadata(gs://BUCKET_NAME/countries2.csv, 29)]
The code is:
result = [m.metadata_list for m in gcs.match(['gs://{}/countries*'.format(BUCKET)])]
We will read each of the matching files into a different PCollection. As we don't know the number of files a priori we need to create programmatically a list of names for each PCollection (p0, p1, ..., pN-1) and ensure that we have unique labels for each step ('Read file 0', 'Read file 1', etc.):
variables = ['p{}'.format(i) for i in range(len(result))]
read_labels = ['Read file {}'.format(i) for i in range(len(result))]
add_filename_labels = ['Add filename {}'.format(i) for i in range(len(result))]
Then we proceed to read each different file into its corresponding PCollection with ReadFromText and then we call the AddFilenamesFn ParDo to associate each record with the filename.
for i in range(len(result)):
globals()[variables[i]] = p | read_labels[i] >> ReadFromText(result[i].path) | add_filename_labels[i] >> beam.ParDo(AddFilenamesFn(), result[i].path)
where AddFilenamesFn is:
class AddFilenamesFn(beam.DoFn):
    """ParDo to output a dict with filename and row"""
    def process(self, element, file_path):
        file_name = file_path.split("/")[-1]
        yield {'filename': file_name, 'row': element}
My first approach was using a Map function directly, which results in simpler code. However, the lambda captures the loop variable, so result[i].path was only resolved at the end of the loop and each record was incorrectly mapped to the last file of the list:
globals()[variables[i]] = p | read_labels[i] >> ReadFromText(result[i].path) | add_filename_labels[i] >> beam.Map(lambda elem: (result[i].path, elem))
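A hedged sketch (not part of the original answer) of how the Map-based version could be kept while avoiding the late-binding problem: bind the current path as a default argument so each lambda captures its own value rather than the shared loop variable.

for i in range(len(result)):
    globals()[variables[i]] = (
        p
        | read_labels[i] >> ReadFromText(result[i].path)
        # path=result[i].path is evaluated now, at lambda definition time
        | add_filename_labels[i] >> beam.Map(
            lambda elem, path=result[i].path: (path, elem)))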
Finally, we flatten all the PCollections into one:
merged = [globals()[variables[i]] for i in range(len(result))] | 'Flatten PCollections' >> beam.Flatten()
and we check the results by logging the elements:
INFO:root:{'filename': u'countries2.csv', 'row': u'id,country'}
INFO:root:{'filename': u'countries2.csv', 'row': u'3,italy'}
INFO:root:{'filename': u'countries2.csv', 'row': u'4,france'}
INFO:root:{'filename': u'countries1.csv', 'row': u'id,country'}
INFO:root:{'filename': u'countries1.csv', 'row': u'1,sweden'}
INFO:root:{'filename': u'countries1.csv', 'row': u'2,spain'}
I tested this with both DirectRunner and DataflowRunner for Python SDK 2.8.0.
I hope this addresses the main issue here and you can continue by integrating BigQuery into your full use case now. You might need to use the Python Client Library for that, I wrote a similar Java example.
Full code:
import argparse, logging
from functools import reduce  # needed on Python 3, where reduce is no longer a builtin
from operator import add

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.io import ReadFromText
from apache_beam.io.filesystem import FileMetadata
from apache_beam.io.filesystem import FileSystem
from apache_beam.io.gcp.gcsfilesystem import GCSFileSystem

class GCSFileReader:
    """Helper class to read gcs files"""
    def __init__(self, gcs):
        self.gcs = gcs

class AddFilenamesFn(beam.DoFn):
    """ParDo to output a dict with filename and row"""
    def process(self, element, file_path):
        file_name = file_path.split("/")[-1]
        # yield (file_name, element)  # use this to return a tuple instead
        yield {'filename': file_name, 'row': element}

# just logging output to visualize results
def write_res(element):
    logging.info(element)
    return element

def run(argv=None):
    parser = argparse.ArgumentParser()
    known_args, pipeline_args = parser.parse_known_args(argv)

    p = beam.Pipeline(options=PipelineOptions(pipeline_args))
    gcs = GCSFileSystem(PipelineOptions(pipeline_args))
    gcs_reader = GCSFileReader(gcs)

    # in my case I am looking for files that start with 'countries'
    BUCKET = 'BUCKET_NAME'
    result = [m.metadata_list for m in gcs.match(['gs://{}/countries*'.format(BUCKET)])]
    result = reduce(add, result)

    # create each input PCollection name and unique step labels
    variables = ['p{}'.format(i) for i in range(len(result))]
    read_labels = ['Read file {}'.format(i) for i in range(len(result))]
    add_filename_labels = ['Add filename {}'.format(i) for i in range(len(result))]

    # load each input file into a separate PCollection and add filename to each row
    for i in range(len(result)):
        # globals()[variables[i]] = p | read_labels[i] >> ReadFromText(result[i].path) | add_filename_labels[i] >> beam.Map(lambda elem: (result[i].path, elem))
        globals()[variables[i]] = p | read_labels[i] >> ReadFromText(result[i].path) | add_filename_labels[i] >> beam.ParDo(AddFilenamesFn(), result[i].path)

    # flatten all PCollections into a single one
    merged = [globals()[variables[i]] for i in range(len(result))] | 'Flatten PCollections' >> beam.Flatten() | 'Write results' >> beam.Map(write_res)

    p.run()

if __name__ == '__main__':
    run()
I had to read some metadata files and use the filename for further processing.
I struggled until I finally came across apache_beam.io.ReadFromTextWithFilename:
def run(argv=None, save_main_session=True):
    import argparse
    import json
    import typing

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions
    from apache_beam.io import ReadFromTextWithFilename

    class ExtractMetaData(beam.DoFn):
        def process(self, element):
            filename, meta = element
            image_name = filename.split("/")[-2]
            labels = json.loads(meta)["labels"]
            image = {"image_name": image_name, "labels": labels}
            print(image)
            yield image

    parser = argparse.ArgumentParser()
    known_args, pipeline_args = parser.parse_known_args(argv)
    pipeline_options = PipelineOptions(pipeline_args)

    with beam.Pipeline(options=pipeline_options) as pipeline:
        meta = (
            pipeline
            | "Read Metadata" >> ReadFromTextWithFilename(f'gs://{BUCKET}/dev-set/**/*metadata.json')
            | beam.ParDo(ExtractMetaData())
        )

Conditional statement Python Apache Beam pipeline

Current situation
The purpose of this pipeline is to read the payload with geodata from Pub/Sub; the data is then transformed and analyzed, and finally the pipeline returns whether a condition is true or false.
with beam.Pipeline(options=pipeline_options) as p:
    raw_data = (p
                | 'Read from PubSub' >> beam.io.ReadFromPubSub(
                    subscription='projects/XXX/subscriptions/YYY'))

    geo_data = (raw_data
                | 'Geo data transform' >> beam.Map(lambda s: GeoDataIngestion(s)))

def GeoDataIngestion(string_input):
    <...>
    return True or False
Desirable situation 1
If the GeoDataIngestion result is true, then raw_data will be stored in BigQuery
geo_data = (raw_data
            | 'Geo data transform' >> beam.Map(lambda s: GeoDataIngestion(s))
            | 'Evaluate condition' >> beam.Map(lambda s: Condition(s))
            )

def Condition(condition):
    if condition:
        <...WriteToBigQuery...>
# The class I used before to store raw_data without depending on evaluate condition:
class WriteToBigQuery(beam.PTransform):
    def expand(self, pcoll):
        return (
            pcoll
            | 'Format' >> beam.ParDo(FormatBigQueryFn())
            | 'Write to BigQuery' >> beam.io.WriteToBigQuery(
                'XXX',
                schema=TABLE_SCHEMA,
                create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))
Desirable situation 2
Instead of storing the data in BigQuery, it would also be good to send it to Pub/Sub
def Condition(condition):
    if condition:
        <...SendToPubSub(Topic1)...>
    else:
        <...SendToPubSub(Topic2)...>
Here, the problem is setting the topic depending on the condition result, because I'm not able to pass the topic as a parameter in the pipeline:
| beam.io.WriteStringsToPubSub(TOPIC)
Neither in a function/class
Question
How can I do that?
How/where should I call WriteToBigQuery to store the PCollection raw_data if the result of Evaluate condition is true?
I think branching collections based on the evaluation condition result might be helpful for your scenario. Please see the documentation here.
To illustrate the branching, suppose I have the collection below, where you want to do a different action based on the content of each string:
'this line is for BigQuery',
'this line for pubsub topic1',
'this line for pubsub topic2'
The code below will tag the collection, and you can get three different PCollections based on the tags. Then you can decide what further actions you want to perform on the individual collections.
import sys

import apache_beam as beam
from apache_beam import pvalue

class Split(beam.DoFn):
    # These tags will be used to tag the outputs of this DoFn.
    OUTPUT_TAG_BQ = 'BigQuery'
    OUTPUT_TAG_PS1 = 'pubsub topic1'
    OUTPUT_TAG_PS2 = 'pubsub topic2'

    def process(self, element):
        """
        tags the input as it processes the original PCollection
        """
        print(element)
        if "BigQuery" in element:
            yield pvalue.TaggedOutput(self.OUTPUT_TAG_BQ, element)
            print('found bq')
        elif "pubsub topic1" in element:
            yield pvalue.TaggedOutput(self.OUTPUT_TAG_PS1, element)
        elif "pubsub topic2" in element:
            yield pvalue.TaggedOutput(self.OUTPUT_TAG_PS2, element)

if __name__ == '__main__':
    output_prefix = 'C:\\pythonVirtual\\Mycodes\\output'
    p = beam.Pipeline(argv=sys.argv)
    lines = (p
             | beam.Create([
                 'this line is for BigQuery',
                 'this line for pubsub topic1',
                 'this line for pubsub topic2']))

    # with_outputs allows accessing the explicitly tagged outputs of a DoFn.
    tagged_lines_result = (lines
                           | beam.ParDo(Split()).with_outputs(
                               Split.OUTPUT_TAG_BQ,
                               Split.OUTPUT_TAG_PS1,
                               Split.OUTPUT_TAG_PS2))

    # tagged_lines_result is an object of type DoOutputsTuple. It supports
    # accessing results in alternative ways.
    bq_records = tagged_lines_result[Split.OUTPUT_TAG_BQ] | "write BQ" >> beam.io.WriteToText(output_prefix + 'bq')
    ps1_records = tagged_lines_result[Split.OUTPUT_TAG_PS1] | "write PS1" >> beam.io.WriteToText(output_prefix + 'ps1')
    ps2_records = tagged_lines_result[Split.OUTPUT_TAG_PS2] | "write PS2" >> beam.io.WriteToText(output_prefix + 'ps2')

    p.run().wait_until_finish()
Please let me know if that helps.
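As a hedged sketch (not part of the original answer) of how those tagged outputs could be wired to the sinks the question actually asks for: the table, schema, and topic names below are placeholders, the Format step assumes a one-column schema, and WriteToPubSub expects bytes on a streaming pipeline.

bq_records = tagged_lines_result[Split.OUTPUT_TAG_BQ]
ps1_records = tagged_lines_result[Split.OUTPUT_TAG_PS1]
ps2_records = tagged_lines_result[Split.OUTPUT_TAG_PS2]

_ = (bq_records
     | 'Format for BQ' >> beam.Map(lambda line: {'line': line})
     | 'Write to BigQuery' >> beam.io.WriteToBigQuery(
         'project_id:dataset_id.table_name',
         schema='line:STRING',
         create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
         write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))

_ = (ps1_records
     | 'Encode topic1' >> beam.Map(lambda line: line.encode('utf-8'))
     | 'To topic1' >> beam.io.WriteToPubSub('projects/XXX/topics/topic1'))

_ = (ps2_records
     | 'Encode topic2' >> beam.Map(lambda line: line.encode('utf-8'))
     | 'To topic2' >> beam.io.WriteToPubSub('projects/XXX/topics/topic2'))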

Python: Automatic sharding in FileSystems.create() for a writing channel in Dataflow

I was using FileSystems inside a ParDo to be able to write to dynamic destinations in Cloud Storage.
However, I was not able to get automatic sharding the way TextIO does with a wildcard in the filename.
Is there a way that I can do automatic sharding in FileSystems.create?
Edited:
This is the pipeline I ran; the part of the code in question is WriteToStorage, where I want to write the result to date{week}at{year}/results*.json
with beam.Pipeline(options=pipeline_options) as p:
    pcoll = (p | ReadFromText(known_args.input)
               | beam.ParDo(WriteDecompressedFile())
               | beam.Map(lambda x: ('{week}at{year}'.format(week=x['week'], year=x['year']), x))
               | beam.GroupByKey()
               | beam.ParDo(WriteToStorage()))
Here's the current version of WriteToStorage()
class WriteToStorage(beam.DoFn):
    def __init__(self):
        self.writer = None

    def process(self, element):
        (key, val) = element
        week, year = [int(x) for x in key.split('at')]
        if self.writer is None:
            path = known_args.output + 'date-{week}at{year}/revisions-from-{rev}.json'.format(week=week, year=year, rev=element['rev_id'])
            self.writer = filesystems.FileSystems.create(path)
            logging.info('USERLOG: Write to path %s.' % path)
        logging.info('TESTLOG: %s.' % type(val))
        for output in val:
            self.writer.write(json.dumps(output) + '\n')

    def finish_bundle(self):
        if self.writer is not None:
            self.writer.close()
Thank you.
You can use the start_bundle() method of your DoFn to open a connection to a new file for each bundle.
However, you have to figure out a way to independently name your files. TextIO seems to do this with _RoundRobinKeyFn.
A simpler way would be to use a timestamp for name generation, but I don't know how foolproof this would be.
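A minimal, untested sketch of that idea follows; the uuid-based naming and the output-prefix handling are assumptions, since the answer itself only points at start_bundle and _RoundRobinKeyFn:

import json
import uuid
import apache_beam as beam
from apache_beam.io import filesystems

class WriteToStorage(beam.DoFn):
    def __init__(self, output_prefix):
        self.output_prefix = output_prefix

    def start_bundle(self):
        # A fresh, uniquely named file per bundle instead of one held for the DoFn's lifetime.
        self.writer = None

    def process(self, element):
        key, records = element  # e.g. ('12at2018', iterable of dicts) after GroupByKey
        if self.writer is None:
            path = '{}date-{}/results-{}.json'.format(
                self.output_prefix, key, uuid.uuid4().hex)
            self.writer = filesystems.FileSystems.create(path)
        for record in records:
            self.writer.write((json.dumps(record) + '\n').encode('utf-8'))

    def finish_bundle(self):
        if self.writer is not None:
            self.writer.close()
            self.writer = None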
