How to read multiple files in Apache Beam from GCP bucket - python

I am trying to read and apply some subsetting to multiple files in GCP with Apache Beam. I prepared two pipelines which work for a single file, but fail when I try them on multiple files. Apart from this, it would be handy to combine my pipelines into one if possible, or to orchestrate them so that they run in order. Right now the pipelines work locally, but my ultimate goal is to run them with Dataflow.
I tried textio.ReadFromText and textio.ReadAllFromText, but I couldn't make either work in the case of multiple files.
import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def toJson(file):
    with open(file) as f:
        return json.load(f)

with beam.Pipeline(options=PipelineOptions()) as p:
    files = (p
        | beam.io.textio.ReadFromText("gs://my_bucket/file1.txt.gz", skip_header_lines=0)
        | beam.io.WriteToText("/home/test",
            file_name_suffix=".json", num_shards=1, append_trailing_newlines=True))
with beam.Pipeline(options=PipelineOptions()) as p:
    lines = (p
        | 'read_data' >> beam.Create(['test-00000-of-00001.json'])
        | "toJson" >> beam.Map(toJson)
        | "takeItems" >> beam.FlatMap(lambda line: line["Items"])
        | "takeSubjects" >> beam.FlatMap(lambda line: line['data']['subjects'])
        | beam.combiners.Count.PerElement()
        | beam.io.WriteToText("/home/items",
            file_name_suffix=".txt", num_shards=1, append_trailing_newlines=True))
These two pipelines work well for a single file, but I have hundreds of files in the same format and would like to take advantage of parallel computing.
Is there a way to make this pipeline work for multiple files under the same directory?
Is it possible to do this within a single pipeline instead of creating two different pipelines? (It is not convenient to write files from the bucket to the worker nodes.)

I figured out how to make it work for multiple files, but I still couldn't make it run within a single pipeline. I used a for loop and then beam.Flatten.
Here is my solution:
file_list = ["gs://my_bucket/file*.txt.gz"]
res_list = ["/home/subject_test_{}-00000-of-00001.json".format(i) for i in range(len(file_list))]

with beam.Pipeline(options=PipelineOptions()) as p:
    for i, file in enumerate(file_list):
        (p
            | "Read Text {}".format(i) >> beam.io.textio.ReadFromText(file, skip_header_lines=0)
            | "Write Text {}".format(i) >> beam.io.WriteToText("/home/subject_test_{}".format(i),
                file_name_suffix=".json", num_shards=1, append_trailing_newlines=True))

pcols = []
with beam.Pipeline(options=PipelineOptions()) as p:
    for i, res in enumerate(res_list):
        pcol = (p
            | 'read_data_{}'.format(i) >> beam.Create([res])
            | "toJson_{}".format(i) >> beam.Map(toJson)
            | "takeItems_{}".format(i) >> beam.FlatMap(lambda line: line["Items"])
            | "takeSubjects_{}".format(i) >> beam.FlatMap(lambda line: line['data']['subjects']))
        pcols.append(pcol)
    out = (pcols
        | beam.Flatten()
        | beam.combiners.Count.PerElement()
        | beam.io.WriteToText("/home/items",
            file_name_suffix=".txt", num_shards=1, append_trailing_newlines=True))

Related

Writing Apache Beam Tagged Output (Dataflow runner) to different BQ tables

It seems that I have an issue writing tagged PCollections to multiple destination tables in BQ. The pipeline executes with no errors, but no data gets written.
If I execute the pipeline without TaggedOutput, the PCollection elements are generated correctly and written correctly to a BQ table on their own (albeit a single table instead of multiple). So I believe the issue is my misunderstanding of how TaggedOutput actually works?
Code
I have a process fn that generates tagged output:
class ProcessFn(beam.DoFn):
    def process(self, el):
        if el > 5:
            yield TaggedOutput('more_than_5', el)
        else:
            yield TaggedOutput('less_than_5', el)
And the pipeline:
with beam.Pipeline(options=beam_options) as p:
    # Read the table rows into a PCollection.
    results = (
        p
        | "read" >> beam.io.ReadFromBigQuery(table=args.input_table, use_standard_sql=True)
        | "process rows" >> beam.ParDo(ProcessFn()).with_outputs(
            'more_than_5',
            main='less_than_5')
    )
    results.less_than_5 | "write to bq 1" >> beam.io.WriteToBigQuery(
        'dataset.less_than_5',
        schema=less_than_5_schema,
        create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
    )
    results.more_than_5 | "write to bq 2" >> beam.io.WriteToBigQuery(
        'dataset.more_than_5',
        schema=more_than_5_schema,
        create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
    )
The with_outputs(main=...) keyword is for elements yielded without a TaggedOutput wrapper. In this case you should probably write with_outputs('more_than_5', 'less_than_5'). Either accessing the results by name or unpacking them as a tuple should then work.
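A minimal, untested sketch of that suggestion, reusing the names, schemas, and options from the question:

with beam.Pipeline(options=beam_options) as p:
    results = (
        p
        | "read" >> beam.io.ReadFromBigQuery(table=args.input_table, use_standard_sql=True)
        # Declare both tags explicitly; every element is emitted via TaggedOutput,
        # so nothing flows through the (unnamed) main output.
        | "process rows" >> beam.ParDo(ProcessFn()).with_outputs('more_than_5', 'less_than_5')
    )

    # Access each tagged PCollection by name and write it to its own table.
    results.less_than_5 | "write to bq 1" >> beam.io.WriteToBigQuery(
        'dataset.less_than_5', schema=less_than_5_schema)
    results.more_than_5 | "write to bq 2" >> beam.io.WriteToBigQuery(
        'dataset.more_than_5', schema=more_than_5_schema)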
I think the problem is due to the way the results of the multi-output ParDo are retrieved in your code.
The result should be retrieved as a tuple:
results_less_than_5, result_more_than_5 = (
    p
    | "read" >> beam.io.ReadFromBigQuery(table=args.input_table, use_standard_sql=True)
    | "process rows" >> beam.ParDo(ProcessFn()).with_outputs(
        'more_than_5',
        main='less_than_5')
)

results_less_than_5 | "write to bq 1" >> beam.io.WriteToBigQuery(
    'dataset.less_than_5',
    schema=less_than_5_schema,
    create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
)

result_more_than_5 | "write to bq 2" >> beam.io.WriteToBigQuery(
    'dataset.more_than_5',
    schema=more_than_5_schema,
    create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
)
Can you try this syntax?

Is there a way to read a multi-line csv file in Apache Beam using the ReadFromText transform (Python)?

Is there a way to read a multi-line csv file using the ReadFromText transform in Python? I have a file in which one record spans two lines, and I am trying to make Apache Beam read the input as one element, but I cannot get it to work.
import apache_beam

def print_each_line(line):
    print line

path = './input/testfile.csv'

# Here are the contents of testfile.csv:
# foo,bar,"blah blah
# more blah blah",baz

p = apache_beam.Pipeline()
(p
    | 'ReadFromFile' >> apache_beam.io.ReadFromText(path)
    | 'PrintEachLine' >> apache_beam.FlatMap(lambda line: print_each_line(line))
)
p.run()

# Here is the output:
# foo,bar,"blah blah
# more blah blah",baz
The above code parses the input as two elements, even though the standard for multi-line CSV files is to wrap multi-line fields in double quotes.
Beam doesn't support parsing CSV files. You can however use Python's csv.reader. Here's an example:
import apache_beam
import csv

def print_each_line(line):
    print line

p = apache_beam.Pipeline()
(p
    | apache_beam.Create(["test.csv"])
    | apache_beam.FlatMap(lambda filename:
        csv.reader(apache_beam.io.filesystems.FileSystems.open(filename)))
    | apache_beam.FlatMap(print_each_line))
p.run()
Output:
['foo', 'bar', 'blah blah\nmore blah blah', 'baz']
None of the other answers worked for me, but this did:
import io
import csv
import apache_beam as beam
from apache_beam.io import WriteToText

(
    p
    | beam.Create(['data/test.csv'])
    | beam.FlatMap(lambda filename:
        csv.reader(io.TextIOWrapper(beam.io.filesystems.FileSystems.open(filename))))
    | "Take only name" >> beam.Map(lambda x: x[0])
    | WriteToText(known_args.output)
)
ReadFromText parses a text file as newline-delimited elements. So ReadFromText treats two lines as two elements. If you would like to have the contents of the file as a single element, you could do the following:
contents = []
contents.append(open(path).read())

p = apache_beam.Pipeline()
p | apache_beam.Create(contents)
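If you would rather read whole files inside the pipeline (so the read also happens on the workers rather than in the launcher), Beam's fileio transforms can do it. This is a rough, untested sketch that assumes a reasonably recent Beam version and reuses the testfile.csv path from the question:

import apache_beam as beam
from apache_beam.io import fileio

with beam.Pipeline() as p:
    (p
        # Match one or more file patterns and read each matched file in full.
        | "Match" >> fileio.MatchFiles('./input/testfile.csv')
        | "ReadMatches" >> fileio.ReadMatches()
        # Each element is a ReadableFile; read_utf8() returns the whole file as one string.
        | "WholeFileAsString" >> beam.Map(lambda readable: readable.read_utf8()))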

Modify MinimalWordCount example to read from BigQuery

I am trying to modify Apache Beam's MinimalWordCount Python example to read from a BigQuery table. I have made the following modifications, and the query appears to work, but the example then fails.
Original Example Here:
with beam.Pipeline(options=pipeline_options) as p:
    # Read the text file[pattern] into a PCollection.
    lines = p | ReadFromText(known_args.input)

    # Count the occurrences of each word.
    counts = (
        lines
        | 'Split' >> (beam.FlatMap(lambda x: re.findall(r'[A-Za-z\']+', x))
                      .with_output_types(unicode))
        | 'PairWithOne' >> beam.Map(lambda x: (x, 1))
        | 'GroupAndSum' >> beam.CombinePerKey(sum))

    # Format the counts into a PCollection of strings.
    output = counts | 'Format' >> beam.Map(lambda (w, c): '%s: %s' % (w, c))

    # Write the output using a "Write" transform that has side effects.
    # pylint: disable=expression-not-assigned
    output | WriteToText(known_args.output)
Rather than ReadFromText I am trying to adjust this to read from a column in a BigQuery table. To do this I have replaced lines = p | ReadFromText(known_args.input) with the following code:
query = 'SELECT text_column FROM `bigquery.table.goes.here` '
lines = p | 'ReadFromBigQuery' >> beam.io.Read(beam.io.BigQuerySource(query=query, use_standard_sql=True))
When I re-run the pipeline, I get the error: "WARNING:root:A task failed with exception. expected string or buffer [while running 'Split']"
I recognize that the 'Split' operation is expecting a string and it is clearly not getting a string. How can I modify 'ReadFromBigQuery' so that it is passing a string/buffer? Do I need to provide a table schema or something to convert the results of 'ReadFromBigQuery' into a buffer of strings?
This is because BigQuerySource returns a PCollection of dictionaries (dict), where every key in a dictionary represents a column. In your case, the simplest thing to do is to apply beam.Map right after beam.io.Read(beam.io.BigQuerySource(query=query, use_standard_sql=True)), like this:
lines = (p
    | "ReadFromBigQuery" >> beam.io.Read(beam.io.BigQuerySource(query=query, use_standard_sql=True))
    | "Extract text column" >> beam.Map(lambda row: row.get("text_column"))
)
If you encounter a problem with the column name, try changing it to u"text_column".
Alternatively, you can modify your Split transform to extract the value of the column there:
'Split' >> (beam.FlatMap(lambda x: re.findall(r'[A-Za-z\']+', x.get("text_column")))
            .with_output_types(unicode))
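As a side note, newer Beam releases wrap the same source in a standalone transform, beam.io.ReadFromBigQuery (the transform used in the tagged-output question above). An equivalent, untested sketch of the read, which on Dataflow may also need a gcs_location for the temporary query export:

lines = (p
    | "ReadFromBigQuery" >> beam.io.ReadFromBigQuery(query=query, use_standard_sql=True)
    # Rows still arrive as dictionaries keyed by column name.
    | "Extract text column" >> beam.Map(lambda row: row.get("text_column"))
)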

Clean up string extracted from csv file

I am extracting certain data from a CSV file using Ruby, and I want to clean up the extracted string by removing the unwanted characters.
This is how I extract the data so far:
CSV.foreach(data_file, :encoding => 'windows-1251:utf-8', :headers => true) do |row|
  # Create an array for each page
  page_data = []
  # For each page, get the data we are interested in and save it to page_data
  page_data.push(row['dID'])
  page_data.push(row['xTerm'])
  pages_to_import.push(page_data)
end
Then I output a CSV file with the extracted data.
The extracted output is exactly as it appears in the source CSV file:
| ID | Term |
|-------|-----------------------------------------|
| 13241 | ##106#107#my##106#term## |
| 13345 | ##63#hello## |
| 11436 | ##55#rock##20#my##10015#18#world## |
However, my desired result is:
| ID | Term |
|-------|-----------------------------------------|
| 13241 | my, term |
| 13345 | hello |
| 11436 | rock, my, world |
Any suggestions on how to achieve this?
Libraries that I'm using:
require 'nokogiri'
require 'cgi'
require 'csv'
Using a regular expression, I'd do:
%w[
  ##106#107#term1##106#term2##
  ##63#term1##
  ##55#term1##20#term2##10015#18#term3##
  ##106#107#my##106#term##
  ##63#hello##
  ##55#rock##20#my##10015#18#world##
].map{ |str|
  str.scan(/[^##]+?(?=#)/)
}
# => [["term1", "term2"], ["term1"], ["term1", "term2", "term3"], ["my", "term"], ["hello"], ["rock", "my", "world"]]
My str is the equivalent of the contents of your row['xTerm'].
The regular expression /[^##]+?(?=#)/ searches for patterns in str that don't contain # or # and end with #.
From the garbage in the string, and your comment that you're using Nokogiri and CSV, and because you didn't show your input data as CSV or HTML, I have to wonder if you're not mangling the incoming data somehow, and trying to wiggle out of it in post-processing. If so, show us what you're actually doing and maybe we can help you get clean data to start.
I'm assuming your terms are bookended and separated by ## and consist of one or more numbers followed by the actual term separated by #. To get the terms into an array:
row['xTerm'].split('##')[1..-1].map { |term| term.split(?#)[-1] }
Then you can join or do whatever you want with it.

Extracting each line from a file and passing it as a variable to "foreach" loop

Could somebody help me figure out a simple way of doing this with any script? I will be running the script on Linux.
1) I have a file1 which has the following lines:
(Bank8GntR[3] | Bank8GntR[2] | Bank8GntR[1] | Bank8GntR[0] ),
(Bank7GntR[3] | Bank7GntR[2] | Bank7GntR[1] | Bank7GntR[0] ),
(Bank6GntR[3] | Bank6GntR[2] | Bank6GntR[1] | Bank6GntR[0] ),
(Bank5GntR[3] | Bank5GntR[2] | Bank5GntR[1] | Bank5GntR[0] ),
2) I need the contents of file1 to be modified as follows and written to file2:
(Bank15GntR[3] | Bank15GntR[2] | Bank15GntR[1] | Bank15GntR[0] ),
(Bank14GntR[3] | Bank14GntR[2] | Bank14GntR[1] | Bank14GntR[0] ),
(Bank13GntR[3] | Bank13GntR[2] | Bank13GntR[1] | Bank13GntR[0] ),
(Bank12GntR[3] | Bank12GntR[2] | Bank12GntR[1] | Bank12GntR[0] ),
So I have to:
read each line from file1,
search with a regular expression
to match the number in Bank[0-9]+GntR,
replace it with that number plus 7,
insert it back into the line,
and write the line to a new file.
How about something like this in Python:
import re

# A function that adds 7 to a matched group.
# We capture (Bank) as group 1 so that only the digits right after it (group 2)
# are changed, not the digits inside the brackets.
def plus7(matchobj):
    return '%s%d' % (matchobj.group(1), int(matchobj.group(2)) + 7)

# Iterate over the input file and write the modified lines to the output file.
with open('in.txt') as fhi, open('out.txt', 'w') as fho:
    for line in fhi:
        fho.write(re.sub(r'(Bank)(\d+)', plus7, line))
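Equivalently, since re.sub accepts any callable as the replacement, the helper can be folded into a lambda. A small sketch of the same substitution on one of the sample lines:

import re

line = "(Bank8GntR[3] | Bank8GntR[2] | Bank8GntR[1] | Bank8GntR[0] ),"
# Same idea as plus7 above, written inline: bump the bank number by 7.
print(re.sub(r'(Bank)(\d+)', lambda m: m.group(1) + str(int(m.group(2)) + 7), line))
# (Bank15GntR[3] | Bank15GntR[2] | Bank15GntR[1] | Bank15GntR[0] ),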
Assuming you don't have to use Python, you can do this using awk:
cat test.txt | awk 'match($0, /Bank([0-9]+)GntR/, nums) { d=nums[1]+7; gsub(/Bank[0-9]+GntR\[/, "Bank" d "GntR["); print }'
This gives the desired output.
The point here is that match() captures groups, which lets you extract the number. Since awk supports arithmetic, you can add 7 within awk and then replace all the values in the rest of the line. Note that I've assumed all the values in a line carry the same number, and that the three-argument form of match() is a GNU awk (gawk) extension.
