I followed the code snippet of Slowly updating side input using windowing example but it printed nothing.
It only runs side_input instead of the whole.
I use DirectRunner for this.
import apache_beam as beam
from apache_beam.transforms.periodicsequence import PeriodicImpulse
from apache_beam.transforms.window import TimestampedValue
from apache_beam.transforms import window
def cross_join(left, rights):
for x in rights:
yield left, x
if __name__ == '__main__':
data = list(range(1, 100))
pattern = 'pat'
main_interval = 10
side_interval = 5
pipeline = beam.Pipeline()
side_input = (
pipeline
| 'PeriodicImpulse' >> PeriodicImpulse(fire_interval=side_interval, apply_windowing=True)
| 'MapToFileName' >> beam.Map(lambda x: pattern + str(x)))
main_input = (
pipeline
| 'MpImpulse' >> beam.Create(data)
| 'MapMpToTimestamped' >> beam.Map(lambda src: TimestampedValue(src, src))
| 'WindowMpInto' >> beam.WindowInto(window.FixedWindows(main_interval)))
result = (
main_input
| 'ApplyCrossJoin' >> beam.FlatMap(cross_join, rights=beam.pvalue.AsIter(side_input))
| 'log' >> beam.Map(print))
res = pipeline.run()
res.wait_until_finish()
thanks and regards.
This is because the Python FnApiRunner does not yet support streaming mode. It is trying to run PeriodicImpulse to completion before executing the next stage of the pipeline. This is curently being worked on at BEAM-7514.
Unfortunately, it looks like the BundleBasedDirectRunner which can do streaming in Python has issues with PeriodicImpulse as well.
You could try using a "real" runner, e.g. Flink in local mode, by passing runner='Flink' to your pipeline creation.
Note also that your side input needs to be windowed as well, and the side and main inputs will be lined up by window.
Related
I'm trying to use Apache Beam to create a pipeline that -
Reads messages from a stream
Groups the messages by some id
Appends the message to a file with a corresponding id
The append operation is not atomic - it's composed of reading the file's content, appending the new messages in memory, and overriding the file with the new content. I'm using fixed windowing, and I noticed that sometimes I have two consecutive windows that are fired together which causes a race with writing the file.
A way to reproduce this behavior is to use FixedWindows(5) (fixed window of 5 seconds) with AfterProcessingTime(20) which causes a window to be fired only every 20 seconds. So, 4 windows contained in this 20 seconds period will be fired together. For example, look at the following code -
pipeline_options = PipelineOptions(
streaming=True
)
with beam.Pipeline(options=pipeline_options) as p:
class SleepAndCon(beam.DoFn):
def process(self, element, timestamp=beam.DoFn.TimestampParam, window=beam.DoFn.WindowParam):
window_start = str(window.start.to_utc_datetime()).split()[1]
window_end = str(window.end.to_utc_datetime()).split()[1]
timestamp = str(timestamp.to_utc_datetime()).split()[1]
curr_time = str(datetime.utcnow()).split()[1].split(".")[0]
print(f"sleep_and_con {curr_time}-{timestamp}-{window_start}-{window_end}")
return [element]
def print_end(e):
print("end")
(p
| beam.io.ReadFromPubSub(subscription="...")
| beam.Map(lambda x: ('a', x))
| beam.WindowInto(
window.FixedWindows(5),
trigger=Repeatedly(AfterProcessingTime(20)),
accumulation_mode=False
)
| beam.GroupByKey()
| beam.ParDo(SleepAndCon())
| beam.Map(print_end)
)
After running and sending 5 messages every 5 seconds, we got the following output:
sleep_and_con 13:58:19-13:57:39.999999-13:57:35-13:57:40
sleep_and_con 13:58:19-13:57:44.999999-13:57:40-13:57:45
sleep_and_con 13:58:19-13:57:49.999999-13:57:45-13:57:50
sleep_and_con 13:58:19-13:57:54.999999-13:57:50-13:57:55
sleep_and_con 13:58:19-13:57:59.999999-13:57:55-13:58:00
end
end
end
end
end
But this behavior happens to me also without the AfterProcessingTime trigger, randomly (I didn't manage to understand when exactly it happens)
My question is - is there a way to block a window trigger until a previous window finishes the pipeline ?
Thanks!
I want to check, if the CSV file we read in the pipeline of apache beam, satisfies the format I'm expecting it to be in [Ex: field check, type check, null value check, etc.], before performing any transformation.
Performing these checks outside the pipeline for every file will take away the concept of parallelism, so I just wanted to know if it was possible to perform it inside the pipeline.
An example of what the code might look like:
import apache_beam as beam
branched=beam.Pipeline()
class f_c(beam.DoFn):
def process(self, element):
if element == feld:
return True
else:
return False
input_collection = (
branched
| 'Read from text file' >> beam.io.ReadFromText("gs://***.csv")
| 'Split rows' >> beam.Map(lambda line: line.split(',')))
field_check=(input_collection
| 'field function returning True or False' >> beam.ParDo(f_c())
| beam.io.WriteToText('gs://***/op'))
branched.run().wait_unitl_finish
I replicated a small instance of the problem with with just two rules.
if the length of 3th column is >10 and value of 4th column is >500 then that record is good else its a bad record.
from apache_beam.io.textio import WriteToText
import apache_beam as beam
class FilterFn(beam.DoFn):
"""
The rules I have assumed is to just perform a length check on the third column and
value check on forth column
if(length of 3th column) >10 and (value of 4th column) is >500 then that record is good
"""
def process(self, text):
#------------------ -----isGoodRow BEGINS-------------------------------
def isGoodRow(a_list):
if( (len(a_list[2]) > 10) and (int(a_list[3]) >100) ):
return True
else:
return False
#------------------- ----isGoodRow ENDS-------------------------------
a_list = []
a_list = text.split(",") # this list contains all the column for a periticuar i/p Line
bool_result = isGoodRow(a_list)
if(bool_result == True):
yield beam.TaggedOutput('good', text)
else:
yield beam.TaggedOutput('bad', text)
with beam.Pipeline() as pipeline:
split_me = (
pipeline
| 'Read from text file' >> beam.io.ReadFromText("/content/sample.txt")
| 'Split words and Remove N/A' >> beam.ParDo(FilterFn()).with_outputs('good','bad')
)
good_collection = (
split_me.good
|"write good o/p" >> beam.io.WriteToText(file_path_prefix='/content/good',file_name_suffix='.txt')
)
bad_collection = (
split_me.bad
|"write bad o/p" >> beam.io.WriteToText(file_path_prefix='/content/bad',file_name_suffix='.txt')
)
I have written a ParDo FilterFn that Tags output based of isGoodRow function. You can rewrite the logic with your requirements ex. null checks, data validation etc.
The other option could be to use a Partition Function for this use case.
The sample.txt I used for this example:
1,Foo,FooBarFooBarFooBar,1000
2,Foo,Bar,10
3,Foo,FooBarFooBarFooBar,900
4,Foo,FooBar,800
Output of good.txt
1,Foo,FooBarFooBarFooBar,1000
3,Foo,FooBarFooBarFooBar,900
Output of bad.txt
2,Foo,Bar,10
4,Foo,FooBar,800
I'm trying to count kafka message key, by using direct runner.
If I put max_num_records =20 in ReadFromKafka, I can see results printed or outputed to text.
like:
('2102', 5)
('2706', 5)
('2103', 5)
('2707', 5)
But without max_num_records, or if max_num_records is larger than message count in kafka topic, the program keeps running but nothing is outputed.
If I try to output with beam.io.WriteToText, there will be an empty temp folder created, like:
beam-temp-StatOut-d16768eadec511eb8bd897b012f36e97
Terminal shows:
2.30.0: Pulling from apache/beam_java8_sdk
Digest: sha256:720144b98d9cb2bcb21c2c0741d693b2ed54f85181dbd9963ba0aa9653072c19
Status: Image is up to date for apache/beam_java8_sdk:2.30.0
docker.io/apache/beam_java8_sdk:2.30.0
If I put 'enable.auto.commit': 'true' in kafka consumer config, the messages are commited, other clients from the same group can't read them, so I assume it's reading succesfully, just not processing or outputing.
I tried Fixed-time, Sliding time windowing, with or without different trigger, nothing changes.
Tried flink runner, got same result as direct runner.
No idea what I did wrong, any help?
environment:
centos 7
anaconda
python 3.8.8
java 1.8.0_292
beam 2.30
code as below:
direct_options = PipelineOptions([
"--runner=DirectRunner",
"--environment_type=LOOPBACK",
"--streaming",
])
direct_options.view_as(SetupOptions).save_main_session = True
direct_options.view_as(StandardOptions).streaming = True
conf = {'bootstrap.servers': '192.168.75.158:9092',
'group.id': "g17",
'enable.auto.commit': 'false',
'auto.offset.reset': 'earliest'}
if __name__ == '__main__':
with beam.Pipeline(options = direct_options) as p:
msg_kv_bytes = ( p
| 'ReadKafka' >> ReadFromKafka(consumer_config=conf,topics=['LaneIn']))
messages = msg_kv_bytes | 'Decode' >> beam.MapTuple(lambda k, v: (k.decode('utf-8'), v.decode('utf-8')))
counts = (
messages
| beam.WindowInto(
window.FixedWindows(10),
trigger = AfterCount(1),#AfterCount(4),#AfterProcessingTime
# allowed_lateness=3,
accumulation_mode = AccumulationMode.ACCUMULATING) #ACCUMULATING #DISCARDING
# | 'Windowsing' >> beam.WindowInto(window.FixedWindows(10, 5))
| 'TakeKeyPairWithOne' >> beam.MapTuple(lambda k, v: (k, 1))
| 'Grouping' >> beam.GroupByKey()
| 'Sum' >> beam.MapTuple(lambda k, v: (k, sum(v)))
)
output = (
counts
| 'Print' >> beam.ParDo(print)
# | 'WriteText' >> beam.io.WriteToText('/home/StatOut',file_name_suffix='.txt')
)
There are couple of known issues that you might be running into.
Beam's portable DirectRunner currently does not fully support streaming. Relevant Jira to follow is https://issues.apache.org/jira/browse/BEAM-7514
Beam's portable runners (including DirectRunner) has a known issue where streaming sources do not properly emit messages. Hence max_num_records or max_read_time arguments have to be provided to convert such sources to bounded sources. Relevant Jira to follow is https://issues.apache.org/jira/browse/BEAM-11998.
need some help. I have some trivial task reading from Pub/Sub and write to batch file in GCS, but have some struggle with fileio.WriteToFiles
with beam.Pipeline(options=pipeline_options) as p:
input = (p | 'ReadData' >> beam.io.ReadFromPubSub(topic=known_args.input_topic).with_output_types(bytes)
| "Decode" >> beam.Map(lambda x: x.decode('utf-8'))
| 'Parse' >> beam.Map(parse_json)
| ' data w' >> beam.WindowInto(
FixedWindows(60),
accumulation_mode=AccumulationMode.DISCARDING
))
event_data = (input
| 'filter events' >> beam.Filter(lambda x: x['t'] == 'event')
| 'encode et' >> beam.Map(lambda x: json.dumps(x))
| 'write events to file' >> fileio.WriteToFiles(
path='gs://extention/ga_analytics/events/', shards=0))
I need one file after my window fires, but the number of files is equal to the number of messages from Pubsub, can anyone help me?
current output files
but i need only one file.
I recently ran into this issue and dug into the source code:
fileio.WriteToFiles will attempt to output each element in the bundle as an individual file. WireToFiles will only fallback to writing to sharded files if the number of elements exceeds the number of writers.
To force a single file to be created for all elements in the window, set max_writers_per_bundle to 0
WriteToFiles(shards=1, max_writers_per_bundle=0)
You can pass WriteToFiles(shards=1, ...) to limit the output to a single shard per window and triggering.
I am attempting to implement the slowly updating global window side inputs example from the documentation from java into python and I am kinda stuck on what the AfterProcessingTime.pastFirstElementInPane() equivalent in python. For the map I've done something like this:
class ApiKeys(beam.DoFn):
def process(self, elm) -> Iterable[Dict[str, str]]:
yield TimestampedValue(
{"<api_key_1>": "<account_id_1>", "<api_key_2>": "<account_id_2>",},
elm,
)
map = beam.pvalue.AsSingleton(
p
| "trigger pipeline" >> beam.Create([None])
| "define schedule"
>> beam.Map(
lambda _: (
0, # would be timestamp.Timestamp.now() in production
20, # would be timestamp.MAX_TIMESTAMP in production
1, # would be around 1 hour or so in production
)
)
| "GenSequence"
>> PeriodicSequence()
| "ApplyWindowing"
>> beam.WindowInto(
beam.window.GlobalWindows(),
trigger=Repeatedly(Always(), AfterProcessingTime(???)),
accumulation_mode=AccumulationMode.DISCARDING,
)
| "api_keys" >> beam.ParDo(ApiKeys())
)
I am hoping to use this as a Dict[str, str] input to a downstream function that will have windows of 60 seconds, merging with this one that I hope to update on an hourly basis.
The point is to run this on google cloud dataflow (where we currently just re-release it to update the api_keys).
I've pasted the java example from the documentation below for convenience sake:
public static void sideInputPatterns() {
// This pipeline uses View.asSingleton for a placeholder external service.
// Run in debug mode to see the output.
Pipeline p = Pipeline.create();
// Create a side input that updates each second.
PCollectionView<Map<String, String>> map =
p.apply(GenerateSequence.from(0).withRate(1, Duration.standardSeconds(5L)))
.apply(
Window.<Long>into(new GlobalWindows())
.triggering(Repeatedly.forever(AfterProcessingTime.pastFirstElementInPane()))
.discardingFiredPanes())
.apply(
ParDo.of(
new DoFn<Long, Map<String, String>>() {
#ProcessElement
public void process(
#Element Long input, OutputReceiver<Map<String, String>> o) {
// Replace map with test data from the placeholder external service.
// Add external reads here.
o.output(PlaceholderExternalService.readTestData());
}
}))
.apply(View.asSingleton());
// Consume side input. GenerateSequence generates test data.
// Use a real source (like PubSubIO or KafkaIO) in production.
p.apply(GenerateSequence.from(0).withRate(1, Duration.standardSeconds(1L)))
.apply(Window.into(FixedWindows.of(Duration.standardSeconds(1))))
.apply(Sum.longsGlobally().withoutDefaults())
.apply(
ParDo.of(
new DoFn<Long, KV<Long, Long>>() {
#ProcessElement
public void process(ProcessContext c) {
Map<String, String> keyMap = c.sideInput(map);
c.outputWithTimestamp(KV.of(1L, c.element()), Instant.now());
LOG.debug(
"Value is {}, key A is {}, and key B is {}.",
c.element(),
keyMap.get("Key_A"),
keyMap.get("Key_B"));
}
})
.withSideInputs(map));
}
/** Placeholder class that represents an external service generating test data. */
public static class PlaceholderExternalService {
public static Map<String, String> readTestData() {
Map<String, String> map = new HashMap<>();
Instant now = Instant.now();
DateTimeFormatter dtf = DateTimeFormat.forPattern("HH:MM:SS");
map.put("Key_A", now.minus(Duration.standardSeconds(30)).toString(dtf));
map.put("Key_B", now.minus(Duration.standardSeconds(30)).toString());
return map;
}
}
Any ideas as to how to emulate this example would be enormously appreciated, I've spent literally days on this issue now :(
Update #2 based on #AlexanderMoraes
So, I've tried changing it according to my understanding of your suggestions:
main_window_size = 5
trigger_interval = 30
side_input = beam.pvalue.AsSingleton(
p
| "trigger pipeline" >> beam.Create([None])
| "define schedule"
>> beam.Map(
lambda _: (
0, # timestamp.Timestamp.now().__float__(),
60, # timestamp.Timestamp.now().__float__() + 30.0,
trigger_interval, # fire_interval
)
)
| "GenSequence" >> PeriodicSequence()
| "api_keys" >> beam.ParDo(ApiKeys())
| "window"
>> beam.WindowInto(
beam.window.GlobalWindows(),
trigger=Repeatedly(AfterProcessingTime(window_size)),
accumulation_mode=AccumulationMode.DISCARDING,
)
)
But when combining this with another pipeline with windowing set to something smaller than trigger_interval I am unable to use the dictionary as a singleton because for some reason they are duplicated:
ValueError: PCollection of size 2 with more than one element accessed as a singleton view. First two elements encountered are "{'<api_key_1>': '<account_id_1>', '<api_key_2>': '<account_id_2>'}", "{'<api_key_1>': '<account_id_1>', '<api_key_2>': '<account_id_2>'}". [while running 'Pair with AccountIDs']
Is there some way to clarify that the singleton output should ignore whatever came before it?
The title of the question "slowly updating side inputs" refers to the documentation, which already has a Python version of the code. However, the code you provided is from "updating global window side inputs", which just has the Java version for the code. So I will be addressing an answer for the second one.
You are not able to reproduce the AfterProcessingTime.pastFirstElementInPane() within Python. This function is used to fire triggers, which determine when to emit results of each window (refered as pane). In your case, this particular call AfterProcessingTime.pastFirstElementInPane() creates a trigger that fires when the current processing time passes the processing time at which this trigger saw the first element in a pane, here. In Python this is achieve using AfterWatermark and AfterProcessingTime().
Below, there are two pieces of code one in Java and another one in Python. Thus, you can understand more about each one's usage. Both examples set a time-based trigger which emits results one minute after the first element of the window has been processed. Also, the accumulation mode is set for not accumulating the results (Java: discardingFiredPanes() and Python: accumulation_mode=AccumulationMode.DISCARDING).
1- Java:
PCollection<String> pc = ...;
pc.apply(Window.<String>into(FixedWindows.of(1, TimeUnit.MINUTES))
.triggering(AfterProcessingTime.pastFirstElementInPane()
.plusDelayOf(Duration.standardMinutes(1)))
.discardingFiredPanes());
2- Python: the trigger configuration is the same as described in point 1
pcollection | WindowInto(
FixedWindows(1 * 60),
trigger=AfterProcessingTime(1 * 60),
accumulation_mode=AccumulationMode.DISCARDING)
The examples above were taken from thew documentation.
AsIter() insted of AsSingleton() worked for me