I'm trying to count Kafka message keys using the DirectRunner.
If I set max_num_records=20 in ReadFromKafka, I can see the results printed or written to a text file,
like:
('2102', 5)
('2706', 5)
('2103', 5)
('2707', 5)
But without max_num_records, or if max_num_records is larger than the number of messages in the Kafka topic, the program keeps running but nothing is output.
If I try to output with beam.io.WriteToText, there will be an empty temp folder created, like:
beam-temp-StatOut-d16768eadec511eb8bd897b012f36e97
Terminal shows:
2.30.0: Pulling from apache/beam_java8_sdk
Digest: sha256:720144b98d9cb2bcb21c2c0741d693b2ed54f85181dbd9963ba0aa9653072c19
Status: Image is up to date for apache/beam_java8_sdk:2.30.0
docker.io/apache/beam_java8_sdk:2.30.0
If I put 'enable.auto.commit': 'true' in the Kafka consumer config, the messages are committed and other clients in the same group can't read them, so I assume it's reading successfully, just not processing or outputting.
I tried fixed and sliding time windowing, with and without different triggers; nothing changes.
I tried the Flink runner and got the same result as with the DirectRunner.
No idea what I did wrong, any help?
environment:
centos 7
anaconda
python 3.8.8
java 1.8.0_292
beam 2.30
code as below:
import apache_beam as beam
from apache_beam import window
from apache_beam.io.kafka import ReadFromKafka
from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions, StandardOptions
from apache_beam.transforms.trigger import AccumulationMode, AfterCount

direct_options = PipelineOptions([
    "--runner=DirectRunner",
    "--environment_type=LOOPBACK",
    "--streaming",
])
direct_options.view_as(SetupOptions).save_main_session = True
direct_options.view_as(StandardOptions).streaming = True

conf = {'bootstrap.servers': '192.168.75.158:9092',
        'group.id': "g17",
        'enable.auto.commit': 'false',
        'auto.offset.reset': 'earliest'}

if __name__ == '__main__':
    with beam.Pipeline(options=direct_options) as p:
        msg_kv_bytes = (
            p
            | 'ReadKafka' >> ReadFromKafka(consumer_config=conf, topics=['LaneIn']))
        messages = msg_kv_bytes | 'Decode' >> beam.MapTuple(
            lambda k, v: (k.decode('utf-8'), v.decode('utf-8')))
        counts = (
            messages
            | beam.WindowInto(
                window.FixedWindows(10),
                trigger=AfterCount(1),  # AfterCount(4),#AfterProcessingTime
                # allowed_lateness=3,
                accumulation_mode=AccumulationMode.ACCUMULATING)  # ACCUMULATING #DISCARDING
            # | 'Windowsing' >> beam.WindowInto(window.FixedWindows(10, 5))
            | 'TakeKeyPairWithOne' >> beam.MapTuple(lambda k, v: (k, 1))
            | 'Grouping' >> beam.GroupByKey()
            | 'Sum' >> beam.MapTuple(lambda k, v: (k, sum(v)))
        )
        output = (
            counts
            | 'Print' >> beam.ParDo(print)
            # | 'WriteText' >> beam.io.WriteToText('/home/StatOut',file_name_suffix='.txt')
        )
There are a couple of known issues that you might be running into.
Beam's portable DirectRunner currently does not fully support streaming. The relevant Jira to follow is https://issues.apache.org/jira/browse/BEAM-7514
Beam's portable runners (including the DirectRunner) have a known issue where streaming sources do not properly emit messages. Hence the max_num_records or max_read_time argument has to be provided to convert such sources to bounded sources. The relevant Jira to follow is https://issues.apache.org/jira/browse/BEAM-11998.
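As a rough sketch of that workaround, the helper below builds the keyword arguments for a bounded read. The helper name bounded_kafka_read_kwargs is hypothetical (it is not part of Beam); only consumer_config, topics, max_num_records and max_read_time are ReadFromKafka parameters, and I am assuming max_read_time is given in seconds.

```python
def bounded_kafka_read_kwargs(consumer_config, topics,
                              max_num_records=None, max_read_time=None):
    """Hypothetical helper: keyword arguments that make ReadFromKafka bounded.

    At least one limit must be set, otherwise the source stays unbounded and
    the portable runners may never emit anything.
    """
    if max_num_records is None and max_read_time is None:
        raise ValueError("Set max_num_records or max_read_time to bound the read")
    kwargs = {"consumer_config": dict(consumer_config), "topics": list(topics)}
    if max_num_records is not None:
        kwargs["max_num_records"] = max_num_records
    if max_read_time is not None:
        kwargs["max_read_time"] = max_read_time  # assumed to be in seconds
    return kwargs


conf = {
    "bootstrap.servers": "192.168.75.158:9092",
    "group.id": "g17",
    "enable.auto.commit": "false",
    "auto.offset.reset": "earliest",
}
# Stop reading after 30 seconds even if fewer records than expected arrive.
read_kwargs = bounded_kafka_read_kwargs(conf, ["LaneIn"], max_read_time=30)
# In the pipeline this would be used as:
#   p | 'ReadKafka' >> ReadFromKafka(**read_kwargs)
```

Using max_read_time rather than max_num_records avoids the hang described in the question when the topic holds fewer messages than the record limit.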
I'm trying to use Apache Beam to create a pipeline that -
Reads messages from a stream
Groups the messages by some id
Appends the message to a file with a corresponding id
The append operation is not atomic: it consists of reading the file's content, appending the new messages in memory, and overwriting the file with the new content. I'm using fixed windowing, and I noticed that sometimes two consecutive windows are fired together, which causes a race when writing the file.
A way to reproduce this behavior is to use FixedWindows(5) (fixed window of 5 seconds) with AfterProcessingTime(20) which causes a window to be fired only every 20 seconds. So, 4 windows contained in this 20 seconds period will be fired together. For example, look at the following code -
from datetime import datetime

import apache_beam as beam
from apache_beam import window
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms.trigger import AfterProcessingTime, Repeatedly

pipeline_options = PipelineOptions(
    streaming=True
)
with beam.Pipeline(options=pipeline_options) as p:
    class SleepAndCon(beam.DoFn):
        def process(self, element, timestamp=beam.DoFn.TimestampParam, window=beam.DoFn.WindowParam):
            # Keep only the time-of-day part of each timestamp for printing.
            window_start = str(window.start.to_utc_datetime()).split()[1]
            window_end = str(window.end.to_utc_datetime()).split()[1]
            timestamp = str(timestamp.to_utc_datetime()).split()[1]
            curr_time = str(datetime.utcnow()).split()[1].split(".")[0]
            print(f"sleep_and_con {curr_time}-{timestamp}-{window_start}-{window_end}")
            return [element]

    def print_end(e):
        print("end")

    (p
     | beam.io.ReadFromPubSub(subscription="...")
     | beam.Map(lambda x: ('a', x))
     | beam.WindowInto(
         window.FixedWindows(5),
         trigger=Repeatedly(AfterProcessingTime(20)),
         accumulation_mode=False
     )
     | beam.GroupByKey()
     | beam.ParDo(SleepAndCon())
     | beam.Map(print_end)
     )
After running and sending 5 messages every 5 seconds, we got the following output:
sleep_and_con 13:58:19-13:57:39.999999-13:57:35-13:57:40
sleep_and_con 13:58:19-13:57:44.999999-13:57:40-13:57:45
sleep_and_con 13:58:19-13:57:49.999999-13:57:45-13:57:50
sleep_and_con 13:58:19-13:57:54.999999-13:57:50-13:57:55
sleep_and_con 13:58:19-13:57:59.999999-13:57:55-13:58:00
end
end
end
end
end
But this behavior also happens to me randomly without the AfterProcessingTime trigger (I haven't managed to work out exactly when).
My question is - is there a way to block a window trigger until a previous window finishes the pipeline ?
Thanks!
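For reference, the window bounds in the output above follow from how fixed windows are assigned. The sketch below is plain Python, not Beam code; fixed_window is a hypothetical helper that mirrors the usual rule (window start = timestamp minus timestamp modulo size). It shows why elements spread over a 20-second trigger delay land in four distinct 5-second windows that then become ready together.

```python
def fixed_window(timestamp, size, offset=0):
    """Return the (start, end) of the fixed window containing `timestamp`."""
    start = timestamp - (timestamp - offset) % size
    return (start, start + size)


# One element per second for 20 seconds, windowed into 5-second windows.
element_timestamps = range(0, 20)
windows = sorted({fixed_window(ts, 5) for ts in element_timestamps})

# Four windows have accumulated by the time a 20-second processing-time
# trigger fires, so their panes are emitted in the same burst.
for start, end in windows:
    print(f"window [{start}, {end})")
```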
I followed the code snippet from the "Slowly updating side input using windowing" example, but it printed nothing.
It only runs the side_input branch instead of the whole pipeline.
I use DirectRunner for this.
import apache_beam as beam
from apache_beam.transforms.periodicsequence import PeriodicImpulse
from apache_beam.transforms.window import TimestampedValue
from apache_beam.transforms import window


def cross_join(left, rights):
    for x in rights:
        yield left, x


if __name__ == '__main__':
    data = list(range(1, 100))
    pattern = 'pat'
    main_interval = 10
    side_interval = 5
    pipeline = beam.Pipeline()
    side_input = (
        pipeline
        | 'PeriodicImpulse' >> PeriodicImpulse(fire_interval=side_interval, apply_windowing=True)
        | 'MapToFileName' >> beam.Map(lambda x: pattern + str(x)))
    main_input = (
        pipeline
        | 'MpImpulse' >> beam.Create(data)
        | 'MapMpToTimestamped' >> beam.Map(lambda src: TimestampedValue(src, src))
        | 'WindowMpInto' >> beam.WindowInto(window.FixedWindows(main_interval)))
    result = (
        main_input
        | 'ApplyCrossJoin' >> beam.FlatMap(cross_join, rights=beam.pvalue.AsIter(side_input))
        | 'log' >> beam.Map(print))
    res = pipeline.run()
    res.wait_until_finish()
thanks and regards.
This is because the Python FnApiRunner does not yet support streaming mode. It tries to run PeriodicImpulse to completion before executing the next stage of the pipeline. This is currently being worked on in BEAM-7514.
Unfortunately, it looks like the BundleBasedDirectRunner which can do streaming in Python has issues with PeriodicImpulse as well.
You could try using a "real" runner, e.g. Flink in local mode, by passing runner='Flink' to your pipeline creation.
Note also that your side input needs to be windowed as well, and the side and main inputs will be lined up by window.
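To illustrate what "lined up by window" means, here is a plain-Python sketch, not Beam code. It assumes, as far as I understand Beam's default mapping, that a main-input window is matched to the side-input window containing the main window's last timestamp (its end minus one microsecond). The helper names are hypothetical and times are integer microseconds.

```python
SECOND = 1_000_000  # microseconds


def fixed_window(ts_us, size_us):
    """(start, end) of the fixed window containing ts_us."""
    start = ts_us - ts_us % size_us
    return (start, start + size_us)


def side_window_for(main_window, side_size_us):
    """Side-input window matched to a main window (assumed default mapping)."""
    max_timestamp = main_window[1] - 1  # last instant inside the main window
    return fixed_window(max_timestamp, side_size_us)


# Main input uses 10-second windows, side input uses 5-second windows,
# as with main_interval=10 and side_interval=5 in the question.
main = fixed_window(3 * SECOND, 10 * SECOND)
side = side_window_for(main, 5 * SECOND)
print(main, "->", side)
```

Under this assumption each 10-second main window only sees the side-input values from the second half of its interval, which is worth keeping in mind when choosing the two intervals.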
I need some help. I have a simple task, reading from Pub/Sub and writing batch files to GCS, but I'm struggling with fileio.WriteToFiles.
with beam.Pipeline(options=pipeline_options) as p:
    input = (p
             | 'ReadData' >> beam.io.ReadFromPubSub(topic=known_args.input_topic).with_output_types(bytes)
             | "Decode" >> beam.Map(lambda x: x.decode('utf-8'))
             | 'Parse' >> beam.Map(parse_json)
             | ' data w' >> beam.WindowInto(
                 FixedWindows(60),
                 accumulation_mode=AccumulationMode.DISCARDING
             ))
    event_data = (input
                  | 'filter events' >> beam.Filter(lambda x: x['t'] == 'event')
                  | 'encode et' >> beam.Map(lambda x: json.dumps(x))
                  | 'write events to file' >> fileio.WriteToFiles(
                      path='gs://extention/ga_analytics/events/', shards=0))
I need one file each time my window fires, but the number of files equals the number of messages from Pub/Sub. Can anyone help me?
current output files
But I need only one file.
I recently ran into this issue and dug into the source code:
fileio.WriteToFiles will attempt to output each element in the bundle as an individual file. WriteToFiles will only fall back to writing sharded files if the number of elements exceeds the number of writers.
To force a single file to be created for all elements in the window, set max_writers_per_bundle to 0
WriteToFiles(shards=1, max_writers_per_bundle=0)
You can pass WriteToFiles(shards=1, ...) to limit the output to a single shard per window and triggering.
I am attempting to port the "slowly updating global window side inputs" example in the documentation from Java to Python, and I am stuck on what the Python equivalent of AfterProcessingTime.pastFirstElementInPane() is. For the map I've done something like this:
class ApiKeys(beam.DoFn):
    def process(self, elm) -> Iterable[Dict[str, str]]:
        yield TimestampedValue(
            {"<api_key_1>": "<account_id_1>", "<api_key_2>": "<account_id_2>",},
            elm,
        )

map = beam.pvalue.AsSingleton(
    p
    | "trigger pipeline" >> beam.Create([None])
    | "define schedule"
    >> beam.Map(
        lambda _: (
            0,  # would be timestamp.Timestamp.now() in production
            20,  # would be timestamp.MAX_TIMESTAMP in production
            1,  # would be around 1 hour or so in production
        )
    )
    | "GenSequence"
    >> PeriodicSequence()
    | "ApplyWindowing"
    >> beam.WindowInto(
        beam.window.GlobalWindows(),
        trigger=Repeatedly(Always(), AfterProcessingTime(???)),
        accumulation_mode=AccumulationMode.DISCARDING,
    )
    | "api_keys" >> beam.ParDo(ApiKeys())
)
I am hoping to use this as a Dict[str, str] input to a downstream function that will have windows of 60 seconds, merging with this one that I hope to update on an hourly basis.
The point is to run this on google cloud dataflow (where we currently just re-release it to update the api_keys).
I've pasted the java example from the documentation below for convenience sake:
public static void sideInputPatterns() {
  // This pipeline uses View.asSingleton for a placeholder external service.
  // Run in debug mode to see the output.
  Pipeline p = Pipeline.create();

  // Create a side input that updates each second.
  PCollectionView<Map<String, String>> map =
      p.apply(GenerateSequence.from(0).withRate(1, Duration.standardSeconds(5L)))
          .apply(
              Window.<Long>into(new GlobalWindows())
                  .triggering(Repeatedly.forever(AfterProcessingTime.pastFirstElementInPane()))
                  .discardingFiredPanes())
          .apply(
              ParDo.of(
                  new DoFn<Long, Map<String, String>>() {
                    @ProcessElement
                    public void process(
                        @Element Long input, OutputReceiver<Map<String, String>> o) {
                      // Replace map with test data from the placeholder external service.
                      // Add external reads here.
                      o.output(PlaceholderExternalService.readTestData());
                    }
                  }))
          .apply(View.asSingleton());

  // Consume side input. GenerateSequence generates test data.
  // Use a real source (like PubSubIO or KafkaIO) in production.
  p.apply(GenerateSequence.from(0).withRate(1, Duration.standardSeconds(1L)))
      .apply(Window.into(FixedWindows.of(Duration.standardSeconds(1))))
      .apply(Sum.longsGlobally().withoutDefaults())
      .apply(
          ParDo.of(
                  new DoFn<Long, KV<Long, Long>>() {
                    @ProcessElement
                    public void process(ProcessContext c) {
                      Map<String, String> keyMap = c.sideInput(map);
                      c.outputWithTimestamp(KV.of(1L, c.element()), Instant.now());
                      LOG.debug(
                          "Value is {}, key A is {}, and key B is {}.",
                          c.element(),
                          keyMap.get("Key_A"),
                          keyMap.get("Key_B"));
                    }
                  })
              .withSideInputs(map));
}

/** Placeholder class that represents an external service generating test data. */
public static class PlaceholderExternalService {
  public static Map<String, String> readTestData() {
    Map<String, String> map = new HashMap<>();
    Instant now = Instant.now();
    DateTimeFormatter dtf = DateTimeFormat.forPattern("HH:MM:SS");
    map.put("Key_A", now.minus(Duration.standardSeconds(30)).toString(dtf));
    map.put("Key_B", now.minus(Duration.standardSeconds(30)).toString());
    return map;
  }
}
Any ideas as to how to emulate this example would be enormously appreciated, I've spent literally days on this issue now :(
Update #2 based on @AlexanderMoraes
So, I've tried changing it according to my understanding of your suggestions:
main_window_size = 5
trigger_interval = 30
side_input = beam.pvalue.AsSingleton(
    p
    | "trigger pipeline" >> beam.Create([None])
    | "define schedule"
    >> beam.Map(
        lambda _: (
            0,  # timestamp.Timestamp.now().__float__(),
            60,  # timestamp.Timestamp.now().__float__() + 30.0,
            trigger_interval,  # fire_interval
        )
    )
    | "GenSequence" >> PeriodicSequence()
    | "api_keys" >> beam.ParDo(ApiKeys())
    | "window"
    >> beam.WindowInto(
        beam.window.GlobalWindows(),
        trigger=Repeatedly(AfterProcessingTime(window_size)),
        accumulation_mode=AccumulationMode.DISCARDING,
    )
)
But when combining this with another pipeline with windowing set to something smaller than trigger_interval I am unable to use the dictionary as a singleton because for some reason they are duplicated:
ValueError: PCollection of size 2 with more than one element accessed as a singleton view. First two elements encountered are "{'<api_key_1>': '<account_id_1>', '<api_key_2>': '<account_id_2>'}", "{'<api_key_1>': '<account_id_1>', '<api_key_2>': '<account_id_2>'}". [while running 'Pair with AccountIDs']
Is there some way to clarify that the singleton output should ignore whatever came before it?
The title of the question "slowly updating side inputs" refers to the documentation, which already has a Python version of the code. However, the code you provided is from "updating global window side inputs", which just has the Java version for the code. So I will be addressing an answer for the second one.
You cannot reproduce AfterProcessingTime.pastFirstElementInPane() directly in Python. This function creates a trigger, and triggers determine when to emit the results of each window (referred to as a pane). In your case, the call AfterProcessingTime.pastFirstElementInPane() creates a trigger that fires when the current processing time passes the processing time at which the trigger saw the first element in a pane, here. In Python this is achieved using AfterWatermark and AfterProcessingTime().
Below, there are two pieces of code one in Java and another one in Python. Thus, you can understand more about each one's usage. Both examples set a time-based trigger which emits results one minute after the first element of the window has been processed. Also, the accumulation mode is set for not accumulating the results (Java: discardingFiredPanes() and Python: accumulation_mode=AccumulationMode.DISCARDING).
1- Java:
PCollection<String> pc = ...;
pc.apply(Window.<String>into(FixedWindows.of(1, TimeUnit.MINUTES))
    .triggering(AfterProcessingTime.pastFirstElementInPane()
        .plusDelayOf(Duration.standardMinutes(1)))
    .discardingFiredPanes());
2- Python: the trigger configuration is the same as described in point 1
pcollection | WindowInto(
    FixedWindows(1 * 60),
    trigger=AfterProcessingTime(1 * 60),
    accumulation_mode=AccumulationMode.DISCARDING)
The examples above were taken from the documentation.
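To make the trigger semantics concrete, here is a small plain-Python simulation, not Beam code, of a repeated processing-time trigger in discarding mode within a single window. It is an idealized model under my own assumptions: simulate_panes is a hypothetical helper, arrival times are processing times in seconds, and a pane fires exactly `delay` seconds after its first element arrived.

```python
def simulate_panes(arrival_times, delay):
    """Group arrivals into panes fired `delay` seconds after each pane's first element.

    Returns a list of (fire_time, elements_in_pane). Discarding mode is modelled
    by starting every pane empty, so no element appears in two panes.
    """
    panes = []
    current = []
    fire_at = None
    for t in sorted(arrival_times):
        if current and t >= fire_at:
            # The timer for the open pane expired before this element arrived.
            panes.append((fire_at, current))
            current = []
        if not current:
            # First element of a new pane starts the processing-time timer.
            fire_at = t + delay
        current.append(t)
    if current:
        panes.append((fire_at, current))
    return panes


panes = simulate_panes([0, 10, 59, 61, 200], delay=60)
for fire_time, elements in panes:
    print(fire_time, elements)
```

With a 60-second delay the first three arrivals share one pane, while the elements arriving at 61 and 200 each start a new pane with its own timer.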
AsIter() instead of AsSingleton() worked for me.
We have an app with users; each user uses our app for something like 10-40 minutes per session, and I would like to count the distribution/occurrences of events happening per such session, based on specific events having happened (e.g. "this user converted", "this user had a problem last session", "this user had a successful last session").
(After this I'd like to count these higher-level events per day, but that's a separate question)
For this I've been looking into session windows; but all docs seem geared towards global session windows, but I'd like to create them per-user (which is also a natural partitioning).
I'm having trouble finding docs (python preferred) on how to do this. Could you point me in the right direction?
Or in other words: How do I create per-user per-session windows that can output more structured (enriched) events?
What I have
class DebugPrinter(beam.DoFn):
    """Just prints the element with logging"""
    def process(self, element, window=beam.DoFn.WindowParam):
        _, x = element
        logging.info(">>> Received %s %s with window=%s", x['jsonPayload']['value'], x['timestamp'], window)
        yield element

def sum_by_event_type(user_session_events):
    logging.debug("Received %i events: %s", len(user_session_events), user_session_events)
    d = {}
    for key, group in groupby(user_session_events, lambda e: e['jsonPayload']['value']):
        d[key] = len(list(group))
    logging.info("After counting: %s", d)
    return d

# ...
by_user = valid \
    | 'keyed_on_user_id' >> beam.Map(lambda x: (x['jsonPayload']['userId'], x))
session_gap = 5 * 60  # [s]; 5 minutes
user_sessions = by_user \
    | 'user_session_window' >> beam.WindowInto(beam.window.Sessions(session_gap),
                                               timestamp_combiner=beam.window.TimestampCombiner.OUTPUT_AT_EOW) \
    | 'debug_printer' >> beam.ParDo(DebugPrinter()) \
    | beam.CombinePerKey(sum_by_event_type)
What it outputs
INFO:root:>>> Received event_1 2019-03-12T08:54:29.200Z with window=[1552380869.2, 1552381169.2)
INFO:root:>>> Received event_2 2019-03-12T08:54:29.200Z with window=[1552380869.2, 1552381169.2)
INFO:root:>>> Received event_3 2019-03-12T08:54:30.400Z with window=[1552380870.4, 1552381170.4)
INFO:root:>>> Received event_4 2019-03-12T08:54:36.300Z with window=[1552380876.3, 1552381176.3)
INFO:root:>>> Received event_5 2019-03-12T08:54:38.100Z with window=[1552380878.1, 1552381178.1)
So as you can see; the Session() window doesn't expand the Window, but groups only very close events together... What's being done wrong?
You can get it to work by adding a Group By Key transform after the windowing. You have assigned keys to the records but haven't actually grouped them together by key and session windowing (which works per-key) does not know that these events need to be merged together.
To confirm this I did a reproducible example with some in-memory dummy data (to isolate Pub/Sub from the problem and be able to test it more quickly). All five events will have the same key or user_id but they will "arrive" sequentially 1, 2, 4 and 8 seconds apart from each other. As I use session_gap of 5 seconds I expect the first 4 elements to be merged into the same session. The 5th event will take 8 seconds after the 4th one so it has to be relegated to the next session (gap over 5s). Data is created like this:
data = [{'user_id': 'Thanos', 'value': 'event_{}'.format(event), 'timestamp': time.time() + 2**event} for event in range(5)]
We use beam.Create(data) to initialize the pipeline and beam.window.TimestampedValue to assign the "fake" timestamps. Again, we are just simulating streaming behavior with this. After that, we create the key-value pairs using the user_id field, window into window.Sessions, and add the missing beam.GroupByKey() step. Finally, we log the results with a slightly modified version of DebugPrinter. The pipeline now looks like this:
events = (p
          | 'Create Events' >> beam.Create(data)
          | 'Add Timestamps' >> beam.Map(lambda x: beam.window.TimestampedValue(x, x['timestamp']))
          | 'keyed_on_user_id' >> beam.Map(lambda x: (x['user_id'], x))
          | 'user_session_window' >> beam.WindowInto(window.Sessions(session_gap),
                                                     timestamp_combiner=window.TimestampCombiner.OUTPUT_AT_EOW)
          | 'Group' >> beam.GroupByKey()
          | 'debug_printer' >> beam.ParDo(DebugPrinter()))
where DebugPrinter is:
class DebugPrinter(beam.DoFn):
    """Just prints the element with logging"""
    def process(self, element, window=beam.DoFn.WindowParam):
        for x in element[1]:
            logging.info(">>> Received %s %s with window=%s", x['value'], x['timestamp'], window)
        yield element
If we test this without grouping by key we get the same behavior:
INFO:root:>>> Received event_0 1554117323.0 with window=[1554117323.0, 1554117328.0)
INFO:root:>>> Received event_1 1554117324.0 with window=[1554117324.0, 1554117329.0)
INFO:root:>>> Received event_2 1554117326.0 with window=[1554117326.0, 1554117331.0)
INFO:root:>>> Received event_3 1554117330.0 with window=[1554117330.0, 1554117335.0)
INFO:root:>>> Received event_4 1554117338.0 with window=[1554117338.0, 1554117343.0)
But after adding it, the windows now work as expected. Events 0 to 3 are merged together in an extended 12s session window. Event 4 belongs to a separate 5s session.
INFO:root:>>> Received event_0 1554118377.37 with window=[1554118377.37, 1554118389.37)
INFO:root:>>> Received event_1 1554118378.37 with window=[1554118377.37, 1554118389.37)
INFO:root:>>> Received event_3 1554118384.37 with window=[1554118377.37, 1554118389.37)
INFO:root:>>> Received event_2 1554118380.37 with window=[1554118377.37, 1554118389.37)
INFO:root:>>> Received event_4 1554118392.37 with window=[1554118392.37, 1554118397.37)
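The merged windows can be checked by hand with a small plain-Python sketch of session merging for a single key. This is not Beam's implementation; merge_sessions is a hypothetical helper that gives each event a proto-window [ts, ts + gap) and merges proto-windows that overlap, which is the behavior described above.

```python
def merge_sessions(timestamps, gap):
    """Merge per-event windows [ts, ts + gap) into sessions for one key."""
    sessions = []
    for ts in sorted(timestamps):
        start, end = ts, ts + gap
        if sessions and start < sessions[-1][1]:
            # Overlaps the open session: extend it to cover this event's window.
            sessions[-1] = (sessions[-1][0], max(sessions[-1][1], end))
        else:
            sessions.append((start, end))
    return sessions


# Same spacing as the test data: offsets 1, 2, 4, 8 and 16 seconds from "now".
offsets = [2 ** event for event in range(5)]
sessions = merge_sessions(offsets, gap=5)
print(sessions)
```

The first four events collapse into one window spanning 12 seconds and the fifth gets its own 5-second window, matching the logged bounds relative to the first event's timestamp.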
Full code here
Two additional things are worth mentioning. The first is that, even when running this locally on a single machine with the DirectRunner, records can arrive out of order (event_3 is processed before event_2 in my case). This is done on purpose to simulate distributed processing, as documented here.
The last one is that if you get a stack trace like this:
TypeError: Cannot convert GlobalWindow to apache_beam.utils.windowed_value._IntervalWindowBase [while running 'Write Results/Write/WriteImpl/WriteBundles']
downgrade from 2.10.0/2.11.0 SDK to 2.9.0. See this answer for example.