Apache Beam per-user session windows are unmerged - python

We have an app that has users; each user uses our app for something like 10-40 minutes per go, and I would like to count the distribution/occurrences of events happening per such session, based on specific events having happened (e.g. "this user converted", "this user had a problem last session", "this user had a successful last session").
(After this I'd like to count these higher-level events per day, but that's a separate question.)
For this I've been looking into session windows, but all the docs seem geared towards global session windows, whereas I'd like to create them per user (which is also a natural partitioning).
I'm having trouble finding docs (python preferred) on how to do this. Could you point me in the right direction?
Or in other words: How do I create per-user per-session windows that can output more structured (enriched) events?
What I have
class DebugPrinter(beam.DoFn):
    """Just prints the element with logging"""
    def process(self, element, window=beam.DoFn.WindowParam):
        _, x = element
        logging.info(">>> Received %s %s with window=%s", x['jsonPayload']['value'], x['timestamp'], window)
        yield element

def sum_by_event_type(user_session_events):
    logging.debug("Received %i events: %s", len(user_session_events), user_session_events)
    d = {}
    for key, group in groupby(user_session_events, lambda e: e['jsonPayload']['value']):
        d[key] = len(list(group))
    logging.info("After counting: %s", d)
    return d

# ...

by_user = valid \
    | 'keyed_on_user_id' >> beam.Map(lambda x: (x['jsonPayload']['userId'], x))

session_gap = 5 * 60  # [s]; 5 minutes

user_sessions = by_user \
    | 'user_session_window' >> beam.WindowInto(beam.window.Sessions(session_gap),
                                               timestamp_combiner=beam.window.TimestampCombiner.OUTPUT_AT_EOW) \
    | 'debug_printer' >> beam.ParDo(DebugPrinter()) \
    | beam.CombinePerKey(sum_by_event_type)
What it outputs
INFO:root:>>> Received event_1 2019-03-12T08:54:29.200Z with window=[1552380869.2, 1552381169.2)
INFO:root:>>> Received event_2 2019-03-12T08:54:29.200Z with window=[1552380869.2, 1552381169.2)
INFO:root:>>> Received event_3 2019-03-12T08:54:30.400Z with window=[1552380870.4, 1552381170.4)
INFO:root:>>> Received event_4 2019-03-12T08:54:36.300Z with window=[1552380876.3, 1552381176.3)
INFO:root:>>> Received event_5 2019-03-12T08:54:38.100Z with window=[1552380878.1, 1552381178.1)
So as you can see, the Sessions window doesn't expand; it only groups events that are very close together. What am I doing wrong?

You can get it to work by adding a GroupByKey transform after the windowing. You have assigned keys to the records but haven't actually grouped them by key, so session windowing (which works per key) does not know that these events need to be merged together.
To confirm this I built a reproducible example with some in-memory dummy data (to isolate Pub/Sub from the problem and be able to test it more quickly). All five events have the same key (user_id), but they "arrive" sequentially 1, 2, 4 and 8 seconds apart from each other. As I use a session_gap of 5 seconds, I expect the first 4 elements to be merged into the same session. The 5th event arrives 8 seconds after the 4th one, so it has to be relegated to the next session (gap over 5s). Data is created like this:
data = [{'user_id': 'Thanos', 'value': 'event_{}'.format(event), 'timestamp': time.time() + 2**event} for event in range(5)]
We use beam.Create(data) to initialize the pipeline and beam.window.TimestampedValue to assign the "fake" timestamps. Again, we are just simulating streaming behavior with this. After that, we create the key-value pairs from the user_id field, we window into window.Sessions, and we add the missing beam.GroupByKey() step. Finally, we log the results with a slightly modified version of DebugPrinter. The pipeline now looks like this:
events = (p
    | 'Create Events' >> beam.Create(data)
    | 'Add Timestamps' >> beam.Map(lambda x: beam.window.TimestampedValue(x, x['timestamp']))
    | 'keyed_on_user_id' >> beam.Map(lambda x: (x['user_id'], x))
    | 'user_session_window' >> beam.WindowInto(window.Sessions(session_gap),
                                               timestamp_combiner=window.TimestampCombiner.OUTPUT_AT_EOW)
    | 'Group' >> beam.GroupByKey()
    | 'debug_printer' >> beam.ParDo(DebugPrinter()))
where DebugPrinter is:
class DebugPrinter(beam.DoFn):
    """Just prints the element with logging"""
    def process(self, element, window=beam.DoFn.WindowParam):
        for x in element[1]:
            logging.info(">>> Received %s %s with window=%s", x['value'], x['timestamp'], window)
        yield element
If we test this without grouping by key we get the same behavior:
INFO:root:>>> Received event_0 1554117323.0 with window=[1554117323.0, 1554117328.0)
INFO:root:>>> Received event_1 1554117324.0 with window=[1554117324.0, 1554117329.0)
INFO:root:>>> Received event_2 1554117326.0 with window=[1554117326.0, 1554117331.0)
INFO:root:>>> Received event_3 1554117330.0 with window=[1554117330.0, 1554117335.0)
INFO:root:>>> Received event_4 1554117338.0 with window=[1554117338.0, 1554117343.0)
But after adding it, the windows now work as expected. Events 0 to 3 are merged together in an extended 12s session window. Event 4 belongs to a separate 5s session.
INFO:root:>>> Received event_0 1554118377.37 with window=[1554118377.37, 1554118389.37)
INFO:root:>>> Received event_1 1554118378.37 with window=[1554118377.37, 1554118389.37)
INFO:root:>>> Received event_3 1554118384.37 with window=[1554118377.37, 1554118389.37)
INFO:root:>>> Received event_2 1554118380.37 with window=[1554118377.37, 1554118389.37)
INFO:root:>>> Received event_4 1554118392.37 with window=[1554118392.37, 1554118397.37)
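The merging seen in these logs can be imitated in plain Python. This is only an illustrative sketch, not Beam's implementation: the helper `merge_sessions` is hypothetical, and it assumes each event first gets its own window [timestamp, timestamp + gap) and that overlapping windows of the same key are then merged, which is the behavior the logs above show. The timestamps are the offsets 1, 2, 4, 8 and 16 produced by `2**event` in the test data.

```python
from collections import defaultdict


def merge_sessions(events, gap):
    """Merge per-key proto-windows [ts, ts + gap) that overlap (hypothetical helper)."""
    by_key = defaultdict(list)
    for key, ts in events:
        by_key[key].append(ts)

    sessions = {}
    for key, stamps in by_key.items():
        windows = []
        for ts in sorted(stamps):
            start, end = ts, ts + gap  # window initially assigned to one event
            if windows and start < windows[-1][1]:
                # Overlaps the previous window of this key: extend that window.
                windows[-1] = (windows[-1][0], max(windows[-1][1], end))
            else:
                # Gap exceeded: a new session starts.
                windows.append((start, end))
        sessions[key] = windows
    return sessions


# Same shape as the dummy data: one user, offsets 1, 2, 4, 8, 16 seconds.
events = [("Thanos", 2 ** i) for i in range(5)]
merged = merge_sessions(events, gap=5)
print(merged)
```

With a 5-second gap, the first four events collapse into one 12-second window and the last event keeps its own 5-second window, matching the log output above.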
Full code here
Two additional things are worth mentioning. The first one is that, even when running this locally on a single machine with the DirectRunner, records can arrive out of order (event_3 is processed before event_2 in my case). This is done on purpose to simulate distributed processing, as documented here.
The last one is that if you get a stack trace like this:
TypeError: Cannot convert GlobalWindow to apache_beam.utils.windowed_value._IntervalWindowBase [while running 'Write Results/Write/WriteImpl/WriteBundles']
downgrade from 2.10.0/2.11.0 SDK to 2.9.0. See this answer for example.
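A side note on the counting step from the question: `itertools.groupby` only groups consecutive items with equal keys, so `sum_by_event_type` would undercount whenever a session's events are not sorted by event type. A minimal sketch of an order-independent alternative, assuming each event is a dict shaped like the question's (`e['jsonPayload']['value']`); the sample `session` list is made up for illustration:

```python
from collections import Counter


def sum_by_event_type(user_session_events):
    """Count events per type regardless of the order in which they arrive."""
    return dict(Counter(e['jsonPayload']['value'] for e in user_session_events))


# Made-up session with a repeated, non-adjacent event type.
session = [{'jsonPayload': {'value': v}} for v in ['event_1', 'event_2', 'event_1']]
counts = sum_by_event_type(session)
print(counts)
```

This could be applied to the list of events of each grouped session, for example in a Map step after the GroupByKey.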

Related

Apache Beam - Fire one window at a time

I'm trying to use Apache Beam to create a pipeline that -
Reads messages from a stream
Groups the messages by some id
Appends the message to a file with a corresponding id
The append operation is not atomic: it's composed of reading the file's content, appending the new messages in memory, and overwriting the file with the new content. I'm using fixed windowing, and I noticed that sometimes two consecutive windows are fired together, which causes a race when writing the file.
A way to reproduce this behavior is to use FixedWindows(5) (fixed windows of 5 seconds) with AfterProcessingTime(20), which causes windows to be fired only every 20 seconds. So the 4 windows contained in this 20-second period will be fired together. For example, look at the following code -
pipeline_options = PipelineOptions(
    streaming=True
)
with beam.Pipeline(options=pipeline_options) as p:
    class SleepAndCon(beam.DoFn):
        def process(self, element, timestamp=beam.DoFn.TimestampParam, window=beam.DoFn.WindowParam):
            window_start = str(window.start.to_utc_datetime()).split()[1]
            window_end = str(window.end.to_utc_datetime()).split()[1]
            timestamp = str(timestamp.to_utc_datetime()).split()[1]
            curr_time = str(datetime.utcnow()).split()[1].split(".")[0]
            print(f"sleep_and_con {curr_time}-{timestamp}-{window_start}-{window_end}")
            return [element]

    def print_end(e):
        print("end")

    (p
        | beam.io.ReadFromPubSub(subscription="...")
        | beam.Map(lambda x: ('a', x))
        | beam.WindowInto(
            window.FixedWindows(5),
            trigger=Repeatedly(AfterProcessingTime(20)),
            accumulation_mode=False
        )
        | beam.GroupByKey()
        | beam.ParDo(SleepAndCon())
        | beam.Map(print_end)
    )
After running and sending 5 messages every 5 seconds, we got the following output:
sleep_and_con 13:58:19-13:57:39.999999-13:57:35-13:57:40
sleep_and_con 13:58:19-13:57:44.999999-13:57:40-13:57:45
sleep_and_con 13:58:19-13:57:49.999999-13:57:45-13:57:50
sleep_and_con 13:58:19-13:57:54.999999-13:57:50-13:57:55
sleep_and_con 13:58:19-13:57:59.999999-13:57:55-13:58:00
end
end
end
end
end
But this behavior also happens to me without the AfterProcessingTime trigger, at random (I haven't managed to work out exactly when it happens).
My question is: is there a way to block a window trigger until the previous window has finished going through the pipeline?
Thanks!

Beam Kafka Streaming Input, No Output to print or text

I'm trying to count Kafka messages per key using the DirectRunner.
If I set max_num_records=20 in ReadFromKafka, I can see the results printed or written to text,
like:
('2102', 5)
('2706', 5)
('2103', 5)
('2707', 5)
But without max_num_records, or if max_num_records is larger than the message count in the Kafka topic, the program keeps running but nothing is output.
If I try to output with beam.io.WriteToText, an empty temp folder is created, like:
beam-temp-StatOut-d16768eadec511eb8bd897b012f36e97
Terminal shows:
2.30.0: Pulling from apache/beam_java8_sdk
Digest: sha256:720144b98d9cb2bcb21c2c0741d693b2ed54f85181dbd9963ba0aa9653072c19
Status: Image is up to date for apache/beam_java8_sdk:2.30.0
docker.io/apache/beam_java8_sdk:2.30.0
If I put 'enable.auto.commit': 'true' in the Kafka consumer config, the messages are committed and other clients from the same group can't read them, so I assume it's reading successfully, just not processing or outputting.
I tried fixed-time and sliding-time windowing, with and without different triggers; nothing changes.
I tried the Flink runner and got the same result as with the DirectRunner.
I have no idea what I did wrong. Any help?
environment:
centos 7
anaconda
python 3.8.8
java 1.8.0_292
beam 2.30
code as below:
direct_options = PipelineOptions([
    "--runner=DirectRunner",
    "--environment_type=LOOPBACK",
    "--streaming",
])
direct_options.view_as(SetupOptions).save_main_session = True
direct_options.view_as(StandardOptions).streaming = True

conf = {'bootstrap.servers': '192.168.75.158:9092',
        'group.id': "g17",
        'enable.auto.commit': 'false',
        'auto.offset.reset': 'earliest'}

if __name__ == '__main__':
    with beam.Pipeline(options=direct_options) as p:
        msg_kv_bytes = (p
            | 'ReadKafka' >> ReadFromKafka(consumer_config=conf, topics=['LaneIn']))
        messages = msg_kv_bytes | 'Decode' >> beam.MapTuple(lambda k, v: (k.decode('utf-8'), v.decode('utf-8')))
        counts = (
            messages
            | beam.WindowInto(
                window.FixedWindows(10),
                trigger=AfterCount(1),  # AfterCount(4), AfterProcessingTime
                # allowed_lateness=3,
                accumulation_mode=AccumulationMode.ACCUMULATING)  # ACCUMULATING / DISCARDING
            # | 'Windowsing' >> beam.WindowInto(window.FixedWindows(10, 5))
            | 'TakeKeyPairWithOne' >> beam.MapTuple(lambda k, v: (k, 1))
            | 'Grouping' >> beam.GroupByKey()
            | 'Sum' >> beam.MapTuple(lambda k, v: (k, sum(v)))
        )
        output = (
            counts
            | 'Print' >> beam.ParDo(print)
            # | 'WriteText' >> beam.io.WriteToText('/home/StatOut',file_name_suffix='.txt')
        )
There are a couple of known issues that you might be running into.
Beam's portable DirectRunner currently does not fully support streaming. The relevant Jira to follow is https://issues.apache.org/jira/browse/BEAM-7514
Beam's portable runners (including the DirectRunner) have a known issue where streaming sources do not properly emit messages. Hence the max_num_records or max_read_time arguments have to be provided to convert such sources to bounded sources. The relevant Jira to follow is https://issues.apache.org/jira/browse/BEAM-11998.

Write just one file per window with Apache Beam Python `WriteToFiles` transform

I need some help. I have a trivial task, reading from Pub/Sub and writing batch files to GCS, but I'm struggling with fileio.WriteToFiles:
with beam.Pipeline(options=pipeline_options) as p:
    input = (p | 'ReadData' >> beam.io.ReadFromPubSub(topic=known_args.input_topic).with_output_types(bytes)
               | "Decode" >> beam.Map(lambda x: x.decode('utf-8'))
               | 'Parse' >> beam.Map(parse_json)
               | ' data w' >> beam.WindowInto(
                   FixedWindows(60),
                   accumulation_mode=AccumulationMode.DISCARDING
               ))
    event_data = (input
                  | 'filter events' >> beam.Filter(lambda x: x['t'] == 'event')
                  | 'encode et' >> beam.Map(lambda x: json.dumps(x))
                  | 'write events to file' >> fileio.WriteToFiles(
                      path='gs://extention/ga_analytics/events/', shards=0))
I need one file after my window fires, but currently the number of output files is equal to the number of messages from Pub/Sub. Can anyone help me? I need only one file.
I recently ran into this issue and dug into the source code:
fileio.WriteToFiles will attempt to output each element in the bundle as an individual file. WriteToFiles will only fall back to writing to sharded files if the number of elements exceeds the number of writers.
To force a single file to be created for all elements in the window, set max_writers_per_bundle to 0:
WriteToFiles(shards=1, max_writers_per_bundle=0)
You can pass WriteToFiles(shards=1, ...) to limit the output to a single shard per window and triggering.

Is there a way to group records by fields performing calculations on other fields in a dataflow pipeline for a period?

I'm trying to create a streaming pipeline with Dataflow that reads messages from a PubSub topic and write grouped results to a BigQuery table. I don't want to use any template. For the moment I just want to create a pipeline in a Python3 script executed from a Google VM Instance to carry out a process of transformation of the data that arrives from Pubsub (the structure of the messages is what the table expects). In this process I want to group by the fields "A" and "B" and calculate the total occurrences, the sum of the field "C" and the average of the field "D".
The messages published in the PubSub topic come as follows:
{"A":"Alpha", "B":"V1", "C":3, "D":12}
{"A":"Alpha", "B":"V1", "C":5, "D":14}
{"A":"Alpha", "B":"V1", "C":3, "D":22}
{"A":"Beta", "B":"V1", "C":2, "D":6}
{"A":"Beta", "B":"V1", "C":7, "D":19}
{"A":"Beta", "B":"V2", "C":3, "D":10}
{"A":"Beta", "B":"V2", "C":5, "D":12}
The output for these records should be something like this:
{"A-B":"AlphaV1", "Occurs":3, "sum_C":11, "avg_D":16}
{"A-B":"BetaV1", "Occurs":2, "sum_C":9, "avg_D":12.5}
{"A-B":"BetaV2", "Occurs":2, "sum_C":8, "avg_D":11}
How can I define a function in Apache Beam in order to do that process?
Thanks!
You can do all of this with a simple GroupByKey, and a custom aggregation. There is one big question that you'll need to ponder yourself: How do you want to window your data?
You need to window your data, because the runner needs to figure out when to stop waiting for more data on the same key. Happy to chat more about windowing if you get stuck with that.
Here is how you can perform your aggregation, and we just "assume" the windowing:
def compute_keys(elm):
    key = '%s%s' % (elm.get('A'), elm.get('B'))
    return (key, elm)

def perform_aggregations_per_key(key_values):
    key, values = key_values
    values = list(values)  # This will load all values for a single key into memory!
    sum_C = sum(v['C'] for v in values)
    avg_D = sum(v['D'] for v in values) / len(values)
    occurs = len(values)
    return {'A-B': key,
            'Occurs': occurs,
            'sum_C': sum_C,
            'avg_D': avg_D}

my_inputs = (p | ReadFromPubSub(.....))
windowed_inputs = (my_inputs
                   | beam.WindowInto(....))  # You need to window your stream
result = (windowed_inputs
          | beam.Map(compute_keys)
          | beam.GroupByKey()
          | beam.Map(perform_aggregations_per_key))
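If loading all values of a key into memory is a concern, the same numbers can be computed incrementally. The sketch below is plain Python, not a Beam transform: `StatsAccumulator` and `aggregate` are hypothetical names, and the class merely borrows the four method names of beam.CombineFn (create_accumulator, add_input, merge_accumulators, extract_output) on the assumption that it could later be adapted into one. It is run here against the sample messages from the question.

```python
class StatsAccumulator:
    """Plain-Python stand-in that mirrors the CombineFn method names."""

    def create_accumulator(self):
        return (0, 0, 0)  # (count, sum of C, sum of D)

    def add_input(self, acc, elm):
        count, sum_c, sum_d = acc
        return (count + 1, sum_c + elm['C'], sum_d + elm['D'])

    def merge_accumulators(self, accs):
        count = sum_c = sum_d = 0
        for c, sc, sd in accs:
            count, sum_c, sum_d = count + c, sum_c + sc, sum_d + sd
        return (count, sum_c, sum_d)

    def extract_output(self, acc):
        count, sum_c, sum_d = acc
        avg_d = sum_d / count if count else float('nan')
        return {'Occurs': count, 'sum_C': sum_c, 'avg_D': avg_d}


def aggregate(messages):
    """Group by the concatenated A and B fields, keeping one small accumulator per key."""
    fn = StatsAccumulator()
    accs = {}
    for elm in messages:
        key = '%s%s' % (elm['A'], elm['B'])
        accs[key] = fn.add_input(accs.get(key, fn.create_accumulator()), elm)
    return [dict({'A-B': key}, **fn.extract_output(acc)) for key, acc in accs.items()]


messages = [
    {"A": "Alpha", "B": "V1", "C": 3, "D": 12},
    {"A": "Alpha", "B": "V1", "C": 5, "D": 14},
    {"A": "Alpha", "B": "V1", "C": 3, "D": 22},
    {"A": "Beta", "B": "V1", "C": 2, "D": 6},
    {"A": "Beta", "B": "V1", "C": 7, "D": 19},
    {"A": "Beta", "B": "V2", "C": 3, "D": 10},
    {"A": "Beta", "B": "V2", "C": 5, "D": 12},
]
results = aggregate(messages)
for row in results:
    print(row)
```

Only a three-number accumulator is kept per key, rather than the full list of values.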

mySQL Trigger works after console insert, but not after script insert

I have a problem with a trigger.
I set up a trigger to update other tables after an insert into a table.
If I make an insert from the MySQL console, everything works fine, but if I do inserts, even with the same data, from an external Python script, the trigger does nothing, as you can see below.
I tried changing the Definer to 'user'@'%' and 'root'@'%', but it's still doing nothing.
mysql> select vid_visit,vid_money from videos where video_id=487;
+-----------+-----------+
| vid_visit | vid_money |
+-----------+-----------+
| 21 | 0.297 |
+-----------+-----------+
1 row in set (0,01 sec)
mysql> INSERT INTO `table`.`validEvents` ( `id` , `campaigns_id` , `video_id` , `date` , `producer_id` , `distributor_id` , `money_producer` , `money_distributor` , `type` ) VALUES ( NULL , '30', '487', '2010-05-20 01:20:00', '1', '0', '0.009', '0.000', 'PRE' );
Query OK, 1 row affected (0,00 sec)
mysql> select vid_visit,vid_money from videos where video_id=487;
+-----------+-----------+
| vid_visit | vid_money |
+-----------+-----------+
| 22 | 0.306 |
+-----------+-----------+
DROP TRIGGER IF EXISTS `updateVisitAndMoney`//
CREATE TRIGGER `updateVisitAndMoney` BEFORE INSERT ON `validEvents`
FOR EACH ROW BEGIN
    if (NEW.type = 'PRE') THEN
        SET @eventcash=NEW.money_producer + NEW.money_distributor;
        UPDATE campaigns SET cmp_visit_distributed = cmp_visit_distributed + 1 , cmp_money_distributed = cmp_money_distributed + NEW.money_producer + NEW.money_distributor WHERE idcampaigns = NEW.campaigns_id;
        UPDATE offer_producer SET ofp_visit_procesed = ofp_visit_procesed + 1 , ofp_money_procesed = ofp_money_procesed + NEW.money_producer WHERE ofp_video_id = NEW.video_id AND ofp_money_procesed = NEW.campaigns_id;
        UPDATE videos SET vid_visit = vid_visit + 1 , vid_money = vid_money + @eventcash WHERE video_id = NEW.video_id;
        if (NEW.distributor_id != '') then
            UPDATE agreements SET visit_procesed = visit_procesed + 1, money_producer = money_producer + NEW.money_producer, money_distributor = money_distributor + NEW.money_distributor WHERE id_campaigns = NEW.campaigns_id AND id_video = NEW.video_id AND ag_distributor_id = NEW.distributor_id;
            UPDATE eventForDay SET visit = visit + 1, money = money + NEW.money_distributor WHERE date = SYSDATE() AND campaign_id = NEW.campaigns_id AND user_id = NEW.distributor_id;
            UPDATE eventForDay SET visit = visit + 1, money = money + NEW.money_producer WHERE date = SYSDATE() AND campaign_id = NEW.campaigns_id AND user_id = NEW.producer_id;
        ELSE
            UPDATE eventForDay SET visit = visit + 1, money = money + NEW.money_producer WHERE date = SYSDATE() AND campaign_id = NEW.campaigns_id AND user_id = NEW.producer_id;
        END IF;
    END IF;
END
//
I think that it's far more likely that you are encountering an uncaught error, rather than that the trigger is not executing, particularly since it executes successfully from the console.
You need to isolate where the error occurs - in the trigger itself, or in the calling script.
In your python script, print out the SQL statement that python sends to MySQL for execution in order to ensure that it is constructed as you expect - for example, if NEW.type does not equal 'PRE', the trigger will have executed, but will not result in any updates.
Also ensure that you are checking for errors on the insert. I'm not a python programmer, so I can't tell you how that is done, but this seems to be what you're looking for.
If neither of these leads you to the problem, comment out the whole if (NEW.type = 'PRE') THEN block and do a simple modification, such as setting NEW.type to 'debug'. After ensuring that the trigger does in fact execute, retest successively with more of the real code added back in until you isolate the problem.
Also, wrt Marcos comment, I would be surprised if the script didn't auto-commit upon successful completion. Indeed, I would make this statement about any script/language.
For anyone finding this question in the future: I am guessing the solution is setting MySQL autocommit to true.
After opening the connection, run the following query:
SET autocommit = 1
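The effect of a missing commit can be reproduced in a small, self-contained sketch. Note the swap: it uses Python's built-in sqlite3 module as a stand-in for MySQL purely so that it runs anywhere, and the table and trigger are cut-down, hypothetical versions of the ones in the question. The Python DB-API specification (PEP 249) requires auto-commit to be off initially, so whether this matches the asker's actual driver and settings is an assumption.

```python
import os
import sqlite3
import tempfile

# sqlite3 stands in for MySQL here; schema is a simplified, hypothetical version.
db_path = os.path.join(tempfile.mkdtemp(), "demo.db")

writer = sqlite3.connect(db_path)
writer.executescript("""
    CREATE TABLE videos (video_id INTEGER PRIMARY KEY, vid_visit INTEGER);
    CREATE TABLE validEvents (id INTEGER PRIMARY KEY, video_id INTEGER, type TEXT);
    INSERT INTO videos VALUES (487, 21);
    CREATE TRIGGER updateVisit AFTER INSERT ON validEvents
    WHEN NEW.type = 'PRE'
    BEGIN
        UPDATE videos SET vid_visit = vid_visit + 1 WHERE video_id = NEW.video_id;
    END;
""")
writer.commit()

reader = sqlite3.connect(db_path)  # plays the role of the MySQL console


def visits(conn):
    return conn.execute(
        "SELECT vid_visit FROM videos WHERE video_id = 487").fetchone()[0]


# The "script" inserts a row; the trigger fires inside its open transaction.
writer.execute("INSERT INTO validEvents (video_id, type) VALUES (?, ?)", (487, "PRE"))
seen_by_writer = visits(writer)      # the inserting connection sees the trigger's update
seen_before_commit = visits(reader)  # another connection does not see it yet
writer.commit()
seen_after_commit = visits(reader)   # visible to everyone after the commit

print(seen_by_writer, seen_before_commit, seen_after_commit)

writer.close()
reader.close()
```

In this sketch the trigger does run for the scripted insert; its effect only becomes visible to other connections once the transaction is committed.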
Shell commands only run in the console, not on the actual server. You can use UDF or polling to accomplish what you need.
