I'm trying to use Apache Beam to create a pipeline that:
Reads messages from a stream
Groups the messages by some id
Appends the messages to a file with the corresponding id
The append operation is not atomic: it reads the file's content, appends the new messages in memory, and overwrites the file with the new content (roughly like the sketch below). I'm using fixed windowing, and I noticed that sometimes two consecutive windows fire together, which causes a race when writing the file.
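To make the race concrete, here is a minimal sketch of that append step (illustrative only; the path scheme and function name are made up for the example):
def append_messages(file_id, new_messages):
    # Non-atomic append as described above: read the whole file, append the
    # new messages in memory, then overwrite the file. Two windows firing at
    # the same time can interleave these steps and lose messages.
    path = f"/data/messages-{file_id}.txt"  # hypothetical path per id
    try:
        with open(path) as f:
            content = f.read()
    except FileNotFoundError:
        content = ""
    content += "".join(f"{m}\n" for m in new_messages)
    with open(path, "w") as f:  # overwrites the previous content
        f.write(content)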
A way to reproduce this behavior is to use FixedWindows(5) (a fixed window of 5 seconds) with AfterProcessingTime(20), which causes windows to be fired only every 20 seconds, so the 4 windows contained in that 20-second period are fired together. For example, look at the following code:
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms import window
from apache_beam.transforms.trigger import AccumulationMode, AfterProcessingTime, Repeatedly
from datetime import datetime

pipeline_options = PipelineOptions(
    streaming=True
)

with beam.Pipeline(options=pipeline_options) as p:
    class SleepAndCon(beam.DoFn):
        def process(self, element, timestamp=beam.DoFn.TimestampParam, window=beam.DoFn.WindowParam):
            window_start = str(window.start.to_utc_datetime()).split()[1]
            window_end = str(window.end.to_utc_datetime()).split()[1]
            timestamp = str(timestamp.to_utc_datetime()).split()[1]
            curr_time = str(datetime.utcnow()).split()[1].split(".")[0]
            print(f"sleep_and_con {curr_time}-{timestamp}-{window_start}-{window_end}")
            return [element]

    def print_end(e):
        print("end")

    (p
     | beam.io.ReadFromPubSub(subscription="...")
     | beam.Map(lambda x: ('a', x))
     | beam.WindowInto(
         window.FixedWindows(5),
         trigger=Repeatedly(AfterProcessingTime(20)),
         accumulation_mode=AccumulationMode.DISCARDING)
     | beam.GroupByKey()
     | beam.ParDo(SleepAndCon())
     | beam.Map(print_end)
    )
After running it and sending 5 messages, one every 5 seconds, I got the following output:
sleep_and_con 13:58:19-13:57:39.999999-13:57:35-13:57:40
sleep_and_con 13:58:19-13:57:44.999999-13:57:40-13:57:45
sleep_and_con 13:58:19-13:57:49.999999-13:57:45-13:57:50
sleep_and_con 13:58:19-13:57:54.999999-13:57:50-13:57:55
sleep_and_con 13:58:19-13:57:59.999999-13:57:55-13:58:00
end
end
end
end
end
But this behavior also happens without the AfterProcessingTime trigger, seemingly at random (I haven't managed to figure out exactly when it happens).
My question is: is there a way to block a window's trigger until the previous window has finished the pipeline?
Thanks!
I'm trying to count Kafka message keys using the DirectRunner.
If I put max_num_records=20 in ReadFromKafka, I can see results printed or output to text,
like:
('2102', 5)
('2706', 5)
('2103', 5)
('2707', 5)
But without max_num_records, or if max_num_records is larger than the message count in the Kafka topic, the program keeps running but nothing is output.
If I try to output with beam.io.WriteToText, only an empty temp folder is created, like:
beam-temp-StatOut-d16768eadec511eb8bd897b012f36e97
Terminal shows:
2.30.0: Pulling from apache/beam_java8_sdk
Digest: sha256:720144b98d9cb2bcb21c2c0741d693b2ed54f85181dbd9963ba0aa9653072c19
Status: Image is up to date for apache/beam_java8_sdk:2.30.0
docker.io/apache/beam_java8_sdk:2.30.0
If I put 'enable.auto.commit': 'true' in the Kafka consumer config, the messages are committed and other clients from the same group can't read them, so I assume it is reading successfully, just not processing or outputting.
I tried fixed-time and sliding-time windowing, with and without different triggers; nothing changes.
I tried the Flink runner and got the same result as the DirectRunner.
No idea what I did wrong; any help?
environment:
centos 7
anaconda
python 3.8.8
java 1.8.0_292
beam 2.30
code as below:
import apache_beam as beam
from apache_beam.io.kafka import ReadFromKafka
from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions, StandardOptions
from apache_beam.transforms import window
from apache_beam.transforms.trigger import AccumulationMode, AfterCount

direct_options = PipelineOptions([
    "--runner=DirectRunner",
    "--environment_type=LOOPBACK",
    "--streaming",
])
direct_options.view_as(SetupOptions).save_main_session = True
direct_options.view_as(StandardOptions).streaming = True

conf = {'bootstrap.servers': '192.168.75.158:9092',
        'group.id': "g17",
        'enable.auto.commit': 'false',
        'auto.offset.reset': 'earliest'}

if __name__ == '__main__':
    with beam.Pipeline(options=direct_options) as p:
        msg_kv_bytes = (
            p
            | 'ReadKafka' >> ReadFromKafka(consumer_config=conf, topics=['LaneIn']))
        messages = msg_kv_bytes | 'Decode' >> beam.MapTuple(
            lambda k, v: (k.decode('utf-8'), v.decode('utf-8')))
        counts = (
            messages
            | beam.WindowInto(
                window.FixedWindows(10),
                trigger=AfterCount(1),  # AfterCount(4), AfterProcessingTime
                # allowed_lateness=3,
                accumulation_mode=AccumulationMode.ACCUMULATING)  # ACCUMULATING or DISCARDING
            # | 'Windowing' >> beam.WindowInto(window.FixedWindows(10, 5))
            | 'TakeKeyPairWithOne' >> beam.MapTuple(lambda k, v: (k, 1))
            | 'Grouping' >> beam.GroupByKey()
            | 'Sum' >> beam.MapTuple(lambda k, v: (k, sum(v)))
        )
        output = (
            counts
            | 'Print' >> beam.ParDo(print)
            # | 'WriteText' >> beam.io.WriteToText('/home/StatOut', file_name_suffix='.txt')
        )
There are a couple of known issues that you might be running into.
Beam's portable DirectRunner currently does not fully support streaming. The relevant Jira to follow is https://issues.apache.org/jira/browse/BEAM-7514
Beam's portable runners (including the DirectRunner) have a known issue where streaming sources do not properly emit messages, so the max_num_records or max_read_time arguments have to be provided to convert such sources to bounded sources. The relevant Jira to follow is https://issues.apache.org/jira/browse/BEAM-11998.
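Until those are resolved, a workaround is to bound the source explicitly. Here is a sketch based on your ReadKafka step (the limits are arbitrary examples you would tune for your topic):
# Workaround sketch: make the Kafka source bounded so the portable runner
# can process and emit it.
msg_kv_bytes = (
    p
    | 'ReadKafka' >> ReadFromKafka(
        consumer_config=conf,
        topics=['LaneIn'],
        max_num_records=1000,  # stop after this many records...
        max_read_time=60))     # ...or stop reading after this many seconds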
I need some help. I have a fairly trivial task: read from Pub/Sub and write batch files to GCS, but I'm struggling with fileio.WriteToFiles.
with beam.Pipeline(options=pipeline_options) as p:
    input = (p
             | 'ReadData' >> beam.io.ReadFromPubSub(topic=known_args.input_topic).with_output_types(bytes)
             | "Decode" >> beam.Map(lambda x: x.decode('utf-8'))
             | 'Parse' >> beam.Map(parse_json)
             | ' data w' >> beam.WindowInto(
                 FixedWindows(60),
                 accumulation_mode=AccumulationMode.DISCARDING
             ))
    event_data = (input
                  | 'filter events' >> beam.Filter(lambda x: x['t'] == 'event')
                  | 'encode et' >> beam.Map(lambda x: json.dumps(x))
                  | 'write events to file' >> fileio.WriteToFiles(
                      path='gs://extention/ga_analytics/events/', shards=0))
I need one file after my window fires, but the number of output files is equal to the number of messages from Pub/Sub. Can anyone help me?
[Screenshot: current output files, one file per message]
But I need only one file.
I recently ran into this issue and dug into the source code:
fileio.WriteToFiles will attempt to output each element in the bundle as an individual file. WriteToFiles only falls back to writing sharded files if the number of elements exceeds the number of writers.
To force a single file to be created for all elements in the window, set max_writers_per_bundle to 0:
WriteToFiles(shards=1, max_writers_per_bundle=0)
You can also pass WriteToFiles(shards=1, ...) to limit the output to a single shard per window and trigger firing.
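Putting the two together, the write step from the question would look something like this sketch:
event_data = (input
              | 'filter events' >> beam.Filter(lambda x: x['t'] == 'event')
              | 'encode et' >> beam.Map(lambda x: json.dumps(x))
              | 'write events to file' >> fileio.WriteToFiles(
                  path='gs://extention/ga_analytics/events/',
                  shards=1,                   # a single shard per window/firing
                  max_writers_per_bundle=0))  # one writer for the whole bundle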
We have an app that has users; each user uses our app for something like 10-40 minutes per go, and I would like to count the distribution/occurrences of events happening per such session, based on specific events having happened (e.g. "this user converted", "this user had a problem last session", "this user had a successful last session").
(After this I'd like to count these higher-level events per day, but that's a separate question.)
For this I've been looking into session windows, but all the docs seem geared towards global session windows, whereas I'd like to create them per user (which is also a natural partitioning).
I'm having trouble finding docs (Python preferred) on how to do this. Could you point me in the right direction?
Or in other words: how do I create per-user, per-session windows that can output more structured (enriched) events?
What I have
import logging
from itertools import groupby

import apache_beam as beam


class DebugPrinter(beam.DoFn):
    """Just prints the element with logging"""
    def process(self, element, window=beam.DoFn.WindowParam):
        _, x = element
        logging.info(">>> Received %s %s with window=%s",
                     x['jsonPayload']['value'], x['timestamp'], window)
        yield element


def sum_by_event_type(user_session_events):
    """Counts the events per event type (groupby only groups adjacent equal keys)."""
    logging.debug("Received %i events: %s", len(user_session_events), user_session_events)
    d = {}
    for key, group in groupby(user_session_events, lambda e: e['jsonPayload']['value']):
        d[key] = len(list(group))
    logging.info("After counting: %s", d)
    return d
# ...
by_user = valid \
| 'keyed_on_user_id' >> beam.Map(lambda x: (x['jsonPayload']['userId'], x))
session_gap = 5 * 60 # [s]; 5 minutes
user_sessions = by_user \
| 'user_session_window' >> beam.WindowInto(beam.window.Sessions(session_gap),
timestamp_combiner=beam.window.TimestampCombiner.OUTPUT_AT_EOW) \
| 'debug_printer' >> beam.ParDo(DebugPrinter()) \
| beam.CombinePerKey(sum_by_event_type)
What it outputs
INFO:root:>>> Received event_1 2019-03-12T08:54:29.200Z with window=[1552380869.2, 1552381169.2)
INFO:root:>>> Received event_2 2019-03-12T08:54:29.200Z with window=[1552380869.2, 1552381169.2)
INFO:root:>>> Received event_3 2019-03-12T08:54:30.400Z with window=[1552380870.4, 1552381170.4)
INFO:root:>>> Received event_4 2019-03-12T08:54:36.300Z with window=[1552380876.3, 1552381176.3)
INFO:root:>>> Received event_5 2019-03-12T08:54:38.100Z with window=[1552380878.1, 1552381178.1)
So as you can see, the Sessions() window doesn't expand the window; it only groups very close events together... What am I doing wrong?
You can get it to work by adding a GroupByKey transform after the windowing. You have assigned keys to the records but haven't actually grouped them together by key, and session windowing (which works per key) does not know that these events need to be merged together.
To confirm this I made a reproducible example with some in-memory dummy data (to isolate Pub/Sub from the problem and be able to test it more quickly). All five events have the same key, or user_id, but they "arrive" sequentially 1, 2, 4 and 8 seconds apart from each other. As I use a session_gap of 5 seconds, I expect the first 4 elements to be merged into the same session. The 5th event comes 8 seconds after the 4th one, so it is relegated to the next session (gap over 5 s). The data is created like this:
data = [{'user_id': 'Thanos', 'value': 'event_{}'.format(event), 'timestamp': time.time() + 2**event} for event in range(5)]
We use beam.Create(data) to initialize the pipeline and beam.window.TimestampedValue to assign the "fake" timestamps. Again, we are just simulating streaming behavior with this. After that, we create the key-value pairs using the user_id field, we window into window.Sessions, and we add the missing beam.GroupByKey() step. Finally, we log the results with a slightly modified version of DebugPrinter. The pipeline now looks like this:
events = (p
| 'Create Events' >> beam.Create(data) \
| 'Add Timestamps' >> beam.Map(lambda x: beam.window.TimestampedValue(x, x['timestamp'])) \
| 'keyed_on_user_id' >> beam.Map(lambda x: (x['user_id'], x))
| 'user_session_window' >> beam.WindowInto(window.Sessions(session_gap),
timestamp_combiner=window.TimestampCombiner.OUTPUT_AT_EOW) \
| 'Group' >> beam.GroupByKey()
| 'debug_printer' >> beam.ParDo(DebugPrinter()))
where DebugPrinter is:
class DebugPrinter(beam.DoFn):
"""Just prints the element with logging"""
def process(self, element, window=beam.DoFn.WindowParam):
for x in element[1]:
logging.info(">>> Received %s %s with window=%s", x['value'], x['timestamp'], window)
yield element
If we test this without grouping by key we get the same behavior:
INFO:root:>>> Received event_0 1554117323.0 with window=[1554117323.0, 1554117328.0)
INFO:root:>>> Received event_1 1554117324.0 with window=[1554117324.0, 1554117329.0)
INFO:root:>>> Received event_2 1554117326.0 with window=[1554117326.0, 1554117331.0)
INFO:root:>>> Received event_3 1554117330.0 with window=[1554117330.0, 1554117335.0)
INFO:root:>>> Received event_4 1554117338.0 with window=[1554117338.0, 1554117343.0)
But after adding it, the windows now work as expected. Events 0 to 3 are merged together in an extended 12s session window. Event 4 belongs to a separate 5s session.
INFO:root:>>> Received event_0 1554118377.37 with window=[1554118377.37, 1554118389.37)
INFO:root:>>> Received event_1 1554118378.37 with window=[1554118377.37, 1554118389.37)
INFO:root:>>> Received event_3 1554118384.37 with window=[1554118377.37, 1554118389.37)
INFO:root:>>> Received event_2 1554118380.37 with window=[1554118377.37, 1554118389.37)
INFO:root:>>> Received event_4 1554118392.37 with window=[1554118392.37, 1554118397.37)
Full code here
Two additional things are worth mentioning. The first is that, even when running this locally on a single machine with the DirectRunner, records can arrive out of order (event_3 was processed before event_2 in my case). This is done on purpose to simulate distributed processing, as documented here.
The last one is that if you get a stack trace like this:
TypeError: Cannot convert GlobalWindow to apache_beam.utils.windowed_value._IntervalWindowBase [while running 'Write Results/Write/WriteImpl/WriteBundles']
downgrade from the 2.10.0/2.11.0 SDK to 2.9.0. See this answer for an example.
I'm trying to get CPU usage in Python without using psutil.
I've tried the following, but it always seems to report the same figure...
import os

def getCPUuse():
    return str(os.popen("top -n1 | awk '/Cpu\(s\):/ {print $2}'").readline().strip())

print(getCPUuse())
This always seems to report 3.7% even when I load up the CPU.
I have also tried the following...
str(round(float(os.popen('''grep 'cpu ' /proc/stat | awk '{usage=($2+$4)*100/($2+$4+$5)} END {print usage }' ''').readline()),2))
This always seems to return 5.12. I must admit I don't really know what the above does. If I enter grep cpu /proc/stat on the command line I get something like this...
cpu 74429 1 19596 1704779 5567 0 284 0 0 0
cpu0 19596 0 4965 422508 1640 0 279 0 0 0
cpu1 18564 1 4793 427115 1420 0 1 0 0 0
cpu2 19020 0 4861 426916 1206 0 2 0 0 0
cpu3 17249 0 4977 428240 1301 0 2 0 0 0
I'm guessing my command isn't properly extracting the values for all of my CPU cores from the above output?
My objective is to get the total CPU % of my device (a Raspberry Pi) without using psutil. The figure should reflect what is displayed in the OS task manager.
What psutil, htop, mpstat and the like do is read the line starting with "cpu" (actually the first line) from /proc/stat and then calculate a percentage from the values in that line. You can find the meaning of those values in man 5 proc (search for "proc/stat").
That's also what the grep cpu /proc/stat | awk ... command you mentioned does. But the values in /proc/stat represent the times spent since last boot! Maybe they wrap around after a while, I'm not sure, but the point is that these are numbers measured over a really long time.
So if you run that command, and run it again a few seconds (or minutes, or even hours) later, they won't have changed much. That's why it always returned 5.12 for you.
Programs like top remember the previous values and subtract them from the newly read values. From the result a 'live' percentage can be calculated that actually reflects recent CPU load.
To do something like that in Python as simply as possible, but without running external commands to read /proc/stat and do the calculations for us, we can store the values we've read into a file. On the next run we read them back in and subtract them from the new values.
#!/usr/bin/env python2.7

# Read first line from /proc/stat. It should start with "cpu"
# and contains times spent in various modes by all CPUs totalled.
#
with open("/proc/stat") as procfile:
    cpustats = procfile.readline().split()

# Sanity check
#
if cpustats[0] != 'cpu':
    raise ValueError("First line of /proc/stat not recognised")

#
# Refer to "man 5 proc" (search for /proc/stat) for information
# about which field means what.
#
# Here we do calculation as simple as possible:
# CPU% = 100 * time_doing_things / (time_doing_things + time_doing_nothing)
#
user_time = int(cpustats[1])    # time spent in user space
nice_time = int(cpustats[2])    # 'nice' time spent in user space
system_time = int(cpustats[3])  # time spent in kernel space
idle_time = int(cpustats[4])    # time spent idly
iowait_time = int(cpustats[5])  # time spent waiting is also doing nothing

time_doing_things = user_time + nice_time + system_time
time_doing_nothing = idle_time + iowait_time

# The times read from /proc/stat are total times, i.e. *all* times spent
# doing things and doing nothing since last boot.
#
# So to calculate meaningful CPU % we need to know how much these values
# have *changed*. So we store them in a file which we read next time the
# script is run.
#
previous_values_file = "/tmp/prev.cpu"
prev_time_doing_things = 0
prev_time_doing_nothing = 0

try:
    with open(previous_values_file) as prev_file:
        prev1, prev2 = prev_file.readline().split()
    prev_time_doing_things = int(prev1)
    prev_time_doing_nothing = int(prev2)
except IOError:  # To prevent error/exception if file does not exist. We don't care.
    pass

# Write the new values to the file to use next run
#
with open(previous_values_file, 'w') as prev_file:
    prev_file.write("{} {}\n".format(time_doing_things, time_doing_nothing))

# Calculate difference, i.e. how much the numbers have changed
#
diff_time_doing_things = time_doing_things - prev_time_doing_things
diff_time_doing_nothing = time_doing_nothing - prev_time_doing_nothing

# Calculate a percentage of change since last run:
#
cpu_percentage = 100.0 * diff_time_doing_things / (diff_time_doing_things + diff_time_doing_nothing)

# Finally, output the result
#
print "CPU", cpu_percentage, "%"
Here's a version that, not unlike top, prints CPU usage every second, remembering CPU times from previous measurement in variables instead of a file:
#!/usr/bin/env python2.7

import time


def get_cpu_times():
    # Read first line from /proc/stat. It should start with "cpu"
    # and contains times spent in various modes by all CPUs totalled.
    #
    with open("/proc/stat") as procfile:
        cpustats = procfile.readline().split()

    # Sanity check
    #
    if cpustats[0] != 'cpu':
        raise ValueError("First line of /proc/stat not recognised")

    # Refer to "man 5 proc" (search for /proc/stat) for information
    # about which field means what.
    #
    # Here we do calculation as simple as possible:
    #
    # CPU% = 100 * time_doing_things / (time_doing_things + time_doing_nothing)
    #
    user_time = int(cpustats[1])    # time spent in user space
    nice_time = int(cpustats[2])    # 'nice' time spent in user space
    system_time = int(cpustats[3])  # time spent in kernel space
    idle_time = int(cpustats[4])    # time spent idly
    iowait_time = int(cpustats[5])  # time spent waiting is also doing nothing

    time_doing_things = user_time + nice_time + system_time
    time_doing_nothing = idle_time + iowait_time

    return time_doing_things, time_doing_nothing


def cpu_percentage_loop():
    prev_time_doing_things = 0
    prev_time_doing_nothing = 0

    while True:  # loop forever printing CPU usage percentage
        time_doing_things, time_doing_nothing = get_cpu_times()

        diff_time_doing_things = time_doing_things - prev_time_doing_things
        diff_time_doing_nothing = time_doing_nothing - prev_time_doing_nothing

        cpu_percentage = 100.0 * diff_time_doing_things / (diff_time_doing_things + diff_time_doing_nothing)

        # remember current values to subtract next iteration of the loop
        #
        prev_time_doing_things = time_doing_things
        prev_time_doing_nothing = time_doing_nothing

        # Output latest percentage
        #
        print "CPU", cpu_percentage, "%"

        # Loop delay
        #
        time.sleep(1)


if __name__ == "__main__":
    cpu_percentage_loop()
That's not really easy, since most of the approaches you describe provide the cumulative, or total average, of the CPU usage.
Maybe you can try the mpstat command that comes with the sysstat package.
So, the steps I used for the following script are:
Ask mpstat to generate 2 reports, one right now and the other after 1 second (mpstat 1 2)
Then we get the Average line (the last line)
The last column is the %idle column, so we get that with the $NF variable from awk
We use Popen from subprocess, setting shell=True to accept our pipes (|)
We execute the command (communicate())
Clean the output with strip()
And subtract the idle percentage from 100, to get the used value
Since it sleeps for 1 second, don't worry that it is not an instant command.
import subprocess
cmd = "mpstat 1 2 | grep Average | awk '{print $NF}'"
p = subprocess.Popen(cmd, stdout=subprocess.PIPE, shell=True)
out, err = p.communicate()
idle = float(out.strip())
print(100-idle)
I am currently working on a robot that has to traverse a maze.
For the robot I am using a TMC222 Stepper controller and the software is coded in Python.
I am in need of a function which can tell me when the motors are busy, so that the robot will cease all other activity while the motors are running.
My idea is to check the current position of the motors and compare it to the target position, but I haven't gotten it to work yet.
My current attempt:
def isRunning(self):
    print("IS RUNNING TEST")
    fullstatus = self.getFullStatus2()
    # print("FULL STATUS: " + str(fullstatus[0]) + " 2 " + str(fullstatus[1]))
    actLeft = fullstatus[0][1] << 8 | fullstatus[0][2] << 0
    actRight = fullstatus[1][1] << 8 | fullstatus[1][2] << 0
    tarLeft = fullstatus[0][3] << 8 | fullstatus[0][4] << 0
    tarRight = fullstatus[1][3] << 8 | fullstatus[1][4] << 0
    value = (actLeft == tarLeft) and (actRight == tarRight)
    value = not value
    # print("isbusy=" + str(value))
    print('ActPos = ' + str(actLeft))
    print('TarPos = ' + str(tarLeft))
    return value
It would be helpful to see your getFullStatus2() code as well, since it's unclear to me how you're getting a multidimensional output.
In general, you can form a 16-bit "word" from two 8-bit bytes just as you have it:
Word = HB << 8 | LB << 0
Where HB and LB are the high (bits 15-8) and low (bits 7-0) bytes.
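For example, in Python:
# Combine two 8-bit bytes into an unsigned 16-bit word, and split it back.
def bytes_to_word(hb, lb):
    return (hb << 8) | lb

def word_to_bytes(word):
    return (word >> 8) & 0xFF, word & 0xFF

assert bytes_to_word(0x12, 0x34) == 0x1234
assert word_to_bytes(0x1234) == (0x12, 0x34)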
That being said, there are multiple ways to detect a motor stall. The ideal way would be an external pressure switch that closes when the robot hits a wall. Another would be to monitor the motor's current: when the motor faces resistance (when accelerating or stalled), the current rises.
Since it looks like neither of these is possible, I'd use yet another approach: monitoring the motor's position (presumably from some sort of encoder) over time.
Let's say you have a function get_position() that returns an unsigned 16-bit integer. You should be able to write something like:
THRESHOLD = 32768  # half the 16-bit range, used to detect position wraparound

class MotorPosition(object):
    def __init__(self):
        self.readings = []

    def poll(self):
        p = get_position()
        self.readings.append(p)
        # If the list is now too long, remove the oldest entry
        if len(self.readings) > 5:
            self.readings.pop(0)

    def get_deltas(self):
        deltas = []
        for x, y in zip(self.readings[1:], self.readings[:-1]):
            d = x - y
            # Wraparound detection
            if d < -THRESHOLD:
                d += 65536
            elif d > THRESHOLD:
                d -= 65536
            deltas.append(d)
        return deltas

    def get_average_delta(self):
        deltas = self.get_deltas()
        return sum(deltas) / float(len(deltas))
Note that this assumes you're polling the encoder fast enough and at a consistent frequency.
You could then monitor the average delta (from get_average_delta()), and if it drops below some value, consider the motor stalled (see the sketch below).
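For example, a polling loop on top of that class might look like the following sketch (the poll period, window size and stall threshold are made-up values you would need to tune for your encoder):
import time

STALL_DELTA = 2     # average counts per poll below which we call it stalled
POLL_PERIOD = 0.05  # seconds between polls; keep this consistent

def wait_until_stopped(motor_position):
    # Poll the encoder until the average position change per interval drops
    # below the stall threshold, i.e. the motor has (probably) stopped moving.
    while True:
        motor_position.poll()
        if len(motor_position.readings) >= 5:  # wait for a full window of readings
            if abs(motor_position.get_average_delta()) < STALL_DELTA:
                return
        time.sleep(POLL_PERIOD)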
Assumptions:
This is the datasheet for the controller you're using
Your I²C code is working correctly