I'm writing a simple unit test which runs:
botocore.client.IoT.search_index(queryString='connectivity.connected:true')
My unittest simply connects a device, subscribes to MQTT, sends and receives a test message. This gives me reason to trust the device is truly online.
Sometimes my unit test passes, sometimes it fails. When I drop into a debugger and run the search_index call repeatedly, I see inconsistent results between calls. Sometimes the device I just connected is online, sometimes it's not; after 20ish seconds the device appears to be consistently online.
I believe I'm probably getting responses from different servers and the propagation of the connected state between servers is simply delayed on the AWS side.
If my assessment is correct, then I want to know if there's anything I can do to force a consistent state between calls. Coding around this kind of inconsistent behavior is extremely error-prone and almost certain to induce very hard-to-track bugs. Plus, I don't trust that many other requests I'm making to AWS IoT are safe to rely on. In short, I'm not going to do it; I'll find a better solution if there's no way to force AWS IoT to provide a consistent state between calls.
Given that you are unit testing your code, what you can do is mock the botocore.client.IoT.search_index response. See the patch function from the unittest.mock library (https://docs.python.org/3/library/unittest.mock.html). For your case it should be something like:
from unittest.mock import patch

@patch('botocore.client.IoT')
def test(iotMockedClass):
    iotMockedClass.search_index.return_value = 'a fixed value that you must define'
    # ... your unit test case
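If it helps, here is a hedged sketch of what that fixed return value might look like for search_index; check the exact response shape against the boto3 docs, and note that the thing name and timestamp below are made up:

# Hypothetical fixture roughly shaped like a search_index response.
fake_response = {
    "things": [
        {
            "thingName": "my-test-device",
            "connectivity": {"connected": True, "timestamp": 1700000000000},
        }
    ]
}
iotMockedClass.search_index.return_value = fake_response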
It's also important to mention that unit tests should not depend on the environment; external dependencies should be replaced with mocks or stubs.
Related
I'm using pytest to test a Flask API that also makes use of MQTT. I'm late to the TDD game so this question could well be answered elsewhere but I can't think of the right way to frame it, so can't find anything that seems to answer it for me. This question that deals with mocking during integration tests sounds like it's on the right track.
Basically, one of my tests looks like this:
response = testing_client.post(
    "/api/users/1/devices",
    data=json.dumps(dict(device)),
    headers={"Authorization": "Token %s" % auth_token},
    content_type="application/json",
)
assert response.status_code == 200
The problem is that the corresponding bit of code that handles these POST requests also publishes an MQTT message once the request has been handled, i.e.
publish.single(
    f"{device_id}/active",
    msg,
    hostname=os.environ.get("MQTT_BROKER_URL"),
    retain=True,
    qos=0,
    port=int(os.environ.get("MQTT_BROKER_PORT")),
)
The API and MQTT broker are separate containers managed using Docker compose. When I'm testing locally there is no MQTT broker running and so any test here fails (even though I don't actually care about testing the MQTT code).
For any calls to the DB (Postgres) I actually set up a specific postgres container for testing and run tests against this. Should I do the same thing for MQTT testing (I would then also have to do it during the CI pipeline on GitLab), or is there a much more obvious solution that I am missing?
You should definitely mock that call in the test and create a mock return value. Imagine having 100 tests that each perform an HTTP request; that becomes far too slow in your continuous integration process (it scales terribly).
Your code should handle both the case where the request fails and the case where it succeeds. In one test you mock the result as a success, in another as a failure, and then you test how your code interacts with each result.
monkeypatch.setattr(testing_client, "post", lambda *args, **kwargs: some_return_value)
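For this particular Flask test, though, what you probably want to stub out is the MQTT publish itself, so the real endpoint still runs but no broker is needed. A hedged sketch using pytest's monkeypatch; the dotted path app.views and the fixture names are assumptions, so patch publish.single wherever your request handler actually imports it from:

import json

import pytest


@pytest.fixture
def fake_publish(monkeypatch):
    published = []

    def fake_single(topic, payload=None, *args, **kwargs):
        # Record the call instead of talking to a real broker.
        published.append((topic, payload))

    # Hypothetical dotted path -- adjust to the module that imports publish.
    monkeypatch.setattr("app.views.publish.single", fake_single)
    return published


def test_add_device(testing_client, fake_publish, device, auth_token):
    response = testing_client.post(
        "/api/users/1/devices",
        data=json.dumps(dict(device)),
        headers={"Authorization": "Token %s" % auth_token},
        content_type="application/json",
    )
    assert response.status_code == 200
    # The handler should have attempted exactly one MQTT publish.
    assert len(fake_publish) == 1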
I tend to think mocking isn't very useful testing, because how do you know the mock behaves the same as the real thing? So another option for fast tests is a verified fake.
A verified fake is an object with the same interface as the real thing, plus a set of tests validating that the fake and the real thing follow the same contract. https://pythonspeed.com/articles/verified-fakes/ has more details.
For integration tests, using MQTT in another container may well be fine. The speed concern is quite often overdone; e.g. for databases (and maybe MQTT as well) you can usually make them run very fast in tests by disabling fsync. See https://pythonspeed.com/articles/faster-db-tests/
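As a rough illustration of what a verified fake could look like here (the Publisher interface and names below are hypothetical, not taken from the question's code): a real implementation wrapping paho-mqtt, an in-memory fake, and one contract check you run against both.

import paho.mqtt.publish as mqtt_publish


class MqttPublisher:
    """Real implementation: publishes via paho-mqtt."""

    def __init__(self, hostname, port):
        self.hostname = hostname
        self.port = port

    def publish(self, topic, payload):
        mqtt_publish.single(topic, payload, hostname=self.hostname, port=self.port)


class FakePublisher:
    """Verified fake: same interface, just records messages in memory."""

    def __init__(self):
        self.messages = []

    def publish(self, topic, payload):
        self.messages.append((topic, payload))


def check_publisher_contract(publisher):
    # The shared contract both implementations must satisfy: publish()
    # accepts a topic and a payload and returns None without raising.
    assert publisher.publish("devices/1/active", "on") is None

Most tests then use FakePublisher, while a small integration test runs check_publisher_contract against MqttPublisher pointed at a real broker.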
My team has puzzled over this issue on and off for weeks now. We have a test suite using LiveServerTestCase that runs all the Selenium-based tests that we have. One test in particular will seemingly fail at random: I could change a comment in a different file and the test would fail; changing some other comment would fix the test again. We are using the Firefox webdriver for the Selenium tests:
self.driver = Firefox()
Testing locally inside our Docker container can never reproduce the error. This is most likely because when tests.py is run outside of Travis CI, a different web driver is used instead of Firefox(). The web driver in that case is:
self.driver = WebDriver("http://selenium:4444/wd/hub", desired_capabilities={'browserName':'firefox'})
For local testing, we use a Selenium container.
The test that fails is a series of sub-tests that each tests a filtering search feature that we have; each sub-test is a different filter query. The sequence of each sub-test is:
Find the filter search bar element
Send the filter query (a string, i.e. something like "function = int main()")
Simulate the browser click to execute the query
For the specific filter on the set of data (the set of data is consistent throughout the subtests), assert that the length of the returned results matches what is expected for that specific filter
Very often this test will pass when run in Travis CI, and as noted before, this test always passes when run locally. The error cannot be reproduced when interacting with the site manually in a web browser. However, once in a while, this sort of error will appear in the test output in Travis CI:
- Broken pipe from ('127.0.0.1', 39000)
- Broken pipe from ('127.0.0.1', 39313)
39000 and 39313 are not always the numbers; these change every time a new Travis CI build is run. These seem like port numbers, though I'm not really sure what they actually are.
We have time.sleep(sec) lines right before fetching the list of results for a filter. Increasing the sleep time usually correlates with a temporary fix of the broken pipe error. However, the test is very fickle, and changing the sleep time likely does not have much to do with fixing the error at all; there have been times where the sleep time was reduced or taken out of a subtest and the test passed. In any case, as a result of the broken pipe, the filter cannot get executed and the assertion fails.
One potentially interesting detail is that regardless of the order of subtests, it is always the first subtest that fails if the broken pipe error occurs. If, however, the first subtest passes, then all subtests will always pass.
So, my question is: what on earth is going on here and how do we make sure that this random error stops happening? Apologies if this is a vague/confusing question, but unfortunately that is the nature of the problem.
It looks like your issue may be similar to what this fellow was running into. It's perhaps an issue with your timeouts. You may want to use an explicit wait, or try waiting for a specific element to load before comparing the data. I had similar issues where my Selenium test would try polling an image to see if it was present before the page had finished loading. Like I say, this may not be the same issue, but it could potentially help. Good luck!
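For reference, a minimal sketch of an explicit wait inside a test method (the CSS selector and expected_count are placeholders; wait on whatever element signals that the filtered results have finished rendering):

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# Wait up to 10 seconds for the result rows to appear instead of time.sleep().
results = WebDriverWait(self.driver, 10).until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".filter-result"))
)
assert len(results) == expected_count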
I just ran into this myself, and it is caused by Django's built-in server not using Python's logging system. This has been fixed in 1.10 but is not released yet at the time of writing. In my case it is acceptable to leave the messages in the log until it is time to upgrade; that is better than adding timeouts and increasing build time.
Django ticket on the matter
Code that's causing the issue in 1.9.x
I'm trying to do some machinery automation with Python, but I've run into a problem.
I have code that does the actual control, code that logs, code that provides a GUI, and some other modules, all being called from a single script.
The issue is that an error in one module halts all the others. So, for instance, a bug in the GUI will kill the control systems.
I want to be able to have the modules run independently, so one can crash, be restarted, be patched, etc without halting the others.
The only way I can find to make that work is to store the variables in an SQL database, or files or something.
Is there a way for one Python script to sort of... debug another, so that one script can read or change the variables in the other? I can't find a way to do that which also allows the scripts to be started and stopped independently.
Does anyone have any ideas or advice?
A fairly effective way to do this is to use message passing. Each of your modules is independent, but they can send and receive messages to and from each other. A very good reference on the many ways to achieve this in Python is the Python wiki page for parallel processing.
A generic strategy
Split your program into pieces where there are servers and clients. You could then use middleware such as 0MQ, Apache ActiveMQ or RabbitMQ to send data between different parts of the system.
In this case, your GUI could send a message to the log parser server telling it to begin work. Once it's done, the log parser will send a broadcast message to anyone interested, announcing a reference to the results. The GUI could be a subscriber to the channel that the log parser publishes to. Once it receives the message, it will open up the results file and display whatever the user is interested in.
Serialization and deserialization speed is important also. You want to minimise the overhead for communicating. Google Protocol Buffers and Apache Thrift are effective tools here.
You will also need some form of supervision strategy to prevent a failure in one of the servers from blocking everything. supervisord will restart things for you and is quite easy to configure. Again, it is only one of many options in this space.
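For instance, a minimal sketch of that flow with 0MQ via pyzmq (the address, port, and topic name are arbitrary choices, and in practice each side lives in its own long-running process, so subscribers are connected before results are published):

import zmq


# Log parser side: publish a reference to the finished results.
def publish_results(path_to_results):
    ctx = zmq.Context.instance()
    pub = ctx.socket(zmq.PUB)
    pub.bind("tcp://127.0.0.1:5556")               # address/port are arbitrary
    pub.send_string("results " + path_to_results)  # "results" acts as the topic


# GUI side: subscribe and block until a results message arrives.
def wait_for_results():
    ctx = zmq.Context.instance()
    sub = ctx.socket(zmq.SUB)
    sub.connect("tcp://127.0.0.1:5556")
    sub.setsockopt_string(zmq.SUBSCRIBE, "results")
    _topic, _, path = sub.recv_string().partition(" ")
    return path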
Overkill much?
It sounds like you have created a fairly simple utility. The multiprocessing module is an excellent way to have different bits of the program running fairly independently. You still apply the same strategy (message passing, no shared state, supervision), but with different tactics.
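A small sketch of that with the multiprocessing module (the command names and queue layout are just for illustration): the control loop runs in its own process and exchanges messages with the GUI process over queues, so a crash in one doesn't take down the other.

import multiprocessing as mp


def control_loop(commands, status):
    # Runs in its own process; a crash here does not kill the GUI process.
    while True:
        cmd = commands.get()
        if cmd == "stop":
            break
        status.put({"last_command": cmd, "ok": True})


if __name__ == "__main__":
    commands, status = mp.Queue(), mp.Queue()
    worker = mp.Process(target=control_loop, args=(commands, status), daemon=True)
    worker.start()

    commands.put("open_valve")
    print(status.get())      # e.g. {'last_command': 'open_valve', 'ok': True}
    commands.put("stop")
    worker.join()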
You want multiple independent processes, and you want them to talk to each other. Hence: read up on what methods of inter-process communication are available on your OS. I recommend sockets (generic, will work over a network and across different OSes). You can easily invent a simple (maybe HTTP-like) protocol on top of TCP, maybe with JSON for messages. There is a bunch of classes in the Python distribution to make this easy (SocketServer.ThreadingMixIn, SocketServer.TCPServer, etc.; the module is called socketserver in Python 3).
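A minimal sketch of that kind of server with a newline-delimited JSON protocol (the port, the STATE dict, and the get/set operations are arbitrary examples):

import json
import socketserver

STATE = {"temperature": 21.5}  # variables other scripts may read or change


class Handler(socketserver.StreamRequestHandler):
    def handle(self):
        for line in self.rfile:                      # one JSON message per line
            msg = json.loads(line)
            if msg.get("op") == "get":
                reply = {"value": STATE.get(msg["key"])}
            elif msg.get("op") == "set":
                STATE[msg["key"]] = msg["value"]
                reply = {"ok": True}
            else:
                reply = {"error": "unknown op"}
            self.wfile.write((json.dumps(reply) + "\n").encode())


class Server(socketserver.ThreadingMixIn, socketserver.TCPServer):
    allow_reuse_address = True


if __name__ == "__main__":
    Server(("127.0.0.1", 9000), Handler).serve_forever()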
I'm working with a device that is essentially a black box, and the only known communication method for it is XML-RPC. It works for most needs, except for when I need to execute two commands very quickly after each other. Due to the overhead and waiting for the RPC response, this is not as quick as desired.
My main question is, how does one reduce this overhead to make this functionality possible? I know the obvious solution is to ditch XML-RPC, but I don't think that's possible for this device, as I have no control over implementing any other protocols from the "server". This also makes it impossible to do a MultiCall, as I can not add valid instructions for MultiCall. Does MultiCall have to be implemented server side? For example, if I have method1(), method2(), and method3() all implemented by the server already, should this block of code work to execute them all in one reply? I'd assume no from my testing so far, as the documentation shows examples where I need to initialize commands on the server side.
server=xmlrpclib.ServerProxy(serverURL)
multicall=xmlrpclib.MultiCall(server)
multicall.method1()
multicall.method2()
multicall.method3()
multicall()
Also, looking through the source of xmlrpclib, I see references to a "FastParser" as opposed to a default one that is used. However, I can not determine how to enable this parser over the default. Additionally, the comment on this answer mentions that it parses one character at a time. I believe this is related, but again, no idea how to change this setting.
Unless the bulk size of your requests or responses is very large, it's unlikely that changing the parser will affect the turnaround time (since the CPU is much faster than the network).
You might want to consider, if possible, sending more than one command to the device without waiting for the response from the first one. If the device can handle multiple requests at once, then this may be of benefit. Even if the device only handles requests in sequence, you can still have the next request waiting at the device so that there is no delay after processing the previous one. If the device serialises requests in this way, then that's going to be about the best you can do.
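If the device does accept concurrent connections, one hedged way to overlap the round trips from the client side (method1 and method2 are the placeholder names from the question; serverURL is yours) is to issue each call from a thread pool:

from concurrent.futures import ThreadPoolExecutor
import xmlrpc.client  # xmlrpclib in Python 2

serverURL = "http://device.example/RPC2"  # placeholder


def call(method, *args):
    # One ServerProxy per call: proxies are not guaranteed safe to share across threads.
    proxy = xmlrpc.client.ServerProxy(serverURL)
    return getattr(proxy, method)(*args)


with ThreadPoolExecutor(max_workers=2) as pool:
    f1 = pool.submit(call, "method1")
    f2 = pool.submit(call, "method2")
    result1, result2 = f1.result(), f2.result()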
I have a long-running twisted server.
In a large system test, at one particular point several minutes into the test, when some clients enter a particular state and a particular outside event happens, then this server takes several minutes of 100% CPU and does its work very slowly. I'd like to know what it is doing.
How do you get a profile for a particular span of time in a long-running server?
I could easily send the server start and stop messages via HTTP if there were a way to enable or inject the profiler at runtime.
Given the choice, I'd like stack-based/call-graph profiling but even leaf sampling might give insight.
The yappi profiler can be started and stopped at runtime.
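A minimal sketch of that idea, assuming you wire these two functions up to whatever HTTP endpoints you add to the server (the endpoint plumbing and the output path are up to you):

import yappi


def start_profiling():
    # Call this from an HTTP handler when the interesting span begins.
    yappi.start()


def stop_profiling(path="profile.callgrind"):
    # Call this from another HTTP handler once the slow period is over.
    yappi.stop()
    yappi.get_func_stats().save(path, type="callgrind")  # view with kcachegrind
    yappi.clear_stats()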
Two interesting tools have come up that try to solve this specific problem: you might not have added profiling instrumentation to your code in advance, but you want to profile production code in a pinch.
pyflame will attach to an existing process using the ptrace(2) syscall and create "flame graphs" of the process. It's written in C++.
py-spy works by reading the process memory instead and figuring out the Python call stack. It can produce a flame graph and also provides a "top-like" interface to show which function is taking the most time. It's written in Rust and Python.
Not a very Pythonic answer, but maybe strace-ing the process gives some insight (assuming you are on Linux or similar).
Sticking strictly to Python, for such things I trace all calls, store the results in a ring buffer, and use a signal (maybe you could do that via your HTTP message) to dump that ring buffer. Of course, tracing slows everything down, but in your scenario you could switch the tracing on via an HTTP message as well, so it is only enabled while the trouble is active.
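A rough sketch of that approach, using sys.settrace for the tracing, a deque as the ring buffer, and SIGUSR1 as the dump trigger (the buffer size and choice of signal are arbitrary, and settrace only covers the current thread):

import collections
import signal
import sys

RING = collections.deque(maxlen=10000)  # keep only the most recent calls


def tracer(frame, event, arg):
    if event == "call":
        code = frame.f_code
        RING.append((code.co_filename, frame.f_lineno, code.co_name))
    return None  # no line-by-line tracing inside the frame


def dump_ring(signum, stack):
    for filename, lineno, name in RING:
        print(f"{filename}:{lineno} {name}", file=sys.stderr)


signal.signal(signal.SIGUSR1, dump_ring)  # kill -USR1 <pid> dumps the buffer
sys.settrace(tracer)                      # switch on, e.g. from an HTTP handler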
Pyliveupdate is a tool designed for exactly this purpose: profiling long-running programs without restarting them. It allows you to dynamically select specific functions to profile, or to stop profiling, without instrumenting your code ahead of time; it instruments the code dynamically to do the profiling.
Pyliveupdate has three key features:
Profile the call time of specific Python functions (by function name or module name).
Add or remove profiling without restarting the program.
Show profiling results as a call summary and flame graphs.
Check out a demo here: https://asciinema.org/a/304465.