I am trying to simulate a network link that is used by many nodes to communicate, in SimPy (a Python library). I modelled the network as simpy.Resource(env, 1), because communication channels are FIFO in general. Whenever a node needs to send data, it requests the network resource and then transfers the data.
def transfer(env, data_size):
    net_delay = data_size / WAN_BANDWIDTH
    with network_queue.request() as req:
        yield req
        yield env.timeout(net_delay)
But because of this, nodes transferring huge amounts of data occupy the channel while small transfers get queued behind them. I know for sure that this is not how real network transfers work: every active transfer gets a roughly equal share of the bandwidth. Any suggestions on how to solve this?
The following is what I came up with.
def transfer(env, transfer_size):
    transfer_size_remaining = transfer_size
    while transfer_size_remaining > 0:
        with network_queue.request() as request:
            yield request
            data_size = min(transfer_size_remaining, MTU)
            yield env.timeout(data_size / WAN_BANDWIDTH)
            transfer_size_remaining -= data_size
I will be requesting the network queue and sending MTU (1500 bytes) worth of data with every request. I think this should automatically make it round-robin and divide the bandwidth equally between all the nodes that are transferring data.
Is anything wrong with my solution? Are there any better ways to do it? Or is there a standard right way to do this with SimPy?
Thanks in advance!
I think your solution looks good; it is what my first try would look like. I am not seeing the issue Ye is trying to raise.
One thing you might want to consider is using a priority resource, where a node's first transfer gets a higher priority than its later transfers. The benefit is that small one-transfer jobs will not get hit with big queue times if your queue is filled with a bunch of huge multi-transfer jobs. The downside is that if your workload is made up of mostly small jobs, the big jobs will never get any resource time, since the small jobs will always have higher priority. That problem can be fixed by adding an interrupt to re-submit a big job with a higher priority if it sits in the queue too long. Of course, the best strategy (and there are a lot of them) depends on the makeup of the workload.
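A minimal sketch of one reading of that idea, using simpy.PriorityResource (lower numbers win), so that the first chunk of each job jumps ahead of the later chunks of jobs already in progress; WAN_BANDWIDTH and MTU are assumed to be defined as in the question:

import simpy

def transfer(env, network_queue, transfer_size):
    transfer_size_remaining = transfer_size
    first_chunk = True
    while transfer_size_remaining > 0:
        # first chunk of a job gets priority 0, later chunks priority 1
        prio = 0 if first_chunk else 1
        with network_queue.request(priority=prio) as request:
            yield request
            data_size = min(transfer_size_remaining, MTU)
            yield env.timeout(data_size / WAN_BANDWIDTH)
            transfer_size_remaining -= data_size
        first_chunk = False

env = simpy.Environment()
network_queue = simpy.PriorityResource(env, capacity=1)

The re-submit idea mentioned above would then just mean issuing a new request with a lower priority number once a job has waited too long.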
You should create another process to realise the timeout. Otherwise, it will be blocked by the first yield.
I am developing a highly loaded application that reads data from a DynamoDB on-demand table. Let's say it constantly performs around 500 reads per second.
From time to time I need to upload a large dataset into the database (100 million records). I use python, spark and audienceproject/spark-dynamodb. I set throughput=40k and use BatchWriteItem() for data writing.
In the beginning I observe some write-throttled requests and the write capacity is only 4k, but then upscaling takes place and the write capacity goes up.
Questions:
Does intensive writing affect reading in the case of on-demand tables? Does autoscaling work independently for reading/writing?
Is it fine to set large throughput for a short period of time? As far as I see the cost is the same in the case of on-demand tables. What are the potential issues?
I observe some throttled requests but eventually all the data is successfully uploaded. How can this be explained? I suspect that the client I use has advanced rate-limiting logic, but I haven't managed to find a clear answer so far.
That's a lot of questions in one question, so you'll get a high-level answer.
DynamoDB scales by increasing the number of partitions. Each item is stored on a partition. Each partition can handle:
up to 3000 Read Capacity Units
up to 1000 Write Capacity Units
up to 10 GB of data
As soon as any of these limits is reached, the partition is split into two and the items are redistributed. This happens until there is sufficient capacity available to meet demand. You don't control how that happens, it's a managed service that does this in the background.
The number of partitions only ever grows.
Based on this information we can address your questions:
Does intensive writing affect reading in the case of on-demand tables? Does autoscaling work independently for reading/writing?
The scaling mechanism is the same for read and write activity, but the scaling point differs as mentioned above. In an on-demand table AutoScaling is not involved, that's only for tables with provisioned throughput. You shouldn't notice an impact on your reads here.
Is it fine to set large throughput for a short period of time? As far as I see the cost is the same in the case of on-demand tables. What are the potential issues?
I assume you set the throughput that Spark can use as a budget for writing; it won't have that much of an impact on on-demand tables. It's information the connector can use internally to decide how much parallelization is possible.
I observe some throttled requests but eventually all the data is successfully uploaded. How can this be explained? I suspect that the client I use has advanced rate-limiting logic, but I haven't managed to find a clear answer so far.
If the client uses BatchWriteItem, it will get back a list of items that couldn't be written for each request and can enqueue them again. Exponential backoff may be involved, but that is an implementation detail. It's not magic: you just have to keep track of which items you've successfully written and enqueue those that you haven't again, until the "to-write" queue is empty.
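For illustration, a hedged sketch of that retry loop with boto3 (the table name, item shape, and backoff policy are all assumptions here, not necessarily what the spark-dynamodb connector does):

import time
import boto3

client = boto3.client("dynamodb")

def write_all(items, table_name="my-table"):
    # items are assumed to already be in DynamoDB attribute-value format
    to_write = [{"PutRequest": {"Item": item}} for item in items]
    backoff = 0.1
    while to_write:
        # BatchWriteItem accepts at most 25 items per call
        batch, to_write = to_write[:25], to_write[25:]
        response = client.batch_write_item(RequestItems={table_name: batch})
        unprocessed = response.get("UnprocessedItems", {}).get(table_name, [])
        if unprocessed:
            # throttled items go back on the queue, with exponential backoff
            to_write = unprocessed + to_write
            time.sleep(backoff)
            backoff = min(backoff * 2, 5)
        else:
            backoff = 0.1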
I am planning to set up a small proxy service for a remote sensor that only accepts one connection. I have a temporary solution, and I am now designing a more robust version, so I have dived deeper into the Python multiprocessing module.
I have written a couple of systems in Python using a main process which spawns subprocesses via the multiprocessing module and uses multiprocessing.Queue to communicate between them. This works quite well, and some of these programs/scripts are doing their job in a production environment.
The new case is slightly different since it uses 2+n processes:
One data-collector, that reads data from the sensor (at 100 Hz) and every once in a while receives short ASCII strings as commands
One main-server, that binds to a socket, listens for new connections and spawns...
n child-servers, that handle clients who want to have the sensor data
While communication from the child-servers to the data collector seems pretty straightforward using a multiprocessing.Queue, which manages an n:1 connection well enough, I have problems with the other direction. I can't use a queue for that as well, because all child-servers need to get all the data the sensor produces while they are active. At least I haven't found a way to configure a Queue to mimic that behaviour, since get() removes an item from the queue by design.
I have already looked into shared memory, which massively increases the management overhead, since as far as I understand it I would basically need to implement a streaming buffer myself.
The only safe way I see right now is using a Redis server and message queues, but I am a bit hesitant, since that would need more infrastructure than I would like.
Is there a pure python internal way?
Maybe you can use MQTT for that?
You did not clearly specify, but it sounds like the observer pattern -
or do you want the clients to poll each time they need data?
It depends on which delays / data rate / jitter etc. you can accept.
After you provided the information:
The whole setup runs on one machine in one process space. What I would like to have is a way without going through a third-party process
I would suggest checking out the observer pattern.
More information can be found, for example, at:
https://www.youtube.com/watch?v=_BpmfnqjgzQ&t=1882s
and
https://refactoring.guru/design-patterns/observer/python/example
and
https://www.protechtraining.com/blog/post/tutorial-the-observer-pattern-in-python-879
and
https://python-3-patterns-idioms-test.readthedocs.io/en/latest/Observer.html
Your server should fork for each new connection and register with the observer, and will therefore be informed about every change.
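For completeness, a minimal pure-Python sketch of that idea, assuming the publisher lives in the data-collector process and each child-server reads from its own multiprocessing.Queue (all names below are made up):

from multiprocessing import Queue

class SensorPublisher:
    def __init__(self):
        self.subscribers = []          # one Queue per child-server

    def register(self):
        q = Queue()
        self.subscribers.append(q)
        return q                       # the child-server reads from this queue

    def unregister(self, q):
        self.subscribers.remove(q)

    def publish(self, sample):
        # every active subscriber gets every sample
        for q in self.subscribers:
            q.put(sample)

Note that the subscriber list itself is not shared between processes, so each per-child queue has to be created before the child process is started, or handed to the collector through a control queue.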
I have a function I'm calling with multiprocessing.Pool
Like this:
from multiprocessing import Pool

def ingest_item(id):
    # goes and does a lot of network calls
    # adds a bunch to a remote db
    return None

if __name__ == '__main__':
    p = Pool(12)
    thing_ids = range(1000000)
    p.map(ingest_item, thing_ids)
The list pool.map is iterating over contains around 1 million items, and each ingest_item() call goes and calls 3rd-party services and adds data to a remote PostgreSQL database.
On a 12-core machine this processes ~1,000 pool.map items in 24 hours. CPU and RAM usage is low.
How can I make this faster?
Would switching to threads make sense, as the bottleneck seems to be network calls?
Thanks in advance!
First: remember that you are performing a network task. You should expect your CPU and RAM usage to be low, because the network is orders of magnitude slower than your 12-core machine.
That said, it's wasteful to have one process per request. If you start experiencing issues from starting too many processes, you might try pycurl, as suggested here: Library or tool to download multiple files in parallel
This pycurl example looks very similar to your task https://github.com/pycurl/pycurl/blob/master/examples/retriever-multi.py
It is unlikely that using threads will substantially improve performance. This is because no matter how much you break up the task, all requests have to go through the network.
To improve performance you might want to see if the 3rd party services have some kind of bulk request API with better performance.
If your workload permits it, you could attempt to use some kind of caching. However, from your explanation of the task it sounds like that would have little effect, since you're primarily sending data, not requesting it. You could also consider caching open connections (if you aren't already doing so); this helps avoid the very slow TCP handshake. This type of caching is often used in web browsers (e.g. Chrome).
Disclaimer: I have no Python experience
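If those 3rd-party calls go over HTTP, one hedged sketch of the connection-caching idea is a per-worker requests.Session created in a Pool initializer (the endpoint and payload below are placeholders):

from multiprocessing import Pool
import requests

session = None

def init_worker():
    # one Session per worker process; it keeps TCP connections alive
    global session
    session = requests.Session()

def ingest_item(id):
    # placeholder endpoint and payload; the Session reuses the connection
    session.post("https://example.com/api/items", json={"id": id})

if __name__ == '__main__':
    with Pool(12, initializer=init_worker) as p:
        p.map(ingest_item, range(1000000))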
I'm working on a project to parallelize some heavy simulation jobs. Each run takes about two minutes, takes 100% of the available CPU power, and generates over 100 MB of data. In order to execute the next step of the simulation, those results need to be combined into one huge result.
Note that this will be run on performant systems (currently testing on a machine with 16 GB ram and 12 cores, but will probably upgrade to bigger HW)
I can use a celery job group to easily dispatch about 10 of these jobs, and then chain that into the concatenation step and the next simulation (essentially a Celery chord). However, I need to be able to run at least 20 on this machine, and eventually 40 on a beefier machine. It seems that Redis doesn't allow for large enough objects on the result backend for me to do anything more than 13 of them. I can't find any way to change this behavior.
I am currently doing the following, and it works fine:
test_a_group = celery.group(test_a.s(x) for x in ['foo', 'bar'])
test_a_result = test_a_group.apply_async(add_to_parent=False)
return test_b(test_a_result.get())
What I would rather do:
return chord(test_a_group)(test_b.s())
The second one works for small datasets, but not large ones. It gives me a non-verbose 'Celery ChordError 104: connection refused' with large data.
Test B returns very small data, essentially a pass/fail, and I am only passing the group result into B, so it should work, except that I think the entire group result is being appended to the result of B in the form of its parent, making it too big. I can't find out how to prevent this from happening.
The first one works great, and I would be okay, except that it complains, saying:
[2015-01-04 11:46:58,841: WARNING/Worker-6] /home/cmaclachlan/uriel-venv/local/lib/python2.7/site-packages/celery/result.py:45:
RuntimeWarning: Never call result.get() within a task!
See http://docs.celeryq.org/en/latest/userguide/tasks.html#task-synchronous-subtasks
In Celery 3.2 this will result in an exception being
raised instead of just being a warning.
warnings.warn(RuntimeWarning(E_WOULDBLOCK))
What the link essentially suggests is what I want to do, but can't.
I think I read somewhere that Redis has a limit of 500 mb on size of data pushed to it.
Any advice on this hairiest of problems?
Celery isn't really designed to address this problem directly. Generally speaking, you want to keep the inputs/outputs of tasks small.
Every input or output has to be serialized (pickle by default) and transmitted through the broker, such as RabbitMQ or Redis. Since the broker needs to queue the messages when there are no clients available to handle them, you end up potentially paying the hit of writing/reading the data to disk anyway (at least for RabbitMQ).
Typically, people store large data outside of celery and just access it within the tasks by URI, ID, or something else unique. Common solutions are to use a shared network file system (as already mentioned), a database, memcached, or cloud storage like S3.
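A hedged sketch of that pattern for the tasks in the question, assuming a shared filesystem path both workers can see (the broker URL, the path, and the run_simulation/combine helpers are all made up for illustration):

import uuid
from celery import Celery, chord

app = Celery("sims", broker="redis://localhost", backend="redis://localhost")

@app.task
def test_a(x):
    # write the large (100+ MB) result to shared storage...
    path = "/mnt/shared/%s.dat" % uuid.uuid4()
    with open(path, "wb") as f:
        f.write(run_simulation(x))     # placeholder for the heavy step
    return path                        # ...and only send the small path through Redis

@app.task
def test_b(paths):
    # read and concatenate the per-run results from the files, not the backend
    return combine(paths)              # placeholder, returns the small pass/fail

result = chord(test_a.s(x) for x in ["foo", "bar"])(test_b.s())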
You definitely should not call .get() within a task because it can lead to deadlock.
Say I have a list of 1,000 unique URLs, and I need to open each one and assert that something on the page is there. Doing this sequentially is obviously a poor choice, as most of the time the program will be sitting idle just waiting for a response. So I added a thread pool where each worker reads from a main Queue and opens a URL to do the check. My question is, how big do I make the pool? Is it based on my network bandwidth, or some other metric? Are there any rules of thumb for this, or is it simply trial and error to find an effective size?
This is more of a theoretical question, but here's the basic outline of the code I'm using.
if __name__ == '__main__':
    # get the stuff I've already checked
    ID = 0
    already_checked = [i[ID] for i in load_csv('already_checked.csv')]
    # make sure I don't duplicate the effort
    to_check = load_csv('urls_to_check.csv')
    links = [url[:3] for url in to_check if url[ID] not in already_checked]

    in_queue = Queue.Queue()
    out_queue = Queue.Queue()

    threads = []
    for i in range(5):
        t = SubProcessor(in_queue, out_queue)
        t.setDaemon(True)
        t.start()
        threads.append(t)

    writer = Writer(out_queue)
    writer.setDaemon(True)
    writer.start()

    for link in links:
        in_queue.put(link)
Your best bet is probably to write some code that runs some tests using the number of threads you specify, and see how many threads produce the best result. There are too many variables (speed of processor, speed of the buses, thread overhead, number of cores, and the nature of the code itself) for us to hazard a guess.
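A rough sketch of that kind of test, assuming a hypothetical check_url() that fetches the page and does the assertion:

import time
from concurrent.futures import ThreadPoolExecutor

def benchmark(urls, pool_sizes=(5, 10, 20, 50, 100)):
    # time the same batch of URLs at several pool sizes and compare
    for size in pool_sizes:
        start = time.time()
        with ThreadPoolExecutor(max_workers=size) as pool:
            list(pool.map(check_url, urls))   # check_url is a placeholder
        print("%3d threads: %.1f s" % (size, time.time() - start))

Repeated runs against the same URLs can be skewed by DNS and server-side caching, so treat the numbers as a rough guide.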
My experience (using .NET, but it should apply to any language) is that DNS resolution ends up being the limiting factor. I found that a maximum of 15 to 20 concurrent requests is all that I could sustain. DNS resolution is typically very fast, but sometimes can take hundreds of milliseconds. Without some custom DNS caching or other way to quickly do the resolution, I found that it averages about 50 ms.
If you can do multi-threaded DNS resolution, 100 or more concurrent requests is certainly possible on modern hardware (a quad-core machine). How your OS handles that many individual threads is another question entirely. But, as you say, those threads are mostly doing nothing but waiting for responses. The other consideration is how much work those threads are doing. If it's just downloading a page and looking for something specific, 100 threads is probably well within the bounds of reason. Provided that "looking" doesn't involve much more than just parsing an HTML page.
Other considerations involve the total number of unique domains you're accessing. If those 1,000 unique URLs are all from different domains (i.e. 1,000 unique domains), then you have the worst-case scenario: every request will require a DNS resolution (a cache miss).
If those 1,000 URLs represent only 100 domains, then you'll only have 100 cache misses, provided that your machine's DNS cache is reasonable. However, you have another problem: hitting the same server with multiple concurrent requests. Some servers will be very unhappy if you make many (sometimes "many" is defined as "two or more") concurrent requests, or too many requests over a short period of time. So you might have to write code to prevent multiple, or more than X, concurrent requests to the same server. It can get complicated.
One simple way to prevent the multiple-requests problem is to sort the URLs by domain and then ensure that all the URLs from the same domain are handled by the same thread. This is less than ideal from a performance perspective, because you'll often find that one or two domains have many more URLs than the others, and you'll end up with most of the threads idle while those few are plugging away at their very busy domains. You can alleviate these problems by examining your data and assigning the threads' work items accordingly.
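One hedged sketch of that grouping step (using Python 3's urllib.parse; the in_queue and links names are taken from the question's outline):

from collections import defaultdict
from urllib.parse import urlparse

def group_by_domain(urls):
    # bucket URLs by domain so each domain ends up on a single thread
    buckets = defaultdict(list)
    for url in urls:
        buckets[urlparse(url).netloc].append(url)
    return buckets

# each worker would then take whole buckets off the queue, e.g.:
# for bucket in group_by_domain(links).values():
#     in_queue.put(bucket)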