I am trying to fetch the data for all transactions for several addresses from an API. Each address can have several pages of transactions, and I only find out how many when I request the first page.
I have methods api.get_address_data(address, page) and api.get_transaction_data(tx).
Synchronous code for what I want to do would look like this:
def all_transaction_data(addresses):
    for address in addresses:
        data = api.get_address_data(address, page=0)
        transactions = data.transactions
        for n in range(1, data.total_pages):
            next_page = api.get_address_data(address, page=n)
            transactions += next_page.transactions
        for tx in transactions:
            yield api.get_transaction_data(tx)
I don't care about the order of transactions received (I will have to reorder them when I have all of them anyway). I can fit all the data in memory, but it's a lot of very small requests, so I'd like to do as much in parallel as possible.
What is the best way to accomplish this? I was playing around with asyncio (the API calls are under my control so I can convert them to async), but I have trouble with interleaving the layers: my best solution can fetch all the addresses first, list all the pages second and finally get all transactions in one big batch. I would like each processing step to be scheduled immediately when the appropriate input data is ready, and the results collected into one big list (or yielded from a single generator).
It seems that I need some sort of open-ended task queue, where task "get-address" fetches the data and enqueues a bunch of "get-pages" tasks, which in turn enqueue "get-transaction" tasks, and only these are then collected into a list of results?
Can this be done with asyncio? Would something like gevent be more suitable, or maybe just a plain ThreadPoolExecutor? Is there a better approach than what I outlined so far?
Note that I'd like to avoid inversion of control flow, or at least hide it as an implementation detail. I.e., the caller of this code should be able to just call for tx in all_transaction_data(), or at worst async for.
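For what it's worth, here is a minimal sketch of the open-ended-pipeline idea with plain asyncio, assuming api.get_address_data and api.get_transaction_data have been converted to coroutines (and Python 3.7+ for asyncio.create_task). Each page fetch is scheduled as soon as the first page reveals total_pages, each transaction fetch as soon as its page arrives, and finished transactions are funnelled into one queue that a single async generator drains:

import asyncio

async def _handle_page(page, out_queue):
    # One sub-task per transaction on this page; results go straight to the queue.
    tx_tasks = [asyncio.create_task(api.get_transaction_data(tx))
                for tx in page.transactions]
    for fut in asyncio.as_completed(tx_tasks):
        await out_queue.put(await fut)

async def _handle_address(address, out_queue):
    # The first page tells us how many more pages there are.
    first = await api.get_address_data(address, page=0)
    page_tasks = [asyncio.create_task(api.get_address_data(address, page=n))
                  for n in range(1, first.total_pages)]
    handlers = [asyncio.create_task(_handle_page(first, out_queue))]
    for fut in asyncio.as_completed(page_tasks):
        # Start fetching a page's transactions as soon as that page arrives.
        handlers.append(asyncio.create_task(_handle_page(await fut, out_queue)))
    await asyncio.gather(*handlers)

async def all_transaction_data(addresses):
    out_queue = asyncio.Queue()

    async def produce():
        try:
            await asyncio.gather(*(_handle_address(a, out_queue) for a in addresses))
        finally:
            await out_queue.put(None)  # sentinel: all producers finished (or failed)

    producer = asyncio.create_task(produce())
    while True:
        tx_data = await out_queue.get()
        if tx_data is None:
            break
        yield tx_data
    await producer  # re-raise any exception from the producers

The caller then just writes async for tx in all_transaction_data(addresses); collecting the results inside a small coroutine run with asyncio.run() would hide the event loop entirely if a synchronous interface is preferred.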
Related
I want to scrape data from a webpage in a more efficient way. I read about concurrent.futures but I have no idea how to use it in my script.
My function to take data from each link takes four arguments:
def scrape_data_for_offer(b, m, url, loc):
then it saves the scraped data to a pandas data frame.
It's called in a loop:
for link, location in cars_link_dict.items():
    scrape_data_for_offer(brand, model, link, location)
and I want to speed up this scraping process.
I tried to solve it like this:
with concurrent.futures.ThreadPoolExecutor(max_workers=50) as executor:
    executor.map(scrape_data_for_offer, brand, model, cars_link_dict.items())
But it doesn't work, do you have any ideas of how to solve this problem?
In your futures case, you're only passing three arguments to the function; the last one is a two-element (url, loc) tuple. So, change your function to:
def scrape_data_for_offer(b, m, info):
    url, loc = info
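A possible call site to go with that signature (not from the original post), assuming brand and model are single values that should accompany every link; itertools.repeat stops executor.map from iterating over them character by character:

import concurrent.futures
import itertools

with concurrent.futures.ThreadPoolExecutor() as executor:
    # zip-style mapping: one (url, loc) pair per offer, brand/model repeated.
    list(executor.map(scrape_data_for_offer,
                      itertools.repeat(brand),
                      itertools.repeat(model),
                      cars_link_dict.items()))

Wrapping the map in list() forces all the work to finish inside the with block and surfaces any exception raised by a worker.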
By the way, the words are "scrape", "scraped" and "scraping". Many, many, many people are using "scrap", "scrapped" and "scrapping", but those words all refer to throwing things away.
As another by the way, the concurrent stuff is not really going to help you. I assume you are using BeautifulSoup for the scraping. BeautifulSoup is all Python code, and the global interpreter lock means that only one of the threads will be able to execute at any given time. You'll get a tiny bit of overlap while waiting for the web site responses to be delivered.
Also, running 50 workers is pointless unless you have 50 processors. They'll all fight for resources. If you have 8 processors, use about 12 workers. In most cases, you should just leave off that parameter; it will default to the number of processors in your machine.
I'm sorry if this question has in fact been asked before. I've searched around quite a bit and found pieces of information here and there but nothing that completely helps me.
I am building an app on Google App Engine in Python that lets a user upload a file, which is then processed by a piece of Python code, and the resulting processed file gets sent back to the user in an email.
At first I used a deferred task for this, which worked great. Over time I've come to realize that since the processing can take more than the 10 minutes I have before I hit the DeadlineExceededError, I need to be more clever.
I therefore started to look into task queues, wanting to make a queue that processes the file in chunks, and then piece everything together at the end.
My present code for creating the single deferred task looks like this:
_ = deferred.defer(transform_function, filename, from_, to, email)
so that the transform_function code gets the values of filename, from_ (renamed here because from is a reserved word in Python), to and email, and sets off to do the processing.
Could someone please enlighten me as to how I turn this into a linear chain of tasks that get acted on one after the other? I have read all the documentation on Google App Engine that I can find, but it is unfortunately not written in enough detail in terms of actual pieces of code.
I see references to things like:
taskqueue.add(url='/worker', params={'key': key})
but since I don't have a url for my task, but rather a transform_function() implemented elsewhere, I don't see how this applies to me…
Many thanks!
You can just keep calling deferred to run your task when you get to the end of each phase.
Other queues just allow you to control the scheduling and rate, but work the same.
I track the elapsed time in the task, and when I get near the end of the processing window the code stops what it is doing and calls defer for the next task in the chain, or continues where it left off, depending on whether it's a discrete set of steps or a continuous chunk of work. This was all written back when tasks could only run for 60 seconds.
However, the problem you will face (it doesn't matter if it's a normal task queue or deferred) is that each stage could fail for some reason and then be re-run, so each phase must be idempotent.
For long-running chained tasks, I construct an entity in the datastore that holds the description of the work to be done and tracks the processing state for the job; then you can just keep re-running the same task until completion. On completion it marks the job as complete.
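A rough sketch of that pattern, assuming hypothetical helpers process_next_chunk and mark_job_complete that read and update the job entity described above (none of these names come from the original code):

import time
from google.appengine.ext import deferred

TIME_BUDGET = 8 * 60  # seconds; stay safely under the 10-minute task limit

def transform_in_chunks(job_id):
    start = time.time()
    while time.time() - start < TIME_BUDGET:
        done = process_next_chunk(job_id)   # hypothetical: one idempotent slice of work
        if done:
            mark_job_complete(job_id)       # hypothetical: flag the job entity as finished
            return
    # Out of time for this task: chain the next one; the job entity
    # records where to pick up, so re-runs are safe.
    deferred.defer(transform_in_chunks, job_id)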
To avoid the 10-minute timeout you can direct the request to a backend or a B-type module using the "_target" param.
BTW, is there any reason you need to process the chunks sequentially? If all you need is some notification upon completion of all chunks (so you can "piece everything together at the end"), you can implement it in various ways. For example, each chunk's deferred task can decrement a shared datastore counter (read the state, decrement and update, all in the same transaction) that was initialized with the number of chunks; if the datastore update was successful and the counter has reached zero, you can proceed with combining all the pieces together. An alternative to deferred that would simplify the suggested workflow is pipelines (https://code.google.com/p/appengine-pipeline/wiki/GettingStarted).
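A minimal sketch of that shared-counter transaction with ndb (the model and function names are illustrative, not part of the answer):

from google.appengine.ext import deferred, ndb

class ChunkCounter(ndb.Model):
    remaining = ndb.IntegerProperty()

@ndb.transactional
def chunk_done(counter_key):
    # Read, decrement and write in one transaction so concurrent chunks can't race.
    counter = counter_key.get()
    counter.remaining -= 1
    counter.put()
    return counter.remaining == 0

# At the end of each chunk's deferred task:
#     if chunk_done(counter_key):
#         deferred.defer(piece_everything_together, job_id)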
I'm planning to use Celery to handle sending push notifications and emails triggered by events from my primary server.
These tasks require opening a connection to an external server (GCM, APS, email server, etc). They can be processed one at a time, or handled in bulk with a single connection for much better performance.
Often there will be several instances of these tasks triggered separately in a short period of time. For example, in the space of a minute, there might be several dozen push notifications that need to go out to different users with different messages.
What's the best way of handling this in Celery? It seems like the naïve way is to simply have a different task for each message, but that requires opening a connection for each instance.
I was hoping there would be some sort of task aggregator allowing me to process e.g. 'all outstanding push notification tasks'.
Does such a thing exist? Is there a better way to go about it, for example like appending to an active task group?
Am I missing something?
Robert
I recently discovered and have implemented the celery.contrib.batches module in my project. In my opinion it is a nicer solution than Tommaso's answer, because you don't need an extra layer of storage.
Here is an example straight from the docs:
A click counter that flushes the buffer every 100 messages, or every
10 seconds. Does not do anything with the data, but can easily be
modified to store it in a database.
from celery.contrib.batches import Batches

# Flush after 100 messages, or 10 seconds.
@app.task(base=Batches, flush_every=100, flush_interval=10)
def count_click(requests):
    from collections import Counter
    count = Counter(request.kwargs['url'] for request in requests)
    for url, click_count in count.items():
        print('>>> Clicks: {0} -> {1}'.format(url, click_count))
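Callers invoke the task as usual; the Batches base class buffers the calls until one of the flush conditions is hit and then hands count_click a list of SimpleRequest objects, which is why the task reads request.kwargs['url']. The URL below is just a placeholder:

count_click.delay(url='http://example.com/some/page')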
Be wary though: it works fine for my usage, but the documentation mentions that it is an "Experimental task class". This might deter some from using a feature with such a volatile description :)
An easy way to accomplish this is to write all the actions a task should take to persistent storage (e.g. a database) and let a periodic job do the actual processing in one batch (with a single connection).
Note: make sure you have some locking in place to prevent the queue from being processed twice!
There is a nice example of how to do something similar at the kombu level (http://ask.github.com/celery/tutorials/clickcounter.html)
Personally I like the way sentry does something like this to batch increments at db level (sentry.buffers module)
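A rough sketch of that buffer-and-flush idea, assuming a Django model PendingNotification and a send_bulk helper that pushes everything over one connection (both names are assumptions, not part of the answer); the task would be scheduled, say, every minute with celery beat:

from celery import shared_task
from django.db import transaction

@shared_task
def flush_pending_notifications():
    with transaction.atomic():
        # select_for_update is the lock: an overlapping run blocks on these rows
        # instead of sending the same notifications twice.
        pending = list(PendingNotification.objects
                       .select_for_update()
                       .filter(sent=False))
        if not pending:
            return
        send_bulk(pending)  # hypothetical: one connection, many messages
        PendingNotification.objects.filter(
            pk__in=[p.pk for p in pending]).update(sent=True)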
I'm building an application based around a task queue: it serves a series of tasks to multiple, asynchronously connected clients. The twist is that the tasks must be served in a random order.
My problem is that the algorithm I'm using now is computationally expensive, because it relies on many large queries and transfers from the database. I have a strong hunch that there's a cheaper way to achieve the same result, but I can't quite see the solution. Can you think of a clever fix for this problem?
Here's the (computationally expensive) algorithm I'm using now:
When the client queries for a new task...
1. Query the database for "unfinished" tasks
2. Put all tasks in a list
3. Shuffle the list (using random.shuffle)
4. Flag the first task as "in progress"
5. Send the task parameters to the client for completion
When the client finishes the task...
6a. Record the result and flag the task as "finished."
If the client fails to finish the task by some deadline...
6b. Re-flag the task as "unfinished."
Seems like we could do better by replacing steps 1, 2, and 3, with pseudorandom sequences or hash functions. But I can't quite figure out the whole solution. Ideas?
Other considerations:
In case it's important, I'm using python and mongodb for all of this. (Mongodb doesn't have some clever "use find_one to efficiently return a random matching entry" usage, does it?)
The term "queue" is a little misleading. All the tasks are stored in subfields of a single collection within the mongodb. The length (total number of tasks) in the collection is known and fixed at the outset.
If it's necessary, it might be okay to let the same task be assigned multiple times, as long as the occurrence is rare. But instances of this kind would need to be very rare, because completing each task is costly.
I have identifying information on each client, so we know exactly who originates each task request.
There is an easy way to get a random document from MongoDB!
See Random record from MongoDB
If you don't want a task to be picked twice, you could mark the task as active and not select it.
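A sketch of how that could look with pymongo, combining the random pick and the "mark as active" flag in one atomic call; it assumes each task document was stored with a random rand value in [0, 1), which is the usual trick behind picking a random record from MongoDB:

import random
import pymongo

client = pymongo.MongoClient()
tasks = client.taskdb.tasks  # assumed collection of task documents

def checkout_random_task():
    r = random.random()
    # Atomically pick the nearest unfinished task at or above r and flag it,
    # so no two clients can grab the same document.
    task = tasks.find_one_and_update(
        {'state': 'unfinished', 'rand': {'$gte': r}},
        {'$set': {'state': 'in progress'}},
        sort=[('rand', pymongo.ASCENDING)])
    if task is None:
        # Wrap around in case r landed above every remaining rand value.
        task = tasks.find_one_and_update(
            {'state': 'unfinished', 'rand': {'$lt': r}},
            {'$set': {'state': 'in progress'}},
            sort=[('rand', pymongo.DESCENDING)])
    return task  # None means there are no unfinished tasks left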
Ah, based on the comments that I missed, you can do something along these lines:
import random

available = list(range(lengthofdatabase))
inprogress = []
while available:
    taskindex = available.pop(random.randrange(len(available)))
    # I'm not sure of your implementation, but you said something
    # along these lines was possible
    task = GetTask(taskindex)
    inprogress.append(taskindex)
I'm not sure of any of the functions you are using - this is just an algorithm.
Happy Coding!
Here's my case. I have three tables: Book, Publisher and Price. I have a management command that loops over each book and, for each book, queries the publisher to get the price, which it then stores in the Price table. It's a very simple HTTP GET or UDP request that I make to get the price. Here is what the skeleton of my code looks like:
@transaction.commit_on_success
def handle(self, *args, **options):
    for book in Book.objects.all():
        for publisher in book.publisher_set.all():
            price = check_the_price(publisher.url, book.isbn)
            Price.objects.create(book=book, publisher=publisher, price=price)
The code is simple, but it gets really slow and time consuming when I have 10000 books. I could easily speed this up by making parallel HTTP requests. With 50 parallel requests this would be done in a jiffy, but I don't know how to structure this code.
My site itself is a very small and light-weight site, and I'm trying to stay away from the RabbitMQ/Celery stuff. I just feel it's a big thing to take on right now.
Any recommendations on how to do this while maintaining transactional integrity?
Edit #1: This is used as an analogy for what I'm actually doing. In writing this analogy I forgot to mention that I also need to make a few UDP requests.
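For what it's worth, one way to structure this without any extra infrastructure is to push only the network calls into a thread pool and keep the database writes in the main thread, so the command's single transaction stays as it is. A sketch along those lines, reusing the names from the skeleton above (not tested against the real models):

from concurrent.futures import ThreadPoolExecutor

@transaction.commit_on_success
def handle(self, *args, **options):
    pairs = [(book, publisher)
             for book in Book.objects.all()
             for publisher in book.publisher_set.all()]

    def fetch(pair):
        book, publisher = pair
        # Only the slow network call runs in the worker threads.
        return book, publisher, check_the_price(publisher.url, book.isbn)

    with ThreadPoolExecutor(max_workers=50) as executor:
        for book, publisher, price in executor.map(fetch, pairs):
            # Writes happen here, in the main thread, inside the one transaction.
            Price.objects.create(book=book, publisher=publisher, price=price)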
You could use the gevent-based companion to the requests package, which provides quasi-parallel request processing on green threads (this functionality was originally requests.async and now lives in the separate grequests package). It lets you build a number of request objects which are then executed in "parallel". See this example.
Green threads do not actually run in parallel, but cooperatively yield execution control. gevent can patch the standard library's I/O functions (e.g. the ones used by urllib2) to yield control whenever they would otherwise block on I/O. grequests wraps that into a single function call which takes a number of requests and returns a number of response objects. It doesn't get much easier than that.
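A minimal sketch with grequests showing just the fetch side (it assumes the prices can be retrieved with plain GETs; mapping responses back onto books and publishers is left out):

import grequests

urls = [publisher.url
        for book in Book.objects.all()
        for publisher in book.publisher_set.all()]

# Build the request objects lazily, then let gevent run up to 50 at a time.
pending = (grequests.get(url) for url in urls)
responses = grequests.map(pending, size=50)

for response in responses:
    if response is not None:          # None means the request failed
        print(response.status_code, response.url)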