I wrote a Python web scraper yesterday and ran it in my terminal overnight, but it only got through 50k pages. So now I just have a bunch of terminals open, concurrently running the script over different ranges of start and end points. This works fine because the main lag is obviously waiting on web pages, not actual CPU load. Is there a more elegant way to do this, especially one that can be done locally?
You have an I/O-bound process, so to speed it up you will need to send requests concurrently. This doesn't necessarily require multiple processors; you just need to avoid waiting for one request to finish before sending the next.
There are a number of solutions for this problem. Take a look at this blog post or check out gevent, asyncio (backports to pre-3.4 versions of Python should be available) or another async IO library.
However, when scraping other people's sites, remember: concurrent programming lets you send requests very fast, and depending on the site you are scraping, that can be very rude. You could easily bring down a small site serving dynamic content, forcing its administrators to block you. Respect robots.txt, spread your effort across multiple servers rather than focusing your entire bandwidth on one, and carefully throttle your requests to any single server unless you're sure you don't need to.
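For example, here is a minimal sketch of the asyncio approach (assuming Python 3.7+ and that the third-party aiohttp library is installed; the URL list and concurrency limit are placeholders), using a semaphore to throttle how hard a run hits any one server:

import asyncio
import aiohttp

async def fetch(session, sem, url):
    # The semaphore caps in-flight requests so the scrape stays polite.
    async with sem:
        async with session.get(url) as resp:
            return url, await resp.text()

async def main(urls, limit=20):
    sem = asyncio.Semaphore(limit)
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, sem, u) for u in urls))

urls = ['https://example.com/page/%d' % i for i in range(1000)]  # placeholders
pages = asyncio.run(main(urls))

Because every request waiting on the network yields control to the event loop, one process can keep hundreds of requests in flight instead of one terminal per range.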
Related
I'm using Django with uWSGI. We have 8 processes running, and I have no real indication that our code is thread-safe, as it was never designed with threads in mind.
Recently, we added the ability to get live rates from vendors of a service through their various APIs and display them all at once for the user. The problem is that these are older web-service technologies, and given their response times, it can take up to 10 seconds before all the vendors' rates are acquired (or the request gives up).
This presents a problem. We have a decent amount of traffic on our site, and customers need to look at these rates pretty often. With only 8 processes, it's easy to see how the server can get tied up waiting on these upstream requests, especially while other optimizations still need to be made to make the site faster overall (we're working on that).
We made a separate library for requesting the rates (it should be mostly thread-safe, and if not, should be easy enough to convert), and we can separate out its configuration. So I was thinking of making a separate service with its own threads, perhaps in Twisted, and having the browser contact that service for JSON instead of running it in the main Django server.
Is this solution a good one? Can you think of a better or simpler way to do it? Should I use something other than Twisted, and if so, why?
If you want to use your code in-process with Django, you can simply call out to your Twisted code by using Crochet, which can automatically manage the creation, running, and shutdown of the reactor within whatever WSGI implementation you choose (presuming it behaves like a regular Python process, at least).
Obviously it might be less complex to just run within the Twisted WSGI container :-).
It might also be worth looking at treq for issuing your service-client requests: your new "thread-safe" library will still have the disadvantage of tying up an entire thread for each blocking client, which is a non-trivial amount of memory and additional concurrency overhead, whereas with Twisted you only need to worry about a couple of objects.
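Here is a minimal sketch of that combination (assuming crochet and treq are installed; the vendor URL is a placeholder):

from crochet import setup, wait_for
import treq

setup()  # start the Twisted reactor in a background thread, once per process

@wait_for(timeout=10.0)  # matches the 10-second worst case described above
def get_vendor_rates(url):
    # treq.get returns a Deferred; wait_for blocks the calling (Django)
    # thread until it fires or raises crochet.TimeoutError.
    return treq.get(url).addCallback(treq.json_content)

A Django view can then call get_vendor_rates('https://vendor.example.com/rates') (a placeholder URL) as if it were an ordinary blocking function.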
I have a service running on a local server, written using Python's threading library. Think of it as a kind of web crawler. It uses 50 threads. I want to deploy it on the Amazon Web Services cloud and scale it up to use more threads.
Simply put, I have two queues: Qinput with URLs and Qoutput with page content. The threads pick URLs from Qinput, fetch the content of the web page, and put it into Qoutput.
Question: is it enough to simply increase the number of threads to, say, 500, 5,000, or 50,000, and AWS + Python will handle it? Should I expect the service to run seamlessly, or are there some "standard" design pitfalls I should be aware of when porting a multithreaded service to AWS?
I am aware of the Global Interpreter Lock, although it should not be an issue here, as the main task of the threads is calling outside the interpreter while crawling/scraping pages.
Any single instance has its limit. You will probably be able to spawn quite a lot of threads on your instance, especially if you choose one of the larger ones, but you will get diminishing returns on the additional threads, until adding more no longer improves performance.
However, if you want your system to scale beyond the limits of a single instance, it is best to design it to run on multiple instances. Then your decision is only operational, not technical. Since you are running in the AWS environment, which gives you almost endless operational resources, you should look into it.
You can also check out SQS, which is basically a distributed queue system. It will allow you to synchronize the work of as many instances as you need.
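A minimal sketch of that pattern with boto3 (assuming AWS credentials are configured; the queue URL is a placeholder): one producer enqueues URLs, and any number of worker instances pull from the same queue:

import boto3

sqs = boto3.client('sqs')
QUEUE_URL = 'https://sqs.us-east-1.amazonaws.com/123456789012/crawl-queue'  # placeholder

# Producer: replaces Qinput.
sqs.send_message(QueueUrl=QUEUE_URL, MessageBody='http://example.com/page1')

# Worker loop, run on each instance.
while True:
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10,
                               WaitTimeSeconds=20)  # long polling
    for msg in resp.get('Messages', []):
        url = msg['Body']
        # ... fetch the page and store the result (replacing Qoutput) ...
        sqs.delete_message(QueueUrl=QUEUE_URL,
                           ReceiptHandle=msg['ReceiptHandle'])

Deleting a message only after the fetch succeeds means a crashed worker's URLs reappear on the queue and get retried by another instance.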
The scenario is to save the response of an API request, using an IMDb id as the parameter.
I want to grab all the movie info from IMDb ids tt0000001 to tt9999999.
Right now I'm using gevent to run several threads (gevent.joinall(threads)), but it's not very fast.
Are there other solutions for this kind of problem, like using Celery + RabbitMQ?
For one, you must make sure that you aren't making any blocking calls in your code, as that will block everything else from running and slow the entire system. Causes of blocking include tight loops and IO that hasn't been patched by gevent's or eventlet's monkey patching (e.g. C extensions).
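With gevent, which the question is using, the patching has to happen before the IO modules are imported; a minimal sketch:

# Patch the standard library so sockets yield to the gevent hub instead
# of blocking; this must run before anything imports socket/ssl.
from gevent import monkey
monkey.patch_all()

import requests  # assumed installed; its sockets are now cooperative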
Celery supports eventlet and gevent workers, and that is probably the recommended concurrency option for what you are doing (web-request IO). Celery may not make your code run faster by itself, but it makes it easy to distribute the work across many machines.
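A minimal sketch of such a task (the broker URL and IMDb URL pattern are placeholders; assumes a running RabbitMQ and the requests library):

from celery import Celery
import requests

app = Celery('imdb_scraper', broker='amqp://guest@localhost//')  # placeholder broker

@app.task
def fetch_title(imdb_id):
    # Fetch one title page; the URL pattern here is illustrative only.
    resp = requests.get('https://www.imdb.com/title/%s/' % imdb_id)
    resp.raise_for_status()
    return resp.text

You would then start workers with the gevent pool, e.g. celery -A imdb_scraper worker -P gevent -c 500, and run that same command on as many machines as you need.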
To optimize you should always profile your code to find out what the bottleneck is. It could be many things, e.g. slow network, slow host, slow DNS or something else entirely.
I'm working on a fairly simple CGI with Python, which I'm about to move into Django. The overall setup is pretty standard server-side (i.e. computation is done on the server):
User uploads data files and clicks "Run" button
Server forks jobs in parallel behind the scenes, using lots of RAM and processor power. ~5-10 minutes later (average use case), the program terminates, having created a file of its output and some .png figure files.
Server displays web page with figures and some summary text
I don't think there are going to be hundreds or thousands of people using this at once; however, the computation takes a fair amount of RAM and processor power (each instance forks its most CPU-intensive task using Python's multiprocessing Pool), so even a few simultaneous jobs could overwhelm the server.
I wondered if you know whether it would be worth the trouble to use a queueing system. I came across a Python module called beanstalkc, but its page said it was an "in-memory" queueing system.
What does "in-memory" mean in this context? I worry about memory, not just CPU time, and so I want to ensure that only one job runs (or is held in RAM, whether it receives CPU time or not) at a time.
Also, I was trying to decide whether
the results page (served by the CGI) should show your position in the queue (until the job runs, and then display the actual results page)
OR
the user should submit their email address to the CGI, which will email them a link to the results page when it's complete.
What do you think is the appropriate design methodology for a light traffic CGI for a problem of this sort? Advice is much appreciated.
Definitely use Celery. You can run an AMQP server, or I think you can use the database as a queue for the messages. It allows you to run tasks in the background, and it can use multiple worker machines to do the processing if you want. It can also run database-backed cron jobs if you use django-celery.
It's as simple as this to run a task in the background:
from celery import shared_task  # hooks into your configured Celery app

@shared_task
def add(x, y):
    return x + y
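Calling it is then just a matter of enqueueing (a sketch: .delay() returns immediately with an AsyncResult, and .get() blocks until a worker has finished):

result = add.delay(2, 3)       # queued instantly, runs on a worker
print(result.get(timeout=10))  # prints 5 once a worker processes it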
In one of my projects it's distributing the work over 4 machines, and it works great.
I have a web app that needs both functional and performance testing, and part of the test suite we plan to use is already written in Python. When I first wrote it, I used mechanize as my means of web scraping, but it seems too bulky for what I'm trying to do (either that, or I'm missing something).
The basic layout of what I'm trying to do is as follows. All are objects.
User has Comm (used to be the interface between my stuff and mechanize)
Comm has Browser (holds my CookieJar, urllib2, and BeautifulSoup objects, used to be mechanize)
Browser has Form(s) (used to be mechanize-handled)
Now, as far as threading goes, I have that down. I'll adjust between working around the GIL and running separate Python instances as needed, but suggestions are welcome.
So what I need to do is run threads of users hitting the application and doing various things (logging in, filling out forms, submitting forms for processing, etc.) without making the testing box scream too loudly. My current problem with mechanize seems to be RAM.
Part of what's causing the RAM issue is the need for separate browser instances for each user to keep from overwriting the JSESSIONID cookie every time I do something with a different user.
Much of this might seem trivial, but I'm trying to run thousands of threads here, so little tweaks can mean a lot. Any input is appreciated.
Threading runs into problems with the GIL, more so with more cores. Try using mechanize with eventlet to achieve concurrency (via cooperative green threads rather than OS threads), and also check out multi-mechanize.
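For example, a minimal eventlet sketch (the target URL is a placeholder) that drives thousands of simulated users as green threads in a single process:

import eventlet
eventlet.monkey_patch()  # must run before the IO modules below are imported

import urllib.request

def hit_app(url):
    # Each green thread performs one user transaction.
    with urllib.request.urlopen(url) as resp:
        return resp.status

pool = eventlet.GreenPool(2000)                # thousands of concurrent "users"
urls = ['http://localhost:8080/login'] * 2000  # placeholder target
for status in pool.imap(hit_app, urls):
    assert status == 200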
Have you considered Twisted, the asynchronous library, for at least doing interaction with users?
I actually went without mechanize and used the threading module instead. This allowed for fairly quick transactions, and I also made sure not to do too much inside each thread. Handling login and getting the webapp into the necessary state before spawning threads kept each thread shorter and therefore faster.