I need to write a web crawler that makes requests and returns the responses complete and, if possible, quickly.
I come from Java, where I used two "frameworks", and neither fully satisfied my needs.
Jsoup made the request/response fast but returned incomplete data when the page had a lot of information. Apache HttpClient was exactly the opposite: reliable data, but very slow.
I've looked at some Python modules and I'm testing Scrapy. From my research I couldn't conclude whether it is the fastest option that also returns data consistently, or whether there is something better, even if it is more verbose or difficult.
Also, is Python a good language for this purpose?
Thank you in advance.
+1 for Scrapy. For the past several weeks I have been writing crawlers for massive car forums, and Scrapy is absolutely incredible: fast and reliable.
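To give a sense of how little code a basic crawl takes, here is a minimal spider sketch; the site, CSS selectors, and field names are invented for illustration, and the .get()/response.follow shortcuts assume a reasonably recent Scrapy version:

import scrapy

class ForumSpider(scrapy.Spider):
    name = 'forum'
    start_urls = ['https://example.com/forum']   # placeholder URL

    def parse(self, response):
        # yield one item per thread listed on the page
        for thread in response.css('div.thread'):
            yield {
                'title': thread.css('a.title::text').get(),
                'replies': thread.css('span.replies::text').get(),
            }
        # follow pagination until there is no "next" link
        next_page = response.css('a.next::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)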
Asking for something that will "do requests and bring the responses complete and quickly" doesn't really make sense:
A. Any HTTP library will give you the complete headers/body the server responds with.
B. How "quick" a web request is depends mostly on your network connection and the server's response time, not on the client you are using.
So with those requirements, anything will do.
Check out the requests package. It is an excellent HTTP client library for Python.
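A minimal sketch of what that looks like (the URL is just a placeholder):

import requests

response = requests.get('https://example.com/page', timeout=10)
response.raise_for_status()        # raise if the server returned an error status
print(response.status_code)        # e.g. 200
print(len(response.text))          # the full body; nothing is truncated client-side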
Related
I am working on a system that collects data from REST servers and manipulates it.
One of the requirements is making multiple, frequent API requests. We currently implement this in a somewhat synchronous way. I could easily implement this with threads, but since the system might need to support thousands of requests per second, I think it would be wise to use Twisted's ability to do this efficiently. I have seen this blog post, and the whole idea of a deferred list seems to do the trick, but I am kind of stuck on how to structure my class (I can't wrap my mind around how Twisted works).
Can you outline the structure of a class that runs the event loop, takes a list of URLs and headers, and returns a list of results after making the requests asynchronously?
Do you know of a better way of implementing this in Python?
Sounds like you want to use a Twisted project called treq, which lets you send requests to HTTP endpoints and works a lot like requests. I recently helped a friend here in this thread; my answer there might be of some use to you. If you still need more help, just leave a comment and I'll do my best to update this answer.
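As a rough sketch of the shape this can take with treq (the URLs are placeholders, and this assumes treq and Twisted are installed):

import treq
from twisted.internet import defer
from twisted.internet.task import react

@defer.inlineCallbacks
def fetch(url, headers=None):
    # issue one request and collect the body when it arrives
    response = yield treq.get(url, headers=headers)
    body = yield response.text()
    defer.returnValue((url, response.code, len(body)))

def main(reactor, urls):
    # fire all requests concurrently; gatherResults fires once every one has finished
    requests = [fetch(u) for u in urls]
    d = defer.gatherResults(requests)
    d.addCallback(print)
    return d

react(main, [['https://example.com/a', 'https://example.com/b']])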
I am writing Python code to expand shortened URLs fetched from Twitter. I have fetched all the URLs and stored them in a text file, one per line.
Currently I am using:
response = urllib2.urlopen(url)
return response.url
to expand them.
But the urlopen() method doesn't seem to be very fast at expanding the URLs.
I have around 5.4 million URLs. Is there any faster way to expand them using Python?
I suspect the issue is that network calls are slow and urllib2 blocks until it gets a response. For example, if it takes 200 ms to get a response from the URL-shortening service, you'll only be able to resolve 5 URLs per second with urllib2. If you use an async library instead, you can send out lots of requests before the first answer comes back; responses are then processed as they arrive. This should dramatically increase your throughput. There are a few Python libraries for this kind of thing (Twisted, gevent, etc.), so you might just want to Google for "Python async rest".
You could also try doing this with lots of threads (I think urllib2 releases the GIL while it waits for a response, but I'm not sure). That wouldn't be as fast as async, but it should still speed things up quite a bit.
Both of these solutions introduce quite a bit of complexity, but if you want to go fast...
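For illustration, a sketch of the threaded approach that matches the Python 2 / urllib2 code in the question (the filename and pool size are placeholders to tune):

import urllib2
from multiprocessing.dummy import Pool  # a thread pool, despite the module name

def expand(short_url):
    try:
        return short_url, urllib2.urlopen(short_url, timeout=10).url
    except Exception:
        return short_url, None           # keep going if one URL fails

with open('urls.txt') as f:              # one short URL per line
    urls = [line.strip() for line in f if line.strip()]

pool = Pool(50)                          # 50 concurrent workers; tune for your connection
for short, expanded in pool.imap_unordered(expand, urls):
    print short, expanded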
I'm about to scrape some 50,000 records from a real estate website (with Scrapy).
The programming has been done and tested, and the database properly designed.
But I want to be prepared for unexpected events.
So how do I go about actually running the scrape flawlessly and with minimal risk of failure and loss of time?
More specifically:
Should I carry it out in phases (scraping in smaller batches)?
What should I log, and how?
Which other points of attention should I take into account before launching?
First of all, study the following topics to get a general idea of how to be a good web-scraping citizen:
Web scraping etiquette
Screen scraping etiquette
In general, first make sure you are legally allowed to scrape this particular website and follow its Terms of Use. Also check the website's robots.txt and respect the rules listed there (for example, a Crawl-delay directive may be set). It is also a good idea to contact the website owners and let them know what you are going to do, or to ask for permission.
Identify yourself by explicitly specifying a User-Agent header.
See also:
Is this Anti-Scraping technique viable with Robots.txt Crawl-Delay?
What will happen if I don't follow robots.txt while crawling?
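To cover the identification point in a Scrapy project, a minimal settings sketch (the User-Agent string is only an example; ROBOTSTXT_OBEY is available in recent Scrapy versions and makes Scrapy honor robots.txt automatically):

# settings.py
USER_AGENT = 'acme-research-crawler (+https://example.com/crawler-info)'
ROBOTSTXT_OBEY = True   # have Scrapy fetch and respect the site's robots.txt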
Should I carry it out in phases (scraping in smaller batches)?
This is what the DOWNLOAD_DELAY setting is for:
The amount of time (in secs) that the downloader should wait before downloading consecutive pages from the same website. This can be used to throttle the crawling speed to avoid hitting servers too hard.
CONCURRENT_REQUESTS_PER_DOMAIN and CONCURRENT_REQUESTS_PER_IP are also relevant.
Tweak these settings so you don't hit the website's servers too often.
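For illustration, a sketch of what that might look like (the numbers are placeholders to tune against the site's robots.txt and your own tests; AutoThrottle is an optional Scrapy extension):

# settings.py
DOWNLOAD_DELAY = 2.0                  # seconds to wait between requests to the same site
CONCURRENT_REQUESTS_PER_DOMAIN = 2    # keep per-domain parallelism low
AUTOTHROTTLE_ENABLED = True           # optionally let Scrapy adapt the delay to server load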
What should I log, and how?
The information Scrapy prints to the console is already quite extensive, but you may want to log all the errors and exceptions raised while crawling. I personally like the idea of listening for the spider_error signal; see:
how to process all kinds of exception in a scrapy project, in errback and callback?
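For illustration, a sketch of wiring that signal up inside a spider (the spider name, URL, and handler are made up; self.logger assumes a recent Scrapy version):

import scrapy
from scrapy import signals

class EstateSpider(scrapy.Spider):
    name = 'estate'                                    # hypothetical spider
    start_urls = ['https://example.com/listings']      # placeholder URL

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(EstateSpider, cls).from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.on_spider_error, signal=signals.spider_error)
        return spider

    def on_spider_error(self, failure, response, spider):
        # log the full traceback together with the URL that triggered it
        self.logger.error('Error while processing %s:\n%s',
                          response.url, failure.getTraceback())

    def parse(self, response):
        pass  # normal parsing goes here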
Which other points of attention should I take into account before launching?
You still have several things to think about.
At some point, you may get banned. There is always a reason for this; the most obvious would be that you are still crawling too hard and they don't like it. There are certain techniques/tricks to avoid getting banned, like rotating IP addresses, using proxies, web scraping in the cloud, etc.; see:
Avoiding getting banned
Another thing to worry about might be crawling speed and scaling; at that point you may want to think about distributing your crawling process. This is where scrapyd would help; see:
Distributed crawls
Still, make sure you are not crossing the line and are staying on the legal side.
I want to write data-analysis plugins for a Java interface. This interface potentially runs on different computers. The interface sends commands and the Python program can return large amounts of data. The interface is distributed via a Java Web Start system. Both access the main data from a MySQL server.
What are the different ways to implement the communication, and what are their advantages? Of course I've done some research on the internet, and while there are many suggestions, I still don't know what the differences are or how to decide on one (I have no prior knowledge of any of them).
I've found a suggestion to use sockets, which seems fine. Is it simple to write a server that dedicates a Python analysis process to each connection (temporary data might be kept after one communication request for that particular client)?
I was thinking of learning how to use sockets and passing YAML strings.
Maybe my main question is: how do systems like RabbitMQ, ZeroMQ, CORBA, SOAP, and XML-RPC relate to this, and what are their advantages?
There were also suggestions to use pipes or shared memory, but those wouldn't fit my requirements, would they?
Do any of the methods have advantages for debugging or other peculiarities?
I hope someone can help me understand the technologies and decide on a solution, as it is hard to judge from technical descriptions alone.
(I am not considering solutions like Jython, JEPP, ...)
To offer an opinion on what you've described: it sounds like you are dealing with potentially large data/queries that may take a long time to fetch and serialize, in which case you definitely want something that can handle concurrent connections without stacking up threads. In the Python domain, I can't recommend any networking library other than Twisted.
http://twistedmatrix.com/documents/current/core/examples/
Whether you decide to use vanilla HTTP or your own protocol, Twisted is pretty much the one-stop shop for concurrent networking. Sure, the name gets thrown around a lot and the documentation is Atlantean, but if you take the time to learn it there is very little in the networking domain you cannot accomplish. You can extend the base protocols and factories to make one server that handles your data in a reactor-based event loop and responds to deferred requests when the results are ready.
The serialization format really depends on the nature of the data. Will there be any binary data in the response? Complex types? If so, that rules out JSON, even though it is becoming the most common serialization format. YAML sometimes seems to enjoy a privileged position in the Python community; I haven't used it extensively, since most of my serialization work has been data to be rendered in a frontend with JavaScript.
Message queues are really the most important tool in the toolbox when you need to defer background tasks without hanging the response. They are commonly employed in web apps where the HTTP request should not hang until some complex processing completes, so the UI can render early and count on an implicit "promise" that the processing will happen. They have two important traits: they rely on eventual consistency, in that the work can finish long after the protocol response is sent, and they have fail-safe and try-again directives should a task fail. They are where you turn in the "do this really hard task as soon as you can, and I trust you to get it done" problem domain.
If we are not talking about potentially huge response bodies, and the serialized output contains relatively simple data types, there is nothing wrong with rolling a simple deferred HTTP server in Twisted.
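As a minimal sketch of that last option (the port, resource, and the stand-in "analysis" are placeholders; a real server would kick off the query and fire the deferred with its result):

import json
from twisted.internet import defer, reactor
from twisted.web.resource import Resource
from twisted.web.server import Site, NOT_DONE_YET

class AnalysisResource(Resource):
    isLeaf = True

    def render_GET(self, request):
        # start the (pretend) long-running analysis and answer later,
        # without blocking the reactor
        d = defer.Deferred()
        reactor.callLater(1.0, d.callback, {'result': 42})   # stand-in for real work
        d.addCallback(self._send_response, request)
        return NOT_DONE_YET

    def _send_response(self, data, request):
        request.setHeader(b'Content-Type', b'application/json')
        request.write(json.dumps(data).encode('utf-8'))
        request.finish()

reactor.listenTCP(8080, Site(AnalysisResource()))
reactor.run()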
I am looking for a simple way (i.e., one that does not require me to set up a separate server to handle a message queue) to do long-polling for a small web interface that runs calculations and produces a graph. This is what my web interface needs to do:
User requests a graph/data in a web-interface
Server runs some calculations.
While the server is running the calculations, a small container is updated (likely via AJAX/jQuery) with the calculation progress, similar to what you'd print to a console (e.g. print 'calculating density function...').
Calculation finishes and graph is shown to user.
As the calculation is all done server-side, I'm not really sure how to easily set this up. Obviously I'll want to set up a REST API to handle the polling, which would be easy in Flask. However, I'm not sure how to retrieve the actual updates. An obvious, albeit overly complicated for this purpose, solution would be to set up a message queue and do long-polling, but I'm not sure that's the right approach for something this simple.
Here are my questions:
Is there a way to do this using the file system? Performance isn't a huge issue. Can AJAX/jQuery read messages from a file? Could I save the progress to some .json file?
What about pickling? (I don't really know much about pickling, but maybe I could pickle a message dict and it could be read by an API that is handling the polling).
Is polling even the right approach? Is there a better or more common pattern to handle this?
I have a feeling I'm overcomplicating things, since this kind of thing is common on the web. Quite often I see a little "loading.gif" running while some calculation is going on (for example, in Google Analytics).
Thanks for your help!
I've built several apps like this using just Flask and jQuery. Based on that experience, I'd say your plan is good.
Do not use the filesystem. You will run into JavaScript security issues/protections. In the unlikely event you find reasonable workarounds, you still wouldn't have anything portable or scalable. Instead, use a small local web serving framework, like Flask.
Do not pickle. Use JSON. It's the language of web apps and REST interfaces. jQuery and those nice jQuery-based plugins for drawing charts, graphs and such will expect JSON. It's easy to use, human-readable, and for small-scale apps, there's no reason to go any place else.
Long-polling is fine for what you want to accomplish. Pure HTTP-based apps have some limitations, and WebSockets and similar socket-ish layers like Socket.IO "are the future." But finding good, simple examples of the server-side implementation has, in my experience, been difficult. I've looked hard. There are plenty of examples that want you to set up Node.js, Redis, and other pieces of middleware, but why should we have to set up two or three separate middleware servers? It's ludicrous. So long-polling on a simple, pure-Python web framework like Flask is the way to go, IMO.
The code is a little more than a snippet, so rather than including it here, I've put a simplified example into a Mercurial repository on bitbucket that you can freely review, copy, or clone. There are three parts:
serve.py a Python/Flask based server
templates/index.html a file that is 98% HTML and 2% template, which the Flask-based server renders as HTML
static/lpoll.js a jQuery-based client
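To give a flavor of the pattern without reproducing the repository code, here is a stripped-down sketch; the route names, the fake background calculation, and the polling timeout are all made up, and a production app would track per-task state rather than a single global:

import threading
import time
from flask import Flask, jsonify, request

app = Flask(__name__)
progress = {'step': 'idle'}

def calculate():
    # stand-in for the real calculation, updating progress as it goes
    for step in ('calculating density function...', 'rendering graph...', 'done'):
        progress['step'] = step
        time.sleep(2)

@app.route('/start')
def start():
    threading.Thread(target=calculate).start()
    return jsonify(status='started')

@app.route('/poll')
def poll():
    # long-poll: hold the request open until progress changes (or ~25 s pass);
    # the jQuery client re-issues the request as soon as this one returns
    last = request.args.get('last', '')
    for _ in range(50):
        if progress['step'] != last:
            break
        time.sleep(0.5)
    return jsonify(step=progress['step'])

if __name__ == '__main__':
    app.run(threaded=True)   # the dev server needs threads to hold polls open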
Long-polling was a reasonable work-around before simple, natural support for Web Sockets came to most browsers, and before it was easily integrated alongside Flask apps. But here in mid-2013, Web Socket support has come a long way.
Here is an example, similar to the one above, but integrating Flask and Web Sockets. It runs atop server components from gevent and gevent-websocket.
Note this example is not intended to be a WebSocket masterpiece; it retains a lot of the lpoll structure to make the two easier to compare. But it immediately improves the responsiveness and interactivity of the web app while reducing server overhead.
Update for Python 3.7+
Five years after the original answer, WebSocket has become easier to implement. As of Python 3.7, asynchronous operations have matured into mainstream usefulness, and Python web apps are a perfect use case for them. They can now use async just as JavaScript and Node.js do, leaving behind some of the quirks and complexities of "concurrency on the side." In particular, check out Quart. It retains Flask's API and compatibility with a number of Flask extensions, but is async-enabled. A key side effect is that WebSocket connections can be handled gracefully side-by-side with HTTP connections. For example:
from quart import Quart, websocket

app = Quart(__name__)

@app.route('/')
async def hello():
    return 'hello'

@app.websocket('/ws')
async def ws():
    while True:
        await websocket.send('hello')

app.run()
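As a quick way to try it out (hypothetical client snippet; it assumes the websockets package is installed and the Quart app above is running on its default port 5000):

import asyncio
import websockets

async def main():
    # connect to the Quart /ws endpoint and read one message
    async with websockets.connect('ws://localhost:5000/ws') as ws:
        print(await ws.recv())   # should print 'hello'

asyncio.run(main())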
Quart is just one of the many great reasons to upgrade to Python 3.7.