I am writing Python code to expand shortened URLs fetched from Twitter. I have fetched all the URLs and stored them in a text file, one URL per line.
Currently I am using:
response = urllib2.urlopen(url)
return response.url
to expand them.
But the urlopen() method doesn't seem to be very fast in expanding the URLs.
I have around 5.4 million URLs. Is there any faster way to expand them using Python?
I suspect the issue is that network calls are slow and urllib blocks until it gets a response. So, for example, if it takes 200 ms to get a response from the URL-shortening service, you'll only be able to resolve 5 URLs/second using urllib. However, if you use an async library you can send out lots of requests before the first answer comes back; responses are then processed as they arrive. This should dramatically increase your throughput. There are a few Python libraries for this kind of thing (Twisted, gevent, etc.), so you might just want to Google for "Python async rest".
You could also try to do this with lots of threads (urllib2 releases the GIL while it waits on the network, so threads do help here). That wouldn't be as fast as async, but should still speed things up quite a bit.
Both of these solutions introduce quite a bit of complexity, but if you want to go fast...
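For example, here is a minimal thread-pool sketch using concurrent.futures and urllib.request on Python 3 (on Python 2 the futures backport and urllib2.urlopen would play the same roles); the urls.txt filename, the 50 workers and the 10-second timeout are illustrative assumptions, not part of the question:

from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

def expand(short_url):
    # urlopen follows redirects, so geturl() returns the final, expanded URL
    try:
        return short_url, urlopen(short_url, timeout=10).geturl()
    except Exception:
        return short_url, None  # keep going if a single URL fails

with open("urls.txt") as f:
    urls = [line.strip() for line in f if line.strip()]

with ThreadPoolExecutor(max_workers=50) as pool:
    for short, expanded in pool.map(expand, urls):
        print(short, expanded)

With 50 workers and roughly 200 ms per response, that is on the order of 250 URLs/second instead of 5, network and rate limits permitting.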
Related
I am new to Python and want to understand whether this HTTP request is synchronous or asynchronous. Do I need to implement callbacks?
I am using urllib2 module and below is the syntax:
content = urllib2.urlopen(urlnew).read()
On my server there are more than 30,000 records; for each one there will be an HTTP call, and the response received will be stored.
Any help appreciated.
Like most things in Python, unless explicitly mentioned otherwise, urllib2 is synchronous, so execution will block until the server has responded.
So if you want to make 30,000 requests, you will have to make them one after the other. An alternative would be to launch the requests in multiple processes (using multiprocessing) to parallelize the work.
But the better option, especially since you seem to be in control of the server, would be to have it provide some kind of batch request that allows you to query multiple (or all) records at once.
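If batching on the server is not an option, here is a rough sketch of the multiprocessing route mentioned above; the pool size, timeout and placeholder URL list are illustrative assumptions:

import urllib2
from multiprocessing import Pool

def fetch(url):
    # each worker process blocks on its own request, so many run at once
    try:
        return url, urllib2.urlopen(url, timeout=10).read()
    except Exception:
        return url, None

if __name__ == "__main__":
    urls = ["http://example.com/record/%d" % i for i in range(30000)]  # placeholder
    pool = Pool(processes=20)
    for url, content in pool.imap_unordered(fetch, urls):
        pass  # store the response for this record
    pool.close()
    pool.join()

A thread pool would work just as well here, since the workers spend their time waiting on the network rather than burning CPU.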
I wrote a Python web scraper yesterday and ran it in my terminal overnight; it only got through 50k pages. So now I just have a bunch of terminals open, concurrently running the script with different start and end points. This works fine because the main lag is obviously opening web pages, not actual CPU load. Is there a more elegant way to do this, especially if it can be done locally?
You have an I/O bound process, so to speed it up you will need to send requests concurrently. This doesn't necessarily require multiple processors, you just need to avoid waiting until one request is done before sending the next.
There are a number of solutions for this problem. Take a look at this blog post or check out gevent, asyncio (backports to pre-3.4 versions of Python should be available) or another async IO library.
However, when scraping other sites, you must remember: you can send requests very fast with concurrent programming, but depending on what site you are scraping, this may be very rude. You could easily bring a small site serving dynamic content down entirely, forcing the administrators to block you. Respect robots.txt, try to spread your efforts between multiple servers at once rather than focusing your entire bandwidth on a single server, and carefully throttle your requests to single servers unless you're sure you don't need to.
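As a rough sketch of the concurrent approach with throttling, on a modern Python (aiohttp is a third-party package and an assumption here, as are the concurrency cap and placeholder URLs):

import asyncio
import aiohttp

CONCURRENCY = 20  # cap on in-flight requests; tune per target site

async def fetch(session, sem, url):
    async with sem:  # the semaphore throttles how hard the servers are hit
        try:
            async with session.get(url) as resp:
                return url, await resp.text()
        except Exception:
            return url, None

async def main(urls):
    sem = asyncio.Semaphore(CONCURRENCY)
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, sem, u) for u in urls))

urls = ["http://example.com/page/%d" % i for i in range(1000)]  # placeholder
results = asyncio.run(main(urls))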
I need to build a web crawler that makes requests and brings back complete responses, as quickly as possible.
I come from Java. I used two "frameworks" there and neither fully satisfied my needs.
Jsoup handled requests and responses quickly but returned incomplete data when a page had a lot of information. Apache HttpClient was exactly the opposite: reliable data, but very slow.
I've looked over some Python modules and I'm testing Scrapy. In my searches I couldn't work out whether it is the fastest and returns the data consistently, or whether there is something better, even if more verbose or difficult.
Also, is Python a good language for this purpose?
Thank you in advance.
+1 votes for Scrapy. For the past several weeks I have been writing crawlers of massive car forums, and Scrapy is absolutely incredible, fast, and reliable.
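For a sense of what that looks like, here is a bare-bones spider sketch (the start URL and CSS selectors are made up for illustration); it can be run with scrapy runspider spider.py -o threads.json:

import scrapy

class ForumSpider(scrapy.Spider):
    name = "forum"
    start_urls = ["http://example.com/forum/"]  # placeholder

    def parse(self, response):
        # follow every thread link found on the index page
        for href in response.css("a.thread-title::attr(href)").getall():
            yield response.follow(href, callback=self.parse_thread)

    def parse_thread(self, response):
        yield {"title": response.css("h1::text").get(), "url": response.url}

Scrapy then handles the concurrent downloading, retries and politeness for you (see the CONCURRENT_REQUESTS, DOWNLOAD_DELAY and AUTOTHROTTLE_ENABLED settings).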
Looking for something to "do requests and bring the responses complete and quickly" makes no sense as a requirement:
A. Any HTTP library will give you the complete headers/body the server responds with.
B. How "quick" a web request is, is generally dictated by your network connection and the server's response time, not by the client you are using.
So with those requirements, anything will do.
Check out the requests package. It is an excellent HTTP client library for Python.
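For instance (the URL is a placeholder):

import requests

resp = requests.get("http://example.com/", timeout=10)
resp.raise_for_status()              # raise if the server returned an error status
print(resp.status_code)              # e.g. 200
print(resp.headers["Content-Type"])  # the complete response headers are available
html = resp.text                     # the complete, decoded response body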
Here's my case. I have three tables: Book, Publisher and Price. I have a management command that loops over each book and, for each book, queries the publisher to get the price, which it then stores in the Price table. The request I make to get the price is a very simple HTTP GET or UDP request. Here is what the skeleton of my code looks like:
@transaction.commit_on_success
def handle(self, *args, **options):
    for book in Book.objects.all():
        for publisher in book.publisher_set.all():
            price = check_the_price(publisher.url, book.isbn)
            Price.objects.create(book=book, publisher=publisher, price=price)
The code is simple, but it gets really slow and time consuming when I have 10,000 books. I could easily speed this up by making parallel HTTP requests: with 50 parallel requests this would be done in a jiffy, but I don't know how to structure that code.
My site itself is a very small and lightweight site, and I'm trying to stay away from the RabbitMQ/Celery stuff; it just feels like a big thing to take on right now.
Any recommendations on how to do this while maintaining transactional integrity?
Edit #1: This is used as an analogy for what I'm actually doing. In writing this analogy I forgot to mention that I also need to make a few UDP requests.
You could use the grequests package (the gevent-backed companion to requests), which provides quasi-parallel request processing based on gevent's green threads. It lets you build a number of request objects which are then executed in "parallel". See the sketch below.
Green threads do not actually run in parallel, but cooperatively yield execution control. gevent can patch the standard library's I/O functions (e.g. the ones used by urllib2) to yield control whenever they would otherwise block on I/O. The grequests package wraps that into a single function call which takes a number of requests and returns a number of response objects. It doesn't get much easier than that.
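A minimal sketch, assuming grequests is installed; the placeholder URLs and the pool size of 50 are mine, for illustration:

import grequests

urls = ["http://example.com/price/%d" % isbn for isbn in (1111, 2222, 3333)]  # placeholders
reqs = (grequests.get(u, timeout=10) for u in urls)
responses = grequests.map(reqs, size=50)  # at most 50 requests in flight at once
for url, resp in zip(urls, responses):
    if resp is not None and resp.ok:
        print(url, len(resp.text))

For the transactional-integrity concern, one option is to fetch all the prices first and only then create the Price rows inside the single decorated transaction, so the database writes stay sequential.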
I am trying to implement a basic library to issue HTTP GET requests. My goal is to receive data through socket connections: a minimalistic design to improve performance, used with threads and thread pool(s).
I have a bunch of links which I group by their hostnames, so here's a simple demonstration of input URLs:
hostname1.com - 500 links
hostname2.org - 350 links
hostname3.co.uk - 100 links
...
I intend to use sockets because of performance issues. I intend to use a number of sockets which stay connected (if possible, and it usually is) and issue HTTP GET requests over them. The idea came from urllib's low performance on consecutive requests; then I met urllib3, then I realized it uses httplib, and then I decided to try sockets. So here's what I have accomplished so far:
GETSocket class, SocketPool class, ThreadPool and Worker classes
GETSocket class is a minified, "HTTP GET only" version of Python's httplib.
So, I use these classes like that:
sp = Comm.SocketPool(host, size=self.poolsize, timeout=5)
for link in linklist:
    pool.add_task(self.__get_url_by_sp, self.count, sp, link, results)
    self.count += 1
pool.wait_completion()
The __get_url_by_sp function is a wrapper which calls sp.urlopen and saves the result to the results list. I am using a pool of 5 threads which shares a socket pool of 5 GETSocket instances.
What I wonder is, is there any other possible way that I can improve performance of this system?
I've read about asyncore here, but I couldn't figure out how to reuse the same socket connection with the class HTTPClient(asyncore.dispatcher) provided there.
Another point: I don't know whether I'm using a blocking or a non-blocking socket, which of the two would be better for performance, or how to implement either one.
Please be specific about your experiences; I don't intend to import another library just to do HTTP GET, so I want to code my own tiny library.
Any help appreciated, thanks.
Do this.
Use multiprocessing. http://docs.python.org/library/multiprocessing.html.
Write a worker Process which puts all of the URLs into a Queue.
Write a worker Process which gets a URL from the Queue and does a GET, saving the result to a file and putting the file information into another Queue. You'll probably want multiple copies of this Process; you'll have to experiment to find the correct number.
Write a worker Process which reads file information from the second Queue and does whatever it is that you're trying to do.
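A rough sketch of that pipeline (the process count, URLs and queue payloads are illustrative; for brevity the response body is passed through the queue instead of being saved to a file first):

import urllib2
from multiprocessing import Process, Queue

def producer(url_q, urls, n_fetchers):
    for url in urls:
        url_q.put(url)
    for _ in range(n_fetchers):        # one "stop" sentinel per fetcher
        url_q.put(None)

def fetcher(url_q, result_q):
    while True:
        url = url_q.get()
        if url is None:
            result_q.put(None)         # tell the consumer this fetcher is done
            break
        try:
            result_q.put((url, urllib2.urlopen(url, timeout=10).read()))
        except Exception:
            result_q.put((url, None))

def consumer(result_q, n_fetchers):
    done = 0
    while done < n_fetchers:
        item = result_q.get()
        if item is None:
            done += 1
            continue
        url, body = item
        # ... do whatever it is you're trying to do with the body ...

if __name__ == "__main__":
    urls = ["http://hostname1.com/page/%d" % i for i in range(500)]  # placeholder
    n_fetchers = 10                    # experiment to find the right number
    url_q, result_q = Queue(), Queue()
    procs = [Process(target=producer, args=(url_q, urls, n_fetchers)),
             Process(target=consumer, args=(result_q, n_fetchers))]
    procs += [Process(target=fetcher, args=(url_q, result_q)) for _ in range(n_fetchers)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()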
I finally found an approach that solved my problems. I was using Python 3 for my project, and my only real option was pycurl, so I had to port my project back to the Python 2.7 series.
Using pycurl, I gained:
- Consistent responses to my requests (my script actually has to deal with a minimum of 10k URLs)
- With the use of a ThreadPool class I receive responses as fast as my system can handle them (the received data is processed later, so multiprocessing is not much of a possibility here)
I tried httplib2 first, but realized that it is not as solid on Python 3 as it is on Python 2; by switching to pycurl I lost caching support.
Final conclusion: when it comes to HTTP communication, you may need a tool like (py)curl at your disposal. It is a lifesaver, especially when dealing with loads of URLs (try it sometime for fun: you will get lots of weird responses).
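For reference, a minimal pycurl GET looks roughly like this (the URL is a placeholder); reusing the same Curl handle for several requests to one host also reuses the underlying connection:

import pycurl
from io import BytesIO

buf = BytesIO()
c = pycurl.Curl()
c.setopt(pycurl.URL, "http://example.com/")  # placeholder URL
c.setopt(pycurl.WRITEFUNCTION, buf.write)    # collect the response body into the buffer
c.setopt(pycurl.FOLLOWLOCATION, True)        # follow redirects
c.setopt(pycurl.TIMEOUT, 10)
c.perform()
print(c.getinfo(pycurl.RESPONSE_CODE))
print(c.getinfo(pycurl.EFFECTIVE_URL))
c.close()
body = buf.getvalue()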
Thanks for the replies, folks.