I am writing a web scraping application in Python. The website I am scraping has URLs of the form www.someurl.com/getPage?id=x, where x is a number identifying the page. Right now I am downloading all the pages with urlretrieve.
Here is the basic form of my script:
from urllib import urlretrieve  # Python 2

for i in range(1, 1001):
    urlretrieve('http://someurl.com/getPage?id=' + str(i), str(i) + '.html')
Now, my question: is it possible to download the pages simultaneously? As written, the script blocks and waits for each page to finish downloading before starting the next. Can I ask Python to open more than one connection to the server?
Getting some Google searches concurrently in Python 2:
from multiprocessing.pool import ThreadPool
from urllib import urlretrieve

def loadpage(x):
    urlretrieve('http://google.com/search?q={}'.format(x), '{}.html'.format(x))

p = ThreadPool(10)  # the max number of webpages to get at once
p.map(loadpage, range(50))
You could just as easily use Pool instead of ThreadPool; that would make it run on multiple processes/CPU cores. But since this is I/O-bound, I think the concurrency that threading offers is enough.
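Applied to the URLs from the question, the same pattern might look like this (a sketch for Python 2, untested):

from multiprocessing.pool import ThreadPool
from urllib import urlretrieve

def download(i):
    # fetch one page and save it under its id
    urlretrieve('http://someurl.com/getPage?id=' + str(i), str(i) + '.html')

pool = ThreadPool(10)  # at most 10 downloads in flight at once
pool.map(download, range(1, 1001))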
No, you cannot simply ask Python to open more than one connection for you; you have to either use a framework that does this or program a threaded application yourself.
Scrapy is a framework for downloading multiple pages at the same time.
Twisted is an event-driven networking framework, and it handles multiple protocols. It is a lot simpler to just use Scrapy, but if you insist on building things yourself, this is probably what you want to use.
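If you do go the do-it-yourself route, a rough Twisted sketch for the question's URLs could look like this. It uses the old getPage API (deprecated in recent Twisted releases) and is untested:

from twisted.internet import reactor
from twisted.internet.defer import DeferredList
from twisted.web.client import getPage

def save(body, i):
    # write one downloaded page to disk
    with open('%d.html' % i, 'wb') as f:
        f.write(body)

def start():
    deferreds = []
    for i in range(1, 1001):
        d = getPage('http://someurl.com/getPage?id=%d' % i)
        d.addCallback(save, i)
        deferreds.append(d)
    # stop the reactor once every download has finished or failed
    DeferredList(deferreds, consumeErrors=True).addCallback(lambda _: reactor.stop())

reactor.callWhenRunning(start)
reactor.run()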
You could use multithreading to web scrape, as done in the linked Threading question,
OR
you could check the simple threading example at this link.
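A minimal sketch of that approach with the standard threading module (the URLs and the body of fetch are placeholders), for Python 2:

import threading
import urllib2

def fetch(url):
    # download one page; replace this body with your own parsing/saving logic
    html = urllib2.urlopen(url).read()
    print(len(html))

urls = ['http://example.com/page1', 'http://example.com/page2']  # placeholders
threads = [threading.Thread(target=fetch, args=(u,)) for u in urls]
for t in threads:
    t.start()
for t in threads:
    t.join()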
I have the following scraping function already implemented serially, but because there are multiple URLs with data, I would like to parallelize some of the work. Here is the working serial code:
from bs4 import BeautifulSoup as bs
import requests

edbURL = 'URL1'
psnURL = 'URL2'

def urlScraper(URL):
    page = requests.get(URL)
    soup = bs(page.text, 'lxml')
    l = ['base_URL' + str(i.a['href']) for i in soup.find_all('div', class_='info')]
    return l

edbs = urlScraper(edbURL)
psns = urlScraper(psnURL)
What I would like is for the two calls to urlScraper(URL) to each get their own thread and run in parallel. I tried using the thread library but only got back some big nasty int returns (thread identifiers) with the following syntax:
edbs = threads.start_new_thread(urlScraper,(edbURL,))
psns = threads.start_new_thread(urlScraper,(psnURL,))
I figure it has something to do with the return in urlScraper(URL); then again, I basically know almost nothing about anything. Thanks for any help, everyone!
multiprocessing is a package that supports spawning processes using an API similar to the threading module. The multiprocessing package offers both local and remote concurrency, effectively side-stepping the Global Interpreter Lock by using subprocesses instead of threads. Due to this, the multiprocessing module allows the programmer to fully leverage multiple processors on a given machine. It runs on both Unix and Windows.
https://docs.python.org/2/library/multiprocessing.html
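For the two calls in the question, a pool gives you the return values directly, which avoids the raw thread identifiers that start_new_thread hands back. A sketch (untested) using a thread pool from the same multiprocessing package; since the work is I/O-bound, threads are enough, but you could swap in multiprocessing.Pool for separate processes:

from multiprocessing.pool import ThreadPool

pool = ThreadPool(2)  # one worker per URL
edbs, psns = pool.map(urlScraper, [edbURL, psnURL])
pool.close()
pool.join()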
I have a legacy code base of many specialized web scrapers, all relying on making synchronous requests to web servers, running while True with a sleep statement at the end. This code base is in Python 2, and it's likely not feasible to move to Python 3 and take advantage of Python 3 async features.
Ideally I'd like to rewrite this set of many individual web scraping scripts as a single pipeline featuring the following:
asynchronous web requests (in Python 2)
asynchronous writes to csv
non-blocking sleep statements so that each individual page is scraped at a set frequency
This seems like an easy problem in Python 3 with asyncio and coroutines generally. Can someone recommend how I'd do this, or point to some example resources for doing it in Python 2?
Thanks for any advice.
What you could do is put each scraper in its own file; then, when you want them all to go at once, launch each one as a separate process:
import subprocess

# start all four scripts at once; unlike os.system, Popen does not wait for each one to finish
scripts = ['file1.py', 'file2.py', 'file3.py', 'file4.py']
processes = [subprocess.Popen(['python', script]) for script in scripts]
for p in processes:
    p.wait()
I wrote a Python web scraper yesterday and ran it in my terminal overnight; it only got through 50k pages. So now I just have a bunch of terminals open, concurrently running the script with different start and end points. This works fine because the main lag is obviously fetching web pages, not actual CPU load. Is there a more elegant way to do this, especially if it can be done locally?
You have an I/O-bound process, so to speed it up you will need to send requests concurrently. This doesn't necessarily require multiple processors; you just need to avoid waiting until one request is done before sending the next.
There are a number of solutions for this problem. Take a look at this blog post or check out gevent, asyncio (backports to pre-3.4 versions of Python should be available) or another async IO library.
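For example, with gevent on Python 2 the code can stay synchronous-looking (a sketch; the URL list and pool size are placeholders):

from gevent import monkey
monkey.patch_all()  # make the standard socket/urllib2 calls cooperative

import urllib2
from gevent.pool import Pool

def fetch(url):
    # download one page; replace this body with your parsing/saving code
    return urllib2.urlopen(url).read()

urls = ['http://example.com/?page=%d' % i for i in range(1000)]  # placeholders
pool = Pool(20)  # cap concurrency so you don't hammer a single server (see the caveat below)
pages = pool.map(fetch, urls)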
However, when scraping other sites, you must remember: you can send requests very fast with concurrent programming, but depending on what site you are scraping, this may be very rude. You could easily bring a small site serving dynamic content down entirely, forcing the administrators to block you. Respect robots.txt, try to spread your efforts between multiple servers at once rather than focusing your entire bandwidth on a single server, and carefully throttle your requests to single servers unless you're sure you don't need to.
I have defined a Django task (it gets launched using ./manage.py task_name). This task reads a set of objects from the database and performs an operation (usually sending a ping) on each of them, writing each individual result back to the database.
Currently I have a plain for loop, but it's obviously too slow, because it waits for each ping to finish before starting the next one. So my question here is: what's the best way of parallelizing the operations?
As far as I've read, the best way I've found is using Pool from the multiprocessing module, something like the code in this answer.
For your task, which appears pretty simple, multiprocessing is probably the easiest approach, if only because it's already part of the stdlib. You could do it something like this (untested!):
from multiprocessing import Pool

def run_process(record):
    # ping one record and return the result to the parent process
    return ping(record)

pool = Pool(processes=10)
results = pool.map_async(run_process, records)
for r in results.get():
    write_to_database(r)
I would simply recommend Celery.
Write Celery tasks for the operations you want executed in parallel/asynchronously. Let Celery handle the concurrency, and your own code can get rid of the mess of process management.
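A rough sketch of what that could look like (the broker URL, model name, and field are assumptions for illustration; ping is the operation from the question):

from celery import Celery

app = Celery('tasks', broker='redis://localhost:6379/0')  # assumed broker URL

@app.task
def ping_and_store(record_id):
    # load one object, ping it, and write the result back to the database
    record = MyModel.objects.get(pk=record_id)  # MyModel stands in for your model
    record.last_result = ping(record)           # ping() is the existing operation
    record.save()

# in the management command: queue one task per object instead of looping serially
for record in MyModel.objects.all():
    ping_and_store.delay(record.pk)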
I'd say that the best tool would be an event-driven networking engine like the Twisted library.
Unlike multithreading/multiprocessing solutions, event-driven networking engines shine when it comes to intense I/O operations: without context switching or waiting on blocking operations, they use system resources in the most efficient way.
One way to use the Twisted library is to write a Scrapy spider that will handle both the external network calls, like those ping requests you mentioned, and writing the responses back to the database.
A few guidelines for writing such a spider:
To read the spider's list of URLs from the database, see https://gist.github.com/saidimu/1024207
To properly write the responses to the database, see Writing items to a MySQL database in Scrapy
Once you have this spider written, simply launch it from your Django command or straight from the shell:
scrapy crawl <spider name>
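A bare-bones skeleton of such a spider might look like this (get_urls_from_database is a placeholder; the actual database read and write belong in start_requests and an item pipeline, as the links above describe):

import scrapy

class PingSpider(scrapy.Spider):
    name = 'ping_spider'

    def start_requests(self):
        # placeholder: pull the URL list from your database here
        for url in get_urls_from_database():
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        # yield an item; an item pipeline then writes it back to the database
        yield {'url': response.url, 'status': response.status}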
I am trying to implement a basic lib to issue HTTP GET requests. My goal is to receive data through socket connections: a minimalistic design to improve performance, used with threads and thread pool(s).
I have a bunch of links which I group by their hostnames, so here's a simple demonstration of input URLs:
hostname1.com - 500 links
hostname2.org - 350 links
hostname3.co.uk - 100 links
...
I intend to use sockets because of performance concerns: a number of sockets that stay connected (if possible, and it usually is) and issue HTTP GET requests. The idea came from urllib's low performance on continuous requests; then I met urllib3, then I realized it uses httplib, and then I decided to try sockets. So here's what I have accomplished so far:
GETSocket class, SocketPool class, ThreadPool and Worker classes
GETSocket class is a minified, "HTTP GET only" version of Python's httplib.
So, I use these classes like this:
sp = Comm.SocketPool(host, size=self.poolsize, timeout=5)
for link in linklist:
    pool.add_task(self.__get_url_by_sp, self.count, sp, link, results)
    self.count += 1
pool.wait_completion()
The __get_url_by_sp function is a wrapper which calls sp.urlopen and saves the result to the results list. I am using a pool of 5 threads which has a socket pool of 5 GETSocket instances.
What I wonder is: is there any other way I can improve the performance of this system?
I've read about asyncore here, but I couldn't figure out how to use the same socket connection with the provided class HTTPClient(asyncore.dispatcher).
Another point: I don't know whether I'm using a blocking or a non-blocking socket, which of the two would be better for performance, or how to implement either one.
Please be specific about your experiences, I don't intend to import another library to do just HTTP GET so I want to code my own tiny library.
Any help appreciated, thanks.
Do this.
Use multiprocessing. http://docs.python.org/library/multiprocessing.html.
Write a worker Process which puts all of the URLs into a Queue.
Write a worker Process which gets a URL from the Queue and does a GET, saving a file and putting the file information into another Queue. You'll probably want multiple copies of this Process; you'll have to experiment to find the correct number.
Write a worker Process which reads file information from the Queue and does whatever it is you're trying to do.
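A stripped-down sketch of that pipeline (Python 2, untested; urllib2 stands in here for your GETSocket/SocketPool code, and the processing step is a placeholder):

from multiprocessing import Process, Queue
import urllib2

def produce_urls(url_queue, urls, n_downloaders):
    # first worker: push every URL, then one sentinel per downloader
    for url in urls:
        url_queue.put(url)
    for _ in range(n_downloaders):
        url_queue.put(None)

def download(url_queue, file_queue):
    # second worker: GET each URL, save it, pass the filename along
    while True:
        url = url_queue.get()
        if url is None:
            file_queue.put(None)
            break
        data = urllib2.urlopen(url).read()
        filename = url.rsplit('/', 1)[-1] + '.html'
        with open(filename, 'wb') as f:
            f.write(data)
        file_queue.put(filename)

def consume_files(file_queue, n_downloaders):
    # third worker: process each saved file; exits once every downloader is done
    done = 0
    while done < n_downloaders:
        filename = file_queue.get()
        if filename is None:
            done += 1
        else:
            print(filename)  # placeholder for the real processing

if __name__ == '__main__':
    urls = ['http://hostname1.com/page%d' % i for i in range(100)]  # placeholder URLs
    url_q, file_q = Queue(), Queue()
    downloaders = [Process(target=download, args=(url_q, file_q)) for _ in range(5)]
    workers = [Process(target=produce_urls, args=(url_q, urls, len(downloaders))),
               Process(target=consume_files, args=(file_q, len(downloaders)))] + downloaders
    for w in workers:
        w.start()
    for w in workers:
        w.join()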
I finally found a good path to solve my problems. I was using Python 3 for my project and my only option was to use pycurl, so this made me port my project back to the Python 2.7 series.
Using pycurl, I gained:
- Consistent responses to my requests (my script actually has to deal with a minimum of 10k URLs)
- With the ThreadPool class, I am receiving responses as fast as my system can handle (the received data is processed later, so multiprocessing is not much of an option here)
I tried httplib2 first, but realized that it is not as solid on Python 3 as it is on Python 2; by switching to pycurl I lost its caching support.
Final conclusion: when it comes to HTTP communication, one may need a tool like (py)curl at their disposal. It is a lifesaver, especially when dealing with loads of URLs (try it sometime for fun: you will get lots of weird responses).
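For reference, a single pycurl GET looks roughly like this (a sketch; the timeouts are arbitrary, and a function like this can then be mapped over a ThreadPool as described above):

import pycurl
from io import BytesIO  # works on Python 2.7 as well

def fetch(url):
    buf = BytesIO()
    c = pycurl.Curl()
    c.setopt(pycurl.URL, url)
    c.setopt(pycurl.WRITEFUNCTION, buf.write)
    c.setopt(pycurl.FOLLOWLOCATION, True)
    c.setopt(pycurl.CONNECTTIMEOUT, 5)   # arbitrary timeouts
    c.setopt(pycurl.TIMEOUT, 30)
    c.perform()
    status = c.getinfo(pycurl.HTTP_CODE)
    c.close()
    return status, buf.getvalue()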
Thanks for the replies, folks.