Asynchronous multiple web scrapers in Python 2

I have a legacy code base of many specialized web scrapers, all relying on making synchronous requests to web servers and running in a while True loop with a sleep statement at the end. This code base is in Python 2, and it's likely not feasible to move to Python 3 and take advantage of Python 3's async features.
Ideally I'd like to rewrite this set of many individual web scraping scripts as a single pipeline featuring the following:
asynchronous web requests (in Python 2)
asynchronous writes to CSV
non-blocking sleep statements so that each individual page is scraped at a set frequency
This seems like an easy problem in Python 3 with asyncio and coroutines generally. Can someone recommend how I'd do this, or some example resources for doing it in Python 2?
Thanks for any advice.

What you could do is put each function in a different file, then launch them all at once as separate processes. Note that os.system would block on each script in turn, so subprocess.Popen is used here to start them without waiting:
import subprocess
scripts = ['file1.py', 'file2.py', 'file3.py', 'file4.py']
procs = [subprocess.Popen(['python', s]) for s in scripts]  # all four start immediately
for p in procs:
    p.wait()  # optionally wait for them all to finish
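That launches the existing scripts unchanged. For the single-pipeline rewrite the question asks about (asynchronous requests, non-blocking sleeps, and CSV output, all in Python 2), a gevent-based sketch might look like the following; the URLs, intervals, and output paths are placeholders, and the row extraction is left as a stub:
# Minimal Python 2 sketch using gevent; URLs, intervals and the row extraction are placeholders.
from gevent import monkey
monkey.patch_all()  # patch sockets so requests becomes cooperative
import csv
import gevent
import requests
def scrape_forever(url, interval, out_path):
    while True:
        page = requests.get(url)
        row = [url, len(page.text)]  # placeholder: extract the real fields here
        with open(out_path, 'ab') as f:  # 'ab' mode for csv writing on Python 2
            csv.writer(f).writerow(row)
        gevent.sleep(interval)  # non-blocking: other greenlets keep running
jobs = [
    gevent.spawn(scrape_forever, 'http://example.com/page1', 60, 'page1.csv'),
    gevent.spawn(scrape_forever, 'http://example.com/page2', 300, 'page2.csv'),
]
gevent.joinall(jobs)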

Related

Run a scraping function on 2 threads for 2 URLs Python

I have the following scraping function already implemented in serial, but because there are multiple URLs with data I would like to parallelize some of the work. Here is the working serial code:
from bs4 import BeautifulSoup as bs
import requests
edbURL = 'URL1'
psnURL = 'URL2'
def urlScraper(URL):
    page = requests.get(URL)
    soup = bs(page.text, 'lxml')
    l = ['base_URL' + str(i.a['href']) for i in soup.find_all('div', class_='info')]
    return l
edbs = urlScraper(edbURL)
psns = urlScraper(psnURL)
What I would like is for the two calls to urlScraper(URL) to each get their own thread and run in parallel. I tried using the threads library but only got some big nasty int returns with the following syntax:
edbs = threads.start_new_thread(urlScraper,(edbURL,))
psns = threads.start_new_thread(urlScraper,(psnURL,))
I figure it has something to do with the return in urlScraper(URL), then again, I basically know almost nothing about anything. Thanks for any help everyone!
multiprocessing is a package that supports spawning processes using an API similar to the threading module. The multiprocessing package offers both local and remote concurrency, effectively side-stepping the Global Interpreter Lock by using subprocesses instead of threads. Due to this, the multiprocessing module allows the programmer to fully leverage multiple processors on a given machine. It runs on both Unix and Windows.
https://docs.python.org/2/library/multiprocessing.html
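To get the actual return values back from the two scraper calls (rather than the thread identifiers that start_new_thread hands back), a thread pool sketch along these lines could work; urlScraper, edbURL and psnURL are the names from the question:
# Sketch: run urlScraper on both URLs concurrently and collect the return values in order.
from multiprocessing.pool import ThreadPool
pool = ThreadPool(2)  # one worker thread per URL
edbs, psns = pool.map(urlScraper, [edbURL, psnURL])
pool.close()
pool.join()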

How to simultaneously download webpages using python?

I am writing a web scraping application in Python. The website I am scraping has URLs of the form www.someurl.com/getPage?id=x, where x is a number identifying the page. Currently, I am downloading all the pages using urlretrieve.
Here is the basic form of my script:
from urllib import urlretrieve
for i in range(1, 1001):
    urlretrieve('http://someurl.com/getPage?id=' + str(i), str(i) + '.html')
Now, my question: is it possible to download the pages simultaneously? Here I am blocking the script and waiting for each page to download. Can I ask Python to open more than one connection to the server?
Getting some google searches concurrently in Python 2:
from multiprocessing.pool import ThreadPool
from urllib import urlretrieve
def loadpage(x):
    urlretrieve('http://google.com/search?q={}'.format(x), '{}.html'.format(x))
p = ThreadPool(10) # the max number of webpages to get at once
p.map(loadpage, range(50))
You could just as easily use Pool instead of ThreadPool. That would make it run on multiple processes/CPU cores. But since this is IO bound I think the concurrency that threading offers is enough.
No, you cannot ask Python itself to open more than one connection; you have to either use a framework that does this for you or program a threaded application yourself.
scrapy is a framework for downloading multiple pages at the same time.
twisted is an event-driven networking framework that handles many protocols. It is a lot simpler to just use scrapy, but if you insist on building things yourself, Twisted is probably what you want to use.
You could use multi-threading to web scrape, as shown in the linked Threading example,
OR
you could check the simple threading example at the other link.
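If you do want to build the threaded version yourself rather than use a framework, a bare-bones Python 2 sketch with threading and Queue might look like this; the URL pattern and page count are taken from the question:
# Sketch: worker threads pull page ids from a queue and save each page to disk.
import threading
from Queue import Queue, Empty  # Python 2 names; the module is queue in Python 3
from urllib import urlretrieve
q = Queue()
for i in range(1, 1001):
    q.put(i)
def worker():
    while True:
        try:
            i = q.get_nowait()  # stop when the queue runs dry
        except Empty:
            return
        urlretrieve('http://someurl.com/getPage?id=' + str(i), str(i) + '.html')
threads = [threading.Thread(target=worker) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()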

python 3.4 multiprocessing

This question is asking for advice as well as assistance with some code.
I am currently learning Python, using 3.4.
I have built a basic network checking tool: I import items from a text file, and for each of them I want Python to check DNS (using pydns) and ping the IP (using subprocess to call the OS-native ping).
Currently I am checking 5,000 to 9,000 IP addresses and it's taking a number of hours, roughly 4, to return all the results.
I am wondering if I can use multiprocessing or threading to speed this up but still return the output to a list, so that the rows can be written to a CSV file in bulk at the very end of the script.
I am new to Python, so please tell me if I have overlooked something I should have considered as well.
Main code
http://pastebin.com/ZS23XrdE
Class
http://pastebin.com/kh65hYhG
You could use multiple threads to run child processes (ping in your case) and collect their output, but it is not necessary. Here's a code example of how to make multiple HTTP requests using a thread pool. Here's code that uses concurrent.futures to make DNS requests concurrently.
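As a rough sketch of that thread-pool-plus-child-process idea (the -c flag is the Unix ping count option, and the host list is a placeholder):
# Sketch: a thread pool running the OS ping command and collecting each result.
import subprocess
from multiprocessing.pool import ThreadPool
def ping(ip):
    rc = subprocess.call(['ping', '-c', '1', ip],
                         stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return ip, rc == 0  # returncode 0 means at least one reply came back
ips = ['8.8.8.8', '8.8.4.4', '1.1.1.1']  # placeholder host list
results = ThreadPool(20).map(ping, ips)  # list of (ip, reachable) tuples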
You don't need multiple threads/processes to check 5,000-9,000 IPs (DNS, ICMP).
You could use gevent, twisted, or asyncio to make the network connections in a single process.
As most of the work seems I/O-bound, you can easily rely on threads.
Take a look at the Executor.map() function in concurrent.futures:
https://docs.python.org/3/library/concurrent.futures.html
You can pass the list of IPs and the function you want to run against each element; the return value is effectively the list of results of that function.
In your specific case you can wrap the two worker methods (check_dns_ip and os_ping) in a single function and pass it to the ThreadPoolExecutor.map function.
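A sketch of that wrapping, assuming check_dns_ip and os_ping are the worker methods from the question's class and each takes an IP string:
# Sketch: run both checks for every IP on a thread pool and collect rows for the CSV.
from concurrent.futures import ThreadPoolExecutor
def check_host(ip):
    return ip, check_dns_ip(ip), os_ping(ip)  # the question's worker methods, assumed callable per IP
with ThreadPoolExecutor(max_workers=50) as executor:
    rows = list(executor.map(check_host, ip_list))  # ip_list: the IPs read from the text file
# rows can now be written to the CSV in bulk at the end of the script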

How to run multithreaded Python scripts

I wrote a Python web scraper yesterday and ran it in my terminal overnight. It only got through 50k pages, so now I just have a bunch of terminals open, concurrently running the script with different start and end points. This works fine because the main lag is obviously opening web pages, not actual CPU load. Is there a more elegant way to do this, especially if it can be done locally?
You have an I/O bound process, so to speed it up you will need to send requests concurrently. This doesn't necessarily require multiple processors, you just need to avoid waiting until one request is done before sending the next.
There are a number of solutions for this problem. Take a look at this blog post or check out gevent, asyncio (backports to pre-3.4 versions of Python should be available) or another async IO library.
However, when scraping other sites, you must remember: you can send requests very fast with concurrent programming, but depending on what site you are scraping, this may be very rude. You could easily bring a small site serving dynamic content down entirely, forcing the administrators to block you. Respect robots.txt, try to spread your efforts between multiple servers at once rather than focusing your entire bandwidth on a single server, and carefully throttle your requests to single servers unless you're sure you don't need to.
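One way to combine concurrency with that politeness is a small worker pool plus a delay between requests; the pool size, delay, and URL list below are placeholders:
# Sketch: bounded thread pool with a polite per-worker delay so a single server isn't hammered.
import time
from multiprocessing.dummy import Pool  # thread-backed Pool, works on Python 2 and 3
import requests
POLITE_DELAY = 1.0  # seconds between requests per worker (placeholder)
def fetch(url):
    html = requests.get(url).text
    time.sleep(POLITE_DELAY)  # throttle this worker before its next request
    return url, len(html)
urls = ['http://example.com/page/%d' % i for i in range(1, 101)]  # placeholder URLs
results = Pool(4).map(fetch, urls)  # only 4 requests in flight at once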

Best way of parallelizing functions from django task

I have defined a Django task (it gets launched using ./manage.py task_name). This task reads a set of objects from the database and performs an operation (usually sending a ping) on each of them, writing each individual result back to the database.
Currently I have a plain for loop, but it's obviously too slow, because it waits for each ping to finish before starting the next one. So my question here is: what's the best way of parallelizing the operations?
As far as I've read, the best way I've found is using Pool from the multiprocessing module, something like the code in this answer.
For your task, which appears pretty simple, multiprocessing is probably the easiest approach, if only because it's already part of the stdlib. You could do it something like this (untested!):
from multiprocessing import Pool
def run_process(record):
    return ping(record)
pool = Pool(processes=10)
results = pool.map_async(run_process, records)  # records: the objects read from the database
for r in results.get():
    write_to_database(r)
I would simply recommend Celery.
Write Celery tasks for the operations which you want executed in parallel or asynchronously. Let Celery handle the concurrency, and your own code can be free of the mess of process management.
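A minimal sketch of what such a task might look like, assuming a Redis broker and a hypothetical ping_host helper:
# tasks.py - minimal Celery sketch; the broker URL and ping_host helper are assumptions.
from celery import Celery
app = Celery('tasks', broker='redis://localhost:6379/0')
@app.task
def ping_record(record_id):
    result = ping_host(record_id)  # hypothetical helper that does the actual ping
    return record_id, result
# From the Django management command, fire the tasks asynchronously:
# for record in records:
#     ping_record.delay(record.id)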
I'd say that the best tool would be an event-driven networking engine like the Twisted library.
Unlike multi-threading / multi-processing solutions, event-driven networking engines shine when it comes to intense I/O operations: without context switching or waiting on blocking operations, they use system resources in the most efficient way.
One way to use Twisted is to write a Scrapy spider that handles both the external network calls, like those ping requests you mentioned, and writing the responses back to the database.
A few guidelines for writing such a spider:
to read the spider's list of URLs from the database, see https://gist.github.com/saidimu/1024207
to properly write the responses to the database, see Writing items to a MySQL database in Scrapy
Once you have this spider written, simply launch it from your Django command or straight from the shell:
scrapy crawl <spider name>
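A bare-bones sketch of such a spider, with the model import, field names, and spider name as placeholders:
# Sketch: a Scrapy spider that reads targets from the Django ORM and yields results as items.
import scrapy
class PingSpider(scrapy.Spider):
    name = 'ping_spider'  # placeholder spider name
    def start_requests(self):
        from myapp.models import Target  # hypothetical Django model holding the objects to check
        for target in Target.objects.all():
            yield scrapy.Request(target.url, callback=self.parse,
                                 meta={'target_id': target.pk})
    def parse(self, response):
        # an item pipeline (see the MySQL answer linked above) can persist this to the database
        yield {
            'target_id': response.meta['target_id'],
            'status': response.status,
        }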
