Run a scraping function on 2 threads for 2 URLs - Python

I have the following scraping function already implemented in serial but, because there are multiple URLs with data, I would like to parallelize some of the work. Here is the working serial code:
from bs4 import BeautifulSoup as bs
import requests
edbURL='URL1'
psnURL='URL2'
def urlScraper(URL):
    page=requests.get(URL)
    soup=bs(page.text,'lxml')
    l = ['base_URL'+str(i.a['href']) for i in soup.find_all('div',class_='info')]
    return l
edbs=urlScraper(edbURL)
psns=urlScraper(psnURL)
What I would like is for the two calls to urlScraper(URL) to each get their own thread and run in parallel. I tried using the threads library but only got some big nasty int returns with the following syntax:
edbs = threads.start_new_thread(urlScraper,(edbURL,))
psns = threads.start_new_thread(urlScraper,(psnURL,))
I figure it has something to do with the return in urlScraper(URL); then again, I basically know almost nothing about anything. Thanks for any help, everyone!

multiprocessing is a package that supports spawning processes using an API similar to the threading module. The multiprocessing package offers both local and remote concurrency, effectively side-stepping the Global Interpreter Lock by using subprocesses instead of threads. Due to this, the multiprocessing module allows the programmer to fully leverage multiple processors on a given machine. It runs on both Unix and Windows.
https://docs.python.org/2/library/multiprocessing.html
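For instance, a minimal sketch (assuming urlScraper, edbURL and psnURL are defined exactly as in the question) that runs the two scrapes in parallel worker processes:
from multiprocessing import Pool

if __name__ == '__main__':
    pool = Pool(processes=2)  # one worker process per URL
    # map() blocks until both scrapes finish and returns the two result
    # lists in the same order as the input URLs
    edbs, psns = pool.map(urlScraper, [edbURL, psnURL])
    pool.close()
    pool.join()
Since the work here is I/O bound, swapping the import for from multiprocessing.dummy import Pool gives the same API backed by threads instead of processes, which is usually enough for scraping.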

Related

speeding up urllib.urlretrieve

I am downloading pictures from the internet, and as it turns out, I need to download lots of pictures. I am using a version of the following code fragment (actually looping through the links I intend to download and downloading the pictures):
import urllib
urllib.urlretrieve(link, filename)
I am downloading roughly 1000 pictures every 15 minutes, which is awfully slow given the number of pictures I need to download.
For efficiency, I set a timeout of 5 seconds (still, many downloads last much longer):
import socket
socket.setdefaulttimeout(5)
Besides running a job on a computer cluster to parallelize downloads, is there a way to make the picture download faster / more efficient?
My code above was very naive, as it did not take advantage of multi-threading. It obviously takes time for URL requests to be answered, but there is no reason why the computer cannot make further requests while the proxy server responds.
With the following adjustments you can improve efficiency by 10x, and there are further ways of improving efficiency with packages such as scrapy.
To add multi-threading, do something like the following, using the multiprocessing package:
1) Encapsulate the URL retrieval in a function:
import urllib.request

def geturl(link, i):
    try:
        urllib.request.urlretrieve(link, str(i) + ".jpg")
    except Exception:
        pass  # ignore links that fail
2) Then create a collection with all the URLs, as well as the names you want for the downloaded pictures:
urls = [url1,url2,url3,urln]
names = [i for i in range(0,len(urls))]
3) Import the Pool class from the multiprocessing package and create an object of that class (in a real program you would put all imports at the top of your code):
from multiprocessing.dummy import Pool as ThreadPool
pool = ThreadPool(100)
4) Then use the pool.starmap() method, passing it the function and the arguments of the function:
results = pool.starmap(geturl, zip(urls, names))
note: pool.starmap() works only in Python 3
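Putting the steps together, a minimal runnable sketch (the two links are placeholders for your own list):
from multiprocessing.dummy import Pool as ThreadPool
import urllib.request

def geturl(link, i):
    try:
        urllib.request.urlretrieve(link, str(i) + ".jpg")
    except Exception:
        pass  # skip links that fail or time out

if __name__ == '__main__':
    urls = ['http://example.com/a.jpg', 'http://example.com/b.jpg']  # placeholder links
    names = range(len(urls))
    pool = ThreadPool(100)  # up to 100 downloads in flight
    pool.starmap(geturl, zip(urls, names))  # starmap requires Python 3
    pool.close()
    pool.join()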
When a program enters I/O wait, the execution is paused so that the kernel can perform the low-level operations associated with the I/O request (this is called a context switch) and is not resumed until the I/O operation is completed.
Context switching is quite a heavy operation. It requires us to save the state of our program (losing any sort of caching we had at the CPU level) and give up the use of the CPU. Later, when we are allowed to run again, we must spend time reinitializing our program on the motherboard and getting ready to resume (of course, all this happens behind the scenes).
With concurrency, on the other hand, we typically have a thing called an “event loop” running that manages what gets to run in our program, and when. In essence, an event loop is simply a list of functions that need to be run. The function at the top of the list gets run, then the next, etc.
The following shows a simple example of an event loop:
from Queue import Queue
from functools import partial

eventloop = None

class EventLoop(Queue):
    def start(self):
        while True:
            function = self.get()
            function()

def do_hello():
    global eventloop
    print "Hello"
    eventloop.put(do_world)

def do_world():
    global eventloop
    print "world"
    eventloop.put(do_hello)

if __name__ == "__main__":
    eventloop = EventLoop()
    eventloop.put(do_hello)
    eventloop.start()
If the above seems like something you may use, and you'd also like to see how gevent, tornado, and AsyncIO can help with your issue, then head out to your (university) library, check out High Performance Python by Micha Gorelick and Ian Ozsvald, and read pp. 181-202.
Note: above code and text are from the book mentioned.

Asynchronous multiple web scrapers in Python 2

I have a legacy code base of many specialized web scrapers, all relying on making synchronous requests to web servers, running while True with a sleep statement at the end. This code base is in Python 2, and it's likely not feasible to move to Python 3 and take advantage of Python 3 async features.
Ideally I'd like to rewrite this set of many individual web scraping scripts as a single pipeline featuring the following:
asynchronous web requests (in Python 2)
asynchronous writes to csv
non-blocking sleep statements so that each individual page is scraped at a set frequency
This seems like an easy problem in Python 3, between asyncio and coroutines generally. Can someone recommend how I'd do this, or some example resources for doing it in Python 2?
Thanks for any advice.
What you could do is put each function in a different file; then, when you want them all to go, you can do:
import os
os.system('python file1.py')
os.system('python file2.py')
os.system('python file3.py')
os.system('python file4.py')
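Note that os.system() waits for each script to finish before starting the next, so the calls above still run one after another. To actually launch the scripts at the same time, one option is subprocess (a sketch reusing the file names above):
import subprocess

scripts = ['file1.py', 'file2.py', 'file3.py', 'file4.py']
# Popen returns immediately, so all four scrapers run at the same time
procs = [subprocess.Popen(['python', script]) for script in scripts]
for p in procs:
    p.wait()  # block until every scraper has exited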

How to simultaneously download webpages using python?

I am writing a web scraping application in Python. The website I am scraping has urls of the form www.someurl.com/getPage?id=x where x is a number identifying the page. Now, I am downloading all the pages using urlretrieve
Here is the basic form of my script:
for i in range(1,1001):
    urlretrieve('http://someurl.com/getPage?id='+str(i), str(i)+".html")
Now, my question: is it possible to download the pages simultaneously? Because here I am blocking the script and waiting for each page to download. Can I ask Python to open more than one connection to the server?
Getting some google searches concurrently in Python 2:
from multiprocessing.pool import ThreadPool
from urllib import urlretrieve
def loadpage(x):
    urlretrieve('http://google.com/search?q={}'.format(x), '{}.html'.format(x))
p = ThreadPool(10) # the max number of webpages to get at once
p.map(loadpage, range(50))
You could just as easily use Pool instead of ThreadPool. That would make it run on multiple processes/CPU cores. But since this is IO bound I think the concurrency that threading offers is enough.
No, you cannot simply ask Python to open more than one connection; you have to either use a framework for doing this or program a threaded application yourself.
scrapy is a framework for downloading multiple pages at the same time.
twisted is an asynchronous networking framework, and it handles multiple protocols. It is a lot simpler to just use scrapy, but if you insist on building things yourself, this is probably what you want to use.
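As a rough illustration of the scrapy route (a sketch only, using the URL pattern from the question; a real project would follow the layout described in the scrapy docs):
import scrapy

class PageSpider(scrapy.Spider):
    name = 'pages'
    start_urls = ['http://someurl.com/getPage?id=%d' % i for i in range(1, 1001)]

    def parse(self, response):
        # scrapy fetches the start_urls concurrently and calls parse()
        # with each response as it arrives
        page_id = response.url.split('=')[-1]
        with open(page_id + '.html', 'wb') as f:
            f.write(response.body)
This can be run with scrapy runspider spider.py.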
You could use multi-threading to web scrape, as in the Threading answer linked above, or check a simple threading example like the one sketched below.
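A minimal threading sketch in that spirit (Python 2 style to match the answer above; the URL pattern is taken from the question):
import threading
from urllib import urlretrieve  # Python 2; in Python 3 use urllib.request

def fetch(i):
    urlretrieve('http://someurl.com/getPage?id=%d' % i, '%d.html' % i)

# ten pages at a time as an example; adjust the range as needed
threads = [threading.Thread(target=fetch, args=(i,)) for i in range(1, 11)]
for t in threads:
    t.start()
for t in threads:
    t.join()  # wait for every download to finish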

python 3.4 multiprocessing

This question is asking for advice as well as assistance with some code.
I am currently learning Python with 3.4.
I have built a basic network checking tool: I import items from a text file, and for each of them I want Python to check DNS (using pydns) and ping the IP (using subprocess to call the OS-native ping).
Currently I am checking 5,000 to 9,000 IP addresses, and it is taking a number of hours, approximately 4, to return all the results.
I am wondering if I can use multiprocessing or threading to speed this up, but still return the output to a list so that the rows can be written to a csv file in bulk at the very end of the script.
I am new to Python, so please also tell me if I have overlooked something I should have considered.
Main code
http://pastebin.com/ZS23XrdE
Class
http://pastebin.com/kh65hYhG
You could use multiple threads to run child processes (ping in your case) and collect their output, but it is not necessary. Here's a code example of how to make multiple HTTP requests using a thread pool. Here's code that uses concurrent.futures to make DNS requests concurrently.
You don't need multiple threads/processes to check 5,000-9,000 IPs (DNS, ICMP).
You could use gevent, twisted, or asyncio to make the network connections in the same process.
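For instance, a rough asyncio sketch for the DNS part (Python 3.4 coroutine style; the host names are placeholders for your imported items, and the ping step is left out):
import asyncio

@asyncio.coroutine
def check(loop, host):
    try:
        # resolve the host without blocking the other checks
        info = yield from loop.getaddrinfo(host, 80)
        return host, info[0][4][0]
    except OSError:
        return host, None

loop = asyncio.get_event_loop()
hosts = ['python.org', 'example.com']  # placeholders
tasks = [check(loop, h) for h in hosts]
results = loop.run_until_complete(asyncio.gather(*tasks))
loop.close()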
As most of the work seems I/O-based, you can easily rely on threads.
Take a look at the Executor.map() function in concurrent.futures:
https://docs.python.org/3/library/concurrent.futures.html
You can pass it the list of IPs and the function you want to run against each element; the returned value is, in effect, the list of results of the given function.
In your specific case you can wrap the two worker methods (check_dns_ip and os_ping) in a single function and pass it to the ThreadPoolExecutor.map function, as in the sketch below.
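A rough sketch of that wrapper (check_dns_ip, os_ping and ip_list stand in for the names from your linked code; their exact signatures are assumed here):
from concurrent.futures import ThreadPoolExecutor

def check_host(ip):
    # hypothetical wrapper around the two worker methods from the question
    return ip, check_dns_ip(ip), os_ping(ip)

with ThreadPoolExecutor(max_workers=50) as executor:
    # ip_list is the list of IPs imported from your text file
    results = list(executor.map(check_host, ip_list))
# results keeps the input order and can be written to the csv in bulk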

Parallel data processing in Python

I have an architecture which is basically a queue of URL addresses and some classes to process the content of those URLs. At the moment the code works well, but it is slow to sequentially pull a URL out of the queue, send it to the corresponding class, download the URL content and finally process it.
It would be faster and make better use of resources if, for example, it could read n URLs out of the queue and then spawn n processes or threads to handle the downloading and processing.
I would appreciate if you could help me with these:
What packages could be used to solve this problem?
What other approaches can you think of?
You might want to look into the Python multiprocessing library. With multiprocessing.Pool, you can give it a function and an array, and it will call the function on each value of the array in parallel, using as many or as few processes as you specify.
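A small sketch of that pattern (process_url and the example batch are placeholders for your own download-and-process logic):
from multiprocessing import Pool
import urllib.request

def process_url(url):
    # placeholder for "download the url content and process it":
    # here we just fetch the page and return its length
    data = urllib.request.urlopen(url).read()
    return url, len(data)

if __name__ == '__main__':
    # e.g. a batch of n urls pulled from your queue
    batch = ['http://example.com/1', 'http://example.com/2']
    pool = Pool(processes=len(batch))  # one worker per url in the batch
    results = pool.map(process_url, batch)
    pool.close()
    pool.join()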
If C calls are slow (downloading, database requests, other I/O), you can just use threading.Thread.
If Python code is slow (frameworks, your own logic, non-accelerated parsers), you need to use multiprocessing Pool or Process. That also speeds up Python code, but it is less thread-safe and needs a deeper understanding of how it works in complex code (locks, semaphores).
