Add multithreading or asynchronous requests to a web scraper - python

What is the best way to implement multithreading to make a web scrape faster?
Would using Pool be a good solution? If so, where in my code would I implement it?
import requests
from multiprocessing import Pool

with open('testing.txt', 'w') as outfile:
    results = []
    for number in (4, 8, 5, 7, 3, 10):
        url = requests.get('https://www.google.com/' + str(number))
        response = (url)
        results.append(response.text)
        print(results)
    outfile.write("\n".join(results))

This can be moved to a pool easily. Python comes with both process-based and thread-based pools. Which to use is a tradeoff: processes are better at parallelizing CPU-bound code but are more expensive when passing results back to the main program. In your case, your code mostly waits on URLs and returns relatively large objects, so a thread pool makes sense.
I moved the code inside an if __name__ == "__main__" guard, as needed on Windows machines.
import requests
from multiprocessing.pool import ThreadPool

def worker(number):
    url = requests.get('https://www.google.com/' + str(number))
    return url.text

# put some sort of cap on outstanding requests...
MAX_URL_REQUESTS = 10

if __name__ == "__main__":
    numbers = (4, 8, 5, 7, 3, 10)
    with ThreadPool(min(len(numbers), MAX_URL_REQUESTS)) as pool:
        with open('testing.txt', 'w') as outfile:
            for result in pool.map(worker, numbers, chunksize=1):
                outfile.write(result)
                outfile.write('\n')

Related

How to run code in parallel with ThreadPoolExecutor?

Hi, I'm really new to threading and it's confusing me. How can I run this code in parallel?
import requests
from concurrent.futures import ThreadPoolExecutor

def search_posts(page):
    page_url = f'https://jsonplaceholder.typicode.com/posts/{page}'
    req = requests.get(page_url)
    res = req.json()
    title = res['title']
    return title

page = 1
while True:
    with ThreadPoolExecutor() as executer:
        t = executer.submit(search_posts, page)
        title = t.result()
        print(title)
    if page == 20:
        break
    page += 1
Another question: do I need to learn about operating systems in order to understand how threading works?
The problem here is that you are creating a new ThreadPoolExecutor for every page. To do things in parallel, create only one ThreadPoolExecutor and use its map method:
import concurrent.futures as cf
import requests

def search_posts(page):
    page_url = f'https://jsonplaceholder.typicode.com/posts/{page}'
    res = requests.get(page_url).json()
    return res['title']

if __name__ == '__main__':
    with cf.ThreadPoolExecutor() as ex:
        results = ex.map(search_posts, range(1, 21))
        for r in results:
            print(r)
Note that using the if __name__ == '__main__' wrapper is a good habit for making your code more portable.
One thing to keep in mind when using threads:
If you are using CPython (the Python implementation from python.org which is the most common one), threads don't actually run in parallel.
To make memory management less complicated, only one thread at a time can be executing Python bytecode in CPython. This is enforced by the Global Interpreter Lock ("GIL") in CPython.
The good news is that using requests to get a web page will spend most of its time using network I/O. And in general, the GIL is released during I/O.
But if you are doing calculations in your worker functions (i.e. executing Python bytecode), you should use a ProcessPoolExecutor instead.
If you use a ProcessPoolExecutor and you are running on ms-windows, then using the if __name__ == '__main__' wrapper is required, because Python has to be able to import your main program without side effects in that case.
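For the CPU-bound case, here is a minimal sketch of the same map pattern with a ProcessPoolExecutor; the crunch worker below is a made-up stand-in for work that executes Python bytecode the whole time:
import concurrent.futures as cf

def crunch(n):
    # hypothetical CPU-bound worker: pure Python arithmetic, holds the GIL
    return sum(i * i for i in range(n))

if __name__ == '__main__':  # required for process pools on ms-windows
    with cf.ProcessPoolExecutor() as ex:
        for total in ex.map(crunch, [10_000, 20_000, 30_000]):
            print(total)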

multiprocessing running 1 thread at a time x amount of times

I am using multiprocessing for a project which involves going to a URL. I have noticed that whenever I use pool.imap_unordered(), whatever my iterator is (let's say it is a list with the numbers 1 and 2, that is 2 numbers), it will run the program once with one thread and then, because there are 2 numbers in the list, it will run another time. I can't seem to figure this out; I thought I understood what everything should be doing. (No, it doesn't run any faster no matter how many threads I have. The args.urls is originally a file; I then convert all the content in the file to a list.) Everything worked fine until I added multiprocessing, so I know it couldn't be an error in my non-multiprocessing-related code.
from multiprocessing import Pool
import multiprocessing
import requests

arrange = [lines.replace("\n", "") for lines in #file ]

def check():
    for lines in arrange:
        requests.get(lines)

def main():
    pool = ThreadPool(4)
    results = pool.imap_unordered(check, arrange)
So I'm not entirely sure what you are trying to do, but maybe this is what you need:
from multiprocessing.pool import ThreadPool
import requests

arrange = [line.replace("\n", "") for line in #file ]

def check(line):
    # remove the loop: since you are using a pool, each call receives a single line
    requests.get(line)

def main():
    pool = ThreadPool(4)
    # this loops through arrange and hands the check function a single line per call
    results = pool.imap_unordered(check, arrange)
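One detail worth adding (my note, not part of the original answer): imap_unordered hands back an iterator of results; iterating over it is the usual way to wait for and collect them, and main() still has to be called. A minimal complete sketch, with placeholder URLs:
from multiprocessing.pool import ThreadPool
import requests

def check(line):
    # one URL per call; return something so we can see progress
    return requests.get(line).status_code

def main(lines):
    with ThreadPool(4) as pool:
        # iterating collects results as they finish and keeps the program alive until done
        for status in pool.imap_unordered(check, lines):
            print(status)

if __name__ == '__main__':
    main(['https://www.example.com', 'https://www.example.org'])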

How to process a list in parallel in Python? [duplicate]

I wrote code like this:
def process(data):
    # create file using data
    ...

all = ["data1", "data2", "data3"]
I want to execute the process function on my all list in parallel. They create small files, so I am not concerned about disk writes, but the processing takes long, so I want to use all of my cores.
How can I do this using the default modules in Python 2.7?
Assuming CPython and the GIL here.
If your task is I/O bound, in general, threading may be more efficient since the threads are simply dumping work on the operating system and idling until the I/O operation finishes. Spawning processes is a heavy way to babysit I/O.
However, most file systems aren't concurrent, so using multithreading or multiprocessing may not be any faster than synchronous writes.
Nonetheless, here's a contrived example of multiprocessing.Pool.map which may help with your CPU-bound work:
from multiprocessing import cpu_count, Pool

def process(data):
    # best to do heavy CPU-bound work here...

    # file write for demonstration
    with open("%s.txt" % data, "w") as f:
        f.write(data)

    # example of returning a result to the map
    return data.upper()

tasks = ["data1", "data2", "data3"]

pool = Pool(cpu_count() - 1)
print(pool.map(process, tasks))
A similar setup for threading can be found in concurrent.futures.ThreadPoolExecutor.
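For reference, a minimal sketch of that equivalent setup with concurrent.futures.ThreadPoolExecutor, reusing the same demonstration worker as above:
from concurrent.futures import ThreadPoolExecutor

def process(data):
    # same demonstration worker: write a small file, return a result
    with open("%s.txt" % data, "w") as f:
        f.write(data)
    return data.upper()

tasks = ["data1", "data2", "data3"]

with ThreadPoolExecutor(max_workers=len(tasks)) as ex:
    print(list(ex.map(process, tasks)))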
As an aside, all is a builtin function and isn't a great variable name choice.
Or:
from threading import Thread

def process(data):
    print("processing {}".format(data))

l = ["data1", "data2", "data3"]

for task in l:
    t = Thread(target=process, args=(task,))
    t.start()
Or (only for Python 3.6+, which supports f-strings):
from threading import Thread

def process(data):
    print(f"processing {data}")

l = ["data1", "data2", "data3"]

for task in l:
    t = Thread(target=process, args=(task,))
    t.start()
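A small caveat for both Thread snippets above (my addition, not the original answer): start() does not wait for the work to finish. If you need all tasks done before moving on, keep the Thread objects and join() them:
from threading import Thread

def process(data):
    print(f"processing {data}")

l = ["data1", "data2", "data3"]

threads = [Thread(target=process, args=(task,)) for task in l]
for t in threads:
    t.start()
for t in threads:
    t.join()  # block until every worker has finished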
Here is a template using multiprocessing.dummy (a thread pool behind the multiprocessing Pool API); hope it is helpful.
from multiprocessing.dummy import Pool as ThreadPool

def process(data):
    print("processing {}".format(data))

alldata = ["data1", "data2", "data3"]

pool = ThreadPool()
results = pool.map(process, alldata)
pool.close()
pool.join()

Handling multiple HTTP requests in Python

I am mining data from a website through data scraping in Python. I am using the requests package for sending the parameters.
Here is the code snippet in Python:
for param in paramList:
    data = get_url_data(param)

def get_url_data(param):
    post_data = get_post_data(param)

    headers = {}
    headers["Content-Type"] = "text/xml; charset=UTF-8"
    headers["Content-Length"] = len(post_data)
    headers["Connection"] = 'Keep-Alive'
    headers["Cache-Control"] = 'no-cache'

    page = requests.post(url, data=post_data, headers=headers, timeout=10)
    data = parse_page(page.content)
    return data
The variable paramList is a list of more than 1000 elements, and the endpoint url remains the same. I was wondering if there is a better and faster way to do this?
Thanks
As there is a significant amount of network I/O involved, threading should improve the overall performance significantly.
You can try using a ThreadPool; test and tweak the number of threads to find the one best suited to the situation, i.e. the one that gives the highest overall performance.
from multiprocessing.pool import ThreadPool

# Remove the 'for param in paramList' iteration

def get_url_data(param):
    ...  # Rest of code here

if __name__ == '__main__':
    pool = ThreadPool(15)
    pool.map(get_url_data, paramList)  # Will split the load between the threads nicely
    pool.close()
I need to make 1000 POST requests to the same domain; I was wondering if
there is a better and faster way to do this?
It depends. If it's a static asset, or a servlet whose behavior you know, and the same parameters return the same response each time, you can implement an LRU cache or some other caching mechanism. If not, 1K POST requests to some servlet is not a problem, even if they all go to the same domain.
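If the responses really are cacheable, here is a minimal sketch of that idea with functools.lru_cache (an assumption: identical parameters always yield the same response, and post_data is hashable, e.g. a string; the fetch helper is hypothetical):
import functools
import requests

@functools.lru_cache(maxsize=1024)
def fetch(url, post_data):
    # repeated calls with the same (url, post_data) reuse the cached response body
    return requests.post(url, data=post_data, timeout=10).content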
There is an answer using multiprocessing's ThreadPool interface, which actually uses the main process with 15 threads. Does it run on a 15-core machine? A core can only run one thread at a time (hyper-threading aside: does it run on 8 hyper-threaded cores?).
The ThreadPool interface lives inside a library with a misleading name, multiprocessing, while Python also has a separate threading module, which is confusing. Let's benchmark some lower-level code:
import psutil
from multiprocessing.pool import ThreadPool
from time import sleep

def get_url_data(param):
    print(param)  # just for convenience
    sleep(1)      # let's assume it will take one second each time

if __name__ == '__main__':
    paramList = [i for i in range(100)]    # 100 urls
    pool = ThreadPool(psutil.cpu_count())  # each core can run one thread (hyper.. not now)
    pool.map(get_url_data, paramList)      # splitting the jobs
    pool.close()
The code above will use the main process with 4 threads in my case, because my laptop has 4 CPUs. Benchmark result:
$ time python3_5_2 test.py
real 0m28.127s
user 0m0.080s
sys 0m0.012s
Let's try spawning processes with multiprocessing:
import psutil
import multiprocessing
from time import sleep
import numpy

def get_url_data(urls):
    for url in urls:
        print(url)
        sleep(1)  # let's assume it will take one second each time

if __name__ == "__main__":
    jobs = []

    # Split URLs into chunks, as many as there are CPUs
    chunks = numpy.array_split(range(100), psutil.cpu_count())

    # Pass each chunk into a process
    for url_chunk in chunks:
        jobs.append(multiprocessing.Process(target=get_url_data, args=(url_chunk,)))

    # Start the processes
    for j in jobs:
        j.start()

    # Ensure all of the processes have finished
    for j in jobs:
        j.join()
Benchmark result: about 3 seconds less.
$ time python3_5_2 test2.py
real 0m25.208s
user 0m0.260s
sys 0m0.216s
If you execute ps -aux | grep "test2.py" while it runs, you will see 5 processes, because one is the main process which manages the others.
There are some drawbacks:
You did not explain in depth what your code is doing, but if you are doing work which needs to be synchronized, you need to know that multiprocessing is NOT thread safe.
Spawning extra processes introduces I/O overhead, as data has to be shuffled around between processes.
Assuming the data is restricted to each process, it is possible to gain significant speedup; be aware of Amdahl's Law (see the sketch after this list).
If you reveal what your code does afterwards (save it to a file? a database? stdout?), it will be easier to give a better answer/direction. A few ideas come to mind, like immutable infrastructure with Bash or Java to handle synchronization, or whether it is a memory-bound issue and you need an object pool to process the JSON responses; it might even be a job for fault-tolerant Elixir.
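For reference, here is a small sketch of Amdahl's Law, which bounds the speedup you can get from parallelizing a fraction p of the work across n workers:
def amdahl_speedup(p, n):
    # p: parallelizable fraction of the work, n: number of workers
    return 1.0 / ((1.0 - p) + p / n)

# e.g. parallelizing 90% of the work across 4 processes caps the speedup at ~3.08x
print(amdahl_speedup(0.9, 4))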

How to download multiple files simultaneously and join them in Python?

I have some split files on a remote server.
I have tried downloading them one by one and joining them, but it takes a lot of time. I googled and found that simultaneous downloads might speed things up. The script is in Python.
My pseudocode is like this:
url1 = something
url2 = something
url3 = something
data1 = download(url1)
data2 = download(url2)
data3 = download(url3)
wait for all download to complete
join all data and save
Could anyone point me in a direction by which I can download all the files simultaneously and wait until they are done?
I have tried creating a class, but again I can't figure out how to wait until they all complete.
I am more interested in the Threading and Queue features, and I can import them on my platform.
I have tried Thread and Queue with an example found on this site; here is the code: pastebin.com/KkiMLTqR . But it does not wait, or it waits forever... not sure.
There are 2 ways to do things simultaneously. Or, really, 2-3/4 or so:
Multiple threads
Or multiple processes, especially if the "things" take a lot of CPU power
Or coroutines or greenlets, especially if there are thousands of "things"
Or pools of one of the above
Event loops (either coded manually or provided by a framework)
Or hybrid greenlet/event loop systems like gevent.
If you have 1000 URLs, you probably don't want to do 1000 requests at the same time. For example, web browsers typically only do something like 8 requests at a time. A pool is a nice way to do only 8 things at a time, so let's do that.
And, since you're only doing 8 things at a time, and those things are primarily I/O bound, threads are perfect.
I'll implement it with futures. (If you're using Python 2.x, or 3.0-3.1, you will need to install the backport, futures.)
import concurrent.futures

urls = ['http://example.com/foo',
        'http://example.com/bar']

with concurrent.futures.ThreadPoolExecutor(max_workers=8) as executor:
    result = b''.join(executor.map(download, urls))

with open('output_file', 'wb') as f:
    f.write(result)
Of course you need to write the download function, but that's exactly the same function you'd write if you were doing these one at a time.
For example, using urlopen (if you're using Python 2.x, use urllib2 instead of urllib.request):
import urllib.request

def download(url):
    with urllib.request.urlopen(url) as f:
        return f.read()
If you want to learn how to build a thread pool executor yourself, the source is actually pretty simple, and multiprocessing.pool is another nice example in the stdlib.
However, both of those have a lot of excess code (handling weak references to improve memory usage, shutting down cleanly, offering different ways of waiting on the results, propagating exceptions properly, etc.) that may get in your way.
If you look around PyPI and ActiveState, you will find simpler designs like threadpool that you may find easier to understand.
But here's the simplest joinable threadpool:
import queue
import threading

class ThreadPool(object):
    def __init__(self, max_workers):
        self.queue = queue.Queue()
        self.workers = [threading.Thread(target=self._worker) for _ in range(max_workers)]
    def start(self):
        for worker in self.workers:
            worker.start()
    def stop(self):
        for _ in range(len(self.workers)):
            self.queue.put(None)  # one sentinel per worker tells it to exit
        for worker in self.workers:
            worker.join()
    def submit(self, job):
        self.queue.put(job)
    def _worker(self):
        while True:
            job = self.queue.get()
            if job is None:
                break
            job()
Of course the downside of a dead-simple implementation is that it's not as friendly to use as concurrent.futures.ThreadPoolExecutor:
import functools
import threading
import urllib.request

urls = ['http://example.com/foo',
        'http://example.com/bar']
results = [None] * len(urls)
results_lock = threading.Lock()

def download(url, i):
    with urllib.request.urlopen(url) as f:
        result = f.read()
    with results_lock:
        results[i] = result

pool = ThreadPool(max_workers=8)
pool.start()
for i, url in enumerate(urls):
    pool.submit(functools.partial(download, url, i))
pool.stop()

result = b''.join(results)
with open('output_file', 'wb') as f:
    f.write(result)
You can use an async framework like twisted.
Alternatively, this is one thing that Python's threads are OK at, since you are mostly I/O bound.
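If you go the async route, here is a minimal sketch using asyncio with the third-party aiohttp package (my substitution for twisted, and the URLs are placeholders) that downloads the parts concurrently and joins them in order:
import asyncio
import aiohttp  # third-party: pip install aiohttp

async def download(session, url):
    async with session.get(url) as resp:
        return await resp.read()

async def main(urls):
    async with aiohttp.ClientSession() as session:
        # gather preserves the input order, so the parts join correctly
        parts = await asyncio.gather(*(download(session, u) for u in urls))
    with open('output_file', 'wb') as f:
        f.write(b''.join(parts))

asyncio.run(main(['http://example.com/foo', 'http://example.com/bar']))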
