Easiest (least amount of boilerplate code) way to parallelize a python loop? - python

I have some code that looks like this:
for photo in photoInfo:
    if not('url' in photo):
        raise Exception("Missing URL: " + str(photo) + " in " + str(photoInfo))
    sizes = getImageSizes(photo['url'])
    photo.update(sizes)
It might not be obvious, but the code performs a mix of high-latency I/O (opening a remote URL) and moderately CPU-intensive processing (parsing the image and extracting its size) for each photo.
What's the easiest way to parallelize this code?
What I have tried so far
I found this code in the answer to another, more complex question, but I'm having a hard time mapping it back to my much simpler use-case:
from itertools import product
from multiprocessing import Pool
with Pool(processes=4) as pool:  # assuming Python 3
    pool.starmap(print, product(range(2), range(3), range(4)))

from multiprocessing import Pool
import os

def user_defined_function(url):
    # your logic for a single url
    pass

if __name__ == '__main__':
    urls_list = ['u1', 'u2']
    pool = Pool(os.cpu_count())  # create a multiprocessing pool
    pool.map(user_defined_function, urls_list)
This is sample code; you can modify it for your use case. It will map each element of the list to your function and run each call in a separate worker process.
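One small caveat, not from the answer above: the pool is never closed. On Python 3 you can use it as a context manager so the workers are cleaned up automatically, for example:

if __name__ == '__main__':
    urls_list = ['u1', 'u2']
    with Pool(os.cpu_count()) as pool:  # the pool is terminated automatically when the block exits
        results = pool.map(user_defined_function, urls_list)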

You can use Pool.map to parallelize the fetching of image sizes, then merge the returned sizes back into the corresponding photo dicts:
from multiprocessing import Pool

def get_image_size(photo):
    if 'url' not in photo:
        raise Exception("Missing URL: " + str(photo))
    return getImageSizes(photo['url'])

if __name__ == '__main__':
    with Pool() as pool:
        all_sizes = pool.map(get_image_size, photoInfo)
    for photo, sizes in zip(photoInfo, all_sizes):
        photo.update(sizes)
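Since the loop is dominated by network I/O, a plain thread pool is often enough and avoids pickling the photo dicts between processes. A minimal sketch of the same idea with concurrent.futures, reusing get_image_size from above:

from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(max_workers=8) as executor:
    # executor.map preserves input order, so the results line up with photoInfo
    for photo, sizes in zip(photoInfo, executor.map(get_image_size, photoInfo)):
        photo.update(sizes)

max_workers=8 is an arbitrary choice here; tune it to your bandwidth and the amount of CPU work inside getImageSizes.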

Related

Using multiprocessing to speed up website requests in Python

I am using the requests module to download the content of many websites, which looks something like this:
import requests
for i in range(1000):
    url = base_url + f"{i}.anything"
    r = requests.get(url)
Of course this is simplified, but basically the base URL is always the same; I only want to download an image, for example.
This takes very long due to the number of iterations. The internet connection is not the problem, but rather the time it takes to start each request, etc.
So I was thinking about something like multiprocessing, because this task is basically always the same, and I imagine it would be a lot faster when multiprocessed.
Is this somehow doable?
Thanks in advance!
I would suggest that in this case lightweight threads would be better. When I ran the request against a certain URL 5 times, the results were:
Threads: Finished in 0.24 second(s)
MultiProcess: Finished in 0.77 second(s)
Your implementation can be something like this:
import concurrent.futures
import requests
from bs4 import BeautifulSoup
import time

def access_url(url, No):
    print(f"{No}:==> {url}")
    response = requests.get(url)
    soup = BeautifulSoup(response.text, features='lxml')
    return "{} : {}".format(No, str(soup.title)[7:50])

if __name__ == "__main__":
    test_url = "http://bla bla.com/"
    base_url = test_url
    THREAD_MULTI_PROCESSING = True
    start = time.perf_counter()  # start the timer
    url_list = [base_url for i in range(5)]  # parameters as lists so map() can be used
    url_counter = [i for i in range(5)]      # parameters as lists so map() can be used
    if THREAD_MULTI_PROCESSING:
        with concurrent.futures.ThreadPoolExecutor() as executor:  # in this case threads work better
            results = executor.map(access_url, url_list, url_counter)
            for result in results:
                print(result)
    end = time.perf_counter()  # stop the timer
    print(f'Threads: Finished in {round(end - start, 2)} second(s)')
    start = time.perf_counter()
    PROCESS_MULTI_PROCESSING = True
    if PROCESS_MULTI_PROCESSING:
        with concurrent.futures.ProcessPoolExecutor() as executor:
            results = executor.map(access_url, url_list, url_counter)
            for result in results:
                print(result)
    end = time.perf_counter()
    print(f'MultiProcess: Finished in {round(end - start, 2)} second(s)')
I think you will see better performance in your case.
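A further tweak, not part of the answer above: since the question notes that the time to start each request is the bottleneck, reusing a requests.Session lets the underlying TCP connection be reused between requests to the same host. A rough sketch under the question's URL scheme (sharing one Session across threads is a common pattern, though requests does not formally guarantee thread safety):

import concurrent.futures
import requests

session = requests.Session()  # keeps connections alive between requests

def fetch(i):
    url = base_url + f"{i}.anything"  # URL pattern taken from the question
    return session.get(url).status_code

with concurrent.futures.ThreadPoolExecutor(max_workers=16) as executor:
    status_codes = list(executor.map(fetch, range(1000)))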

Multicore processing on scraper function

I was hoping to speed up my scraper by using multiple cores, so that several cores could scrape URLs from a list I have using a predefined function scrape. How would I do this?
Here is my current code:
for x in URLs['identifier'][1:365]:
    test = scrape(x)
    results = test.get_results
    results['identifier'] = x
    final = final.append(results)
Something like this (or you can also use Scrapy). It will easily allow you to make a lot of requests in parallel, provided the server can handle it as well:
# it's just a wrapper around concurrent.futures ThreadPoolExecutor with a nice tqdm progress bar!
from tqdm.contrib.concurrent import thread_map, process_map  # for multi-threading / multi-processing respectively

def chunk_list(lst, size):
    # yield successive chunks of the list, `size` items at a time
    for i in range(0, len(lst), size):
        yield lst[i:i + size]

for idx, my_chunk in enumerate(chunk_list(huge_list, size=2**12)):
    for response in thread_map(<which_func_to_call>, my_chunk, max_workers=your_cpu_cores + 6):
        # which_func_to_call -> wrap the returned response json obj in this, etc.
        # do something with the response now..
        # make sure to cache the chunk results as well (in case you have a lot of them)
        pass
OR
Using Pool from the multiprocessing module in Python:
from multiprocessing import Pool
import requests
from bs4 import BeautifulSoup

base_url = 'http://quotes.toscrape.com/page/'
all_urls = list()

def generate_urls():
    # better to yield them as well if you already have the URL list, etc.
    for i in range(1, 11):
        all_urls.append(base_url + str(i))

def scrape(url):
    res = requests.get(url)
    print(res.status_code, res.url)

generate_urls()

p = Pool(10)
p.map(scrape, all_urls)
p.close()  # let the workers exit cleanly before joining
p.join()
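Tying this back to the question's loop, a hedged sketch of a threaded version (scrape, get_results, and URLs are assumed to behave as in the question; max_workers is an arbitrary starting point):

from concurrent.futures import ThreadPoolExecutor
import pandas as pd

def scrape_one(identifier):
    results = scrape(identifier).get_results  # as in the question's loop
    results['identifier'] = identifier
    return results

with ThreadPoolExecutor(max_workers=8) as executor:
    all_results = list(executor.map(scrape_one, URLs['identifier'][1:365]))

# combine once at the end instead of appending inside the loop
final = pd.DataFrame(all_results)  # or pd.concat(all_results) if each result is already a DataFrame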

Splitting a task for multiple threads

I am working on a personal project: a brute-forcing program in Python. I have already made it, but now I want to make it faster by adding some threads. The problem is that the program has a for loop which repeats for every user/password pair, so if I just create some threads and join the main process to them, each thread does nothing but repeat the same user/password pairs. That's not what I want: I want every thread to brute-force a different user/password pair. Is there any way to tell a thread "grab this user/password and not that one, because that one is being used by another thread"?
Thanks.
Here is the code:
import requests as r

user_list = ['a', 'b', 'c', 'd']
pass_list = ['e', 'f', 'g', 'h']

def main_part():
    for user, pwd in zip(user_list, pass_list):
        action_url = 'https:example.com'
        payload = {'user_email': user, 'password': pwd}
        req = r.post(action_url, data=payload)
        print(req.content)
You can use multiprocessing to do what you want. You just need to define a function which handles a single user:
def brute_force_user(user, pwd):
    action_url = 'https:example.com'
    payload = {'user_email': user, 'password': pwd}
    req = r.post(action_url, data=payload)
    print(req.content)
then run it like this:
import multiprocessing
import os

pool = multiprocessing.Pool(os.cpu_count() - 1)
results = pool.starmap(brute_force_user, zip(user_list, pass_list))

How to use multiprocessing in a for loop - python

I have a script that uses Python mechanize to brute-force an HTML form. It is a for loop that checks every password from "PassList" and runs until it matches the current password by checking the redirected URL. How can I implement multiprocessing here?
for x in PasswordList:
    br.form['password'] = ''.join(x)
    print "Bruteforce in progress.. checking : ", br.form['password']
    response = br.submit()
    if response.geturl() == "http://192.168.1.106/success.html":
        # url to which the page is redirected after login
        print "\n Correct password is ", ''.join(x)
        break
I do hope this is not for malicious purposes.
I've never used python mechanize, but seeing as you have no answers I can share what I know, and you can modify it accordingly.
In general, it needs to be its own function, which you then call pool over. I don't know about your br object, but I would probably recommend having many of those objects to prevent any clashing. (You can try with the same br object though; modify the code accordingly.)
list_of_br_and_passwords = [[br_obj, 'password1'], [br_obj, 'password2'], ...]

from multiprocessing import Pool
from multiprocessing import cpu_count

def crackPassword(lst):
    br_obj = lst[0]
    password = lst[1]
    br_obj.form['password'] = ''.join(password)
    print "Bruteforce in progress.. checking : ", br_obj.form['password']
    response = br_obj.submit()

pool = Pool(cpu_count() * 2)
crack_password = pool.map(crackPassword, list_of_br_and_passwords)
pool.close()
Once again, this is not a full answer, just a general guideline on how to do multiprocessing
from multiprocessing import Pool

def process_bruteforce(password):
    <process>

if __name__ == '__main__':
    pool = Pool(processes=4)  # one process per core
    is_connected = pool.map(process_bruteforce, PasswordList)
I would try something like that

Asynchronously get and store images in python

The following code is a sample of non-asynchronous code, is there any way to get the images asynchronously?
import urllib
for x in range(0, 10):
    urllib.urlretrieve("http://test.com/file %s.png" % (x), "temp/file %s.png" % (x))
I have also seen the Grequests library, but I couldn't figure out from the documentation whether that is possible or how to do it.
You don't need any third party library. Just create a thread for every request, start the threads, and then wait for all of them to finish in the background, or continue your application while the images are being downloaded.
import threading
import urllib

results = []

def getter(url, dest):
    results.append(urllib.urlretrieve(url, dest))

threads = []
for x in range(0, 10):
    t = threading.Thread(target=getter, args=('http://test.com/file %s.png' % x,
                                              'temp/file %s.png' % x))
    t.start()
    threads.append(t)

# Wait for all threads to finish.
# You can continue doing whatever you want and
# join the threads when you finally need the results.
# They will fetch your urls in the background without
# blocking your main application.
for t in threads:
    t.join()
Optionally, you can create a thread pool that gets urls and dests from a queue.
If you're using Python 3, this is already implemented for you in the concurrent.futures module.
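For example, a minimal Python 3 sketch of the same downloads with concurrent.futures (using urllib.request, which is where urlretrieve lives in Python 3; the URL pattern is the one from the question):

import concurrent.futures
import urllib.request

def getter(x):
    # download one image to the temp/ directory, as in the question
    return urllib.request.urlretrieve("http://test.com/file %s.png" % x,
                                      "temp/file %s.png" % x)

with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
    futures = [executor.submit(getter, x) for x in range(10)]
    for future in concurrent.futures.as_completed(futures):
        future.result()  # re-raises any download error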
Something like this should help you
import grequests

urls = ['url1', 'url2', ....]  # this should be the list of urls
requests = (grequests.get(u) for u in urls)
responses = grequests.map(requests)

for response in responses:
    if 199 < response.status_code < 400:
        name = generate_file_name()  # generate some name for your image file, with an extension like example.jpg
        with open(name, 'wb') as f:  # or save to S3 or something like that
            f.write(response.content)
Here only the downloading of the images is parallel; writing each image's content to a file is still sequential, so you could create a thread or do something else to make that part parallel or asynchronous as well.
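For instance, a small sketch of offloading the file writes to a thread pool once the responses are in (generate_file_name is the hypothetical helper from the snippet above):

from concurrent.futures import ThreadPoolExecutor

def save_response(response):
    name = generate_file_name()  # hypothetical helper, as above
    with open(name, 'wb') as f:
        f.write(response.content)

with ThreadPoolExecutor() as executor:
    list(executor.map(save_response, responses))  # consume the iterator so exceptions surface

Whether this actually helps depends on your disk; for plain local writes the sequential loop is often fine.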
