How to run code in parallel with ThreadPoolExecutor? - python

Hi, I'm really new to threading and it's confusing me. How can I run this code in parallel?
import requests
from concurrent.futures import ThreadPoolExecutor

def search_posts(page):
    page_url = f'https://jsonplaceholder.typicode.com/posts/{page}'
    req = requests.get(page_url)
    res = req.json()
    title = res['title']
    return title

page = 1
while True:
    with ThreadPoolExecutor() as executer:
        t = executer.submit(search_posts, page)
        title = t.result()
    print(title)
    if page == 20:
        break
    page += 1
Another question: do I need to learn about operating systems in order to understand how threading works?

The problem here is that you are creating a new ThreadPoolExecutor for every page. To do things in parallel, create only one ThreadPoolExecutor and use its map method:
import concurrent.futures as cf
import requests

def search_posts(page):
    page_url = f'https://jsonplaceholder.typicode.com/posts/{page}'
    res = requests.get(page_url).json()
    return res['title']

if __name__ == '__main__':
    with cf.ThreadPoolExecutor() as ex:
        results = ex.map(search_posts, range(1, 21))
        for r in results:
            print(r)
Note that using the if __name__ == '__main__' wrapper is a good habit that makes your code more portable.
One thing to keep in mind when using threads:
If you are using CPython (the Python implementation from python.org which is the most common one), threads don't actually run in parallel.
To make memory management less complicated, only one thread at a time can be executing Python bytecode in CPython. This is enforced by the Global Interpreter Lock ("GIL") in CPython.
The good news is that using requests to get a web page will spend most of its time using network I/O. And in general, the GIL is released during I/O.
But if you are doing calculations in your worker functions (i.e. executing Python bytecode), you should use a ProcessPoolExecutor instead.
If you use a ProcessPoolExecutor and you are running on MS Windows, then the if __name__ == '__main__' wrapper is required, because in that case Python has to be able to import your main program without side effects.
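For comparison, here is a minimal sketch of the process-based variant; the crunch function below is a made-up CPU-bound stand-in, not something from the question:
import concurrent.futures as cf

def crunch(n):
    # hypothetical CPU-bound worker: pure Python bytecode, so threads would be
    # serialized by the GIL, while separate processes can run in parallel
    return sum(i * i for i in range(n))

if __name__ == '__main__':  # required for process pools on Windows
    with cf.ProcessPoolExecutor() as ex:
        for result in ex.map(crunch, [2_000_000] * 4):
            print(result)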

Related

Python Multithreading Rest API

I download data over a REST API and wrote a module for it. The download takes, let's say, 10 seconds. During this time, the rest of the script in 'main' and in the module does not run until the download is finished.
How can I change this, e.g. by processing it on another core?
I tried this code but it does not do the trick (same lag). Then I tried to implement this approach and it just gives me errors; I suspect map does not work with wget.download?
My code from the module:
from multiprocessing.dummy import Pool as ThreadPool
import urllib.parse

# define the needed data
function = 'TIME_SERIES_INTRADAY_EXTENDED'
symbol = 'IBM'
interval = '1min'
slice = 'year1month1'
adjusted = 'true'
apikey = key[0].rstrip()

# create URL
SCHEME = os.environ.get("API_SCHEME", "https")
NETLOC = os.environ.get("API_NETLOC", "www.alphavantage.co")  # query?
PATH = os.environ.get("API_PATH", "query")
query = urllib.parse.urlencode(dict(function=function, symbol=symbol, interval=interval, slice=slice, adjusted=adjusted, apikey=apikey))
url = urllib.parse.urlunsplit((SCHEME, NETLOC, PATH, query, ''))

# this is my original code to download the data (working but slow and stopping the rest of the script)
wget.download(url, 'C:\\Users\\x\\Desktop\\Tool\\RAWdata\\test.csv')

# this is my attempt to speed things up via multithreading from code
pool = ThreadPool(4)
if __name__ == '__main__':
    futures = []
    for x in range(1):
        futures.append(pool.apply_async(wget.download, [url, 'C:\\Users\\x\\Desktop\\Tool\\RAWdata\\test.csv']))
    # futures is now a list of 10 futures.
    for future in futures:
        print(future.get())
Any suggestions, or do you see the error I make?
OK, I figured it out, so I will leave it here in case someone else needs it.
I made a module called APIcall which has a function APIcall() that uses wget.download() to download my data.
In main, I create a function (called threaded_APIfunc) that calls the APIcall() function in my module APIcall:
import threading
import APIcall

def threaded_APIfunc():
    APIcall.APIcall(function, symbol, interval, slice, adjusted, apikey)
    print("Data Download complete for ${}".format(symbol))
Then I run threaded_APIfunc within a thread, like so:
threading.Thread(target=threaded_APIfunc).start()
print ('Start Downloading Data for ${}'.format(symbol))
What happens is that the .csv file gets downloaded in the background, while the main loop does not wait until the download is completed; it runs the code that comes after the threading call right away.
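If you ever do need to block until the download has finished (for example, right before reading the .csv), you can keep a reference to the thread and join it. A minimal sketch, reusing the threaded_APIfunc defined above:
import threading

t = threading.Thread(target=threaded_APIfunc)
t.start()

# ... other work runs here while the download continues in the background ...

t.join()  # block only at the point where the finished file is actually needed
print('Download finished, file is ready to use')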

Add multithreading or asynchronous to web scrape

What is the best way to implement multithreading to make a web scrape faster?
Would using Pool be a good solution? If so, where in my code would I implement it?
import requests
from multiprocessing import Pool

with open('testing.txt', 'w') as outfile:
    results = []
    for number in (4, 8, 5, 7, 3, 10):
        url = requests.get('https://www.google.com/' + str(number))
        response = (url)
        results.append(response.text)
        print(results)
    outfile.write("\n".join(results))
This can be moved to a pool easily. Python comes with process-based and thread-based pools. Which to use is a tradeoff: processes work better for parallelizing running code but are more expensive when passing results back to the main program. In your case, your code is mostly waiting on URLs and has a relatively large return object, so a thread pool makes sense.
I moved the code inside an if __name__ == '__main__' block, as needed on Windows machines.
import requests
from multiprocessing import Pool
from multiprocessing.pool import ThreadPool

def worker(number):
    url = requests.get('https://www.google.com/' + str(number))
    return url.text

# put some sort of cap on outstanding requests...
MAX_URL_REQUESTS = 10

if __name__ == "__main__":
    numbers = (4, 8, 5, 7, 3, 10)
    with ThreadPool(min(len(numbers), MAX_URL_REQUESTS)) as pool:
        with open('testing.txt', 'w') as outfile:
            for result in pool.map(worker, numbers, chunksize=1):
                outfile.write(result)
                outfile.write('\n')

Handling multiple HTTP requests in Python

I am mining data from a website through data scraping in Python. I am using the requests package to send the parameters.
Here is the code snippet in Python:
for param in paramList:
    data = get_url_data(param)

def get_url_data(param):
    post_data = get_post_data(param)
    headers = {}
    headers["Content-Type"] = "text/xml; charset=UTF-8"
    headers["Content-Length"] = len(post_data)
    headers["Connection"] = 'Keep-Alive'
    headers["Cache-Control"] = 'no-cache'
    page = requests.post(url, data=post_data, headers=headers, timeout=10)
    data = parse_page(page.content)
    return data
The variable paramList is a list of more than 1000 elements, and the endpoint url remains the same. I was wondering if there is a better and faster way to do this?
Thanks
As there is a significant amount of network I/O involved, threading should improve the overall performance significantly.
You can try using a ThreadPool; test and tweak the number of threads to find the one that best suits the situation and shows the highest overall performance (a rough timing sketch follows the code below).
from multiprocessing.pool import ThreadPool

# Remove the 'for param in paramList' iteration
def get_url_data(param):
    # Rest of code here
    ...

if __name__ == '__main__':
    pool = ThreadPool(15)
    pool.map(get_url_data, paramList)  # Will split the load between the threads nicely
    pool.close()
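Since the best thread count depends on the workload and the machine, one quick way to tune it is to time the same batch with a few pool sizes. A rough sketch, with fetch_one standing in for the real get_url_data:
import time
from multiprocessing.pool import ThreadPool

def fetch_one(param):
    time.sleep(0.1)  # stand-in for the real request; replace with get_url_data

params = list(range(200))
for nthreads in (5, 15, 50):
    start = time.time()
    with ThreadPool(nthreads) as pool:
        pool.map(fetch_one, params)
    print(nthreads, 'threads:', round(time.time() - start, 2), 's')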
I need to make 1000 POST requests to the same domain; I was wondering if
there is a better and faster way to do this?
It depends. If it's a static asset, or a servlet whose behavior you know and where the same parameters return the same response each time, you can implement an LRU cache or some other caching mechanism. If not, 1K POST requests to some servlet don't matter, even if they share the same domain.
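For the cacheable case, a minimal sketch using functools.lru_cache; the fetch function and the URL are made up for illustration:
import functools
import requests

@functools.lru_cache(maxsize=1024)
def fetch(param):
    # identical params return the cached body instead of hitting the server again
    return requests.post('https://example.com/servlet', data={'q': param}, timeout=10).text

first = fetch('abc')   # network round trip
second = fetch('abc')  # served from the cache, no request sent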
There is an answer using multiprocessing's ThreadPool interface, which actually uses the main process with 15 threads. Does it run on a 15-core machine? A core can only run one thread at a time (hyper-threaded ones aside; does it run on 8 hyper-threaded cores?).
The ThreadPool interface lives inside a library with a misleading name, multiprocessing, while Python also has a threading module; this is confusing as f#ck. Let's benchmark some lower-level code:
import psutil
from multiprocessing.pool import ThreadPool
from time import sleep

def get_url_data(param):
    print(param)  # just for convenience
    sleep(1)  # lets assume it will take one second each time

if __name__ == '__main__':
    paramList = [i for i in range(100)]  # 100 urls
    pool = ThreadPool(psutil.cpu_count())  # each core can run one thread (hyper.. not now)
    pool.map(get_url_data, paramList)  # splitting the jobs
    pool.close()
The code above will use the main process with 4 threads in my case, because my laptop has 4 CPUs. Benchmark result:
$ time python3_5_2 test.py
real 0m28.127s
user 0m0.080s
sys 0m0.012s
Let's try spawning processes with multiprocessing:
import psutil
import multiprocessing
from time import sleep
import numpy

def get_url_data(urls):
    for url in urls:
        print(url)
        sleep(1)  # lets assume it will take one second each time

if __name__ == "__main__":
    jobs = []
    # Split URLs into chunks as number of CPUs
    chunks = numpy.array_split(range(100), psutil.cpu_count())
    # Pass each chunk into its own process
    for url_chunk in chunks:
        jobs.append(multiprocessing.Process(target=get_url_data, args=(url_chunk,)))
    # Start the processes
    for j in jobs:
        j.start()
    # Ensure all of the processes have finished
    for j in jobs:
        j.join()
Benchmark result: about 3 seconds less.
$ time python3_5_2 test2.py
real 0m25.208s
user 0m0.260s
sys 0m0.216s
If you execute ps -aux | grep "test.py" you will see 5 processes, because one is the main process which manages the others.
There are some drawbacks:
You did not explain in depth what your code is doing, but if you are doing work that needs to be synchronized, you need to know that multiprocessing is NOT thread-safe.
Spawning extra processes introduces I/O overhead, as data has to be shuffled around between processes.
Assuming the data is restricted to each process, it is possible to gain a significant speedup; be aware of Amdahl's Law.
If you reveal what your code does afterwards (save it into a file? a database? stdout?), it will be easier to give a better answer/direction. A few ideas come to mind: immutable infrastructure with Bash or Java to handle synchronization, or, if it is a memory-bound issue, an object pool to process the JSON responses... it might even be a job for fault-tolerant Elixir.

Is there a way to run CPython on a different thread without risking a crash?

I have a program that runs lots of urllib requests IN AN INFINITE LOOP, which makes my program really slow, so I tried running them as threads. Urllib uses CPython deep down in the socket module, so the threads that are being created just add up and do nothing, because Python's GIL prevents two CPython commands from being executed in different threads at the same time. I am running Windows XP with Python 2.5, so I can't use the multiprocess module. I tried looking at the subprocess module to see if there was a way to execute Python code in a subprocess somehow, but found nothing. If anyone has a way that I can create a Python subprocess through a function, like in multiprocess, that would be great.
Also, I would rather not download an external module, but I am willing to.
EDIT: Here is a sample of some code in my current program.
url = "http://example.com/upload_image.php?username=Test&password=test"
url = urllib.urlopen(url, data=urllib.urlencode({"Image": raw_image_data})).read()
if url.strip().replace("\n", "") != "":
    print url
I did a test and it turns out that urllib2's urlopen, with the Request object and without, is still just as slow or slower. I created my own custom timeit-like module, and the above takes around 0.5-2 seconds, which is horrible for what my program does.
Urllib uses CPython deep down in the socket module, so the threads
that are being created just add up and do nothing, because Python's GIL
prevents two CPython commands from being executed in different
threads at the same time.
Wrong, though it is a common misconception. CPython can and does release the GIL for I/O operations (look at all the Py_BEGIN_ALLOW_THREADS calls in socketmodule.c). While one thread waits for I/O to complete, other threads can do some work. If urllib calls are the bottleneck in your script, then threads may be one of the acceptable solutions.
I am running Windows XP with Python 2.5, so I can't use the multiprocess module.
You could install Python 2.6 or newer, or, if you must use Python 2.5, you could install multiprocessing separately.
I created my own custom timeit like module and the above takes around 0.5-2 seconds, which is horrible for what my program does.
The performance of urllib2.urlopen('http://example.com...').read() depends mostly on outside factors such as DNS, network latency/bandwidth, and the performance of the example.com server itself.
Here's an example script which uses both threading and urllib2:
import urllib2
from Queue import Queue
from threading import Thread

def check(queue):
    """Check /n url."""
    opener = urllib2.build_opener()  # if you use install_opener in other threads
    for n in iter(queue.get, None):
        try:
            data = opener.open('http://localhost:8888/%d' % (n,)).read()
        except IOError, e:
            print("error /%d reason %s" % (n, e))
        else:
            "check data here"

def main():
    nurls, nthreads = 10000, 10
    # spawn threads
    queue = Queue()
    threads = [Thread(target=check, args=(queue,)) for _ in xrange(nthreads)]
    for t in threads:
        t.daemon = True  # die if program exits
        t.start()
    # provide some work
    for n in xrange(nurls): queue.put_nowait(n)
    # signal the end
    for _ in threads: queue.put(None)
    # wait for completion
    for t in threads: t.join()

if __name__ == "__main__":
    main()
To convert it to a multiprocessing script just use different imports and your program will use multiple processes:
from multiprocessing import Queue
from multiprocessing import Process as Thread
# the rest of the script is the same
If you want multithreading, Jython could be an option, as it doesn't have a GIL.
I concur with #Jan-Philip and #Piotr. What are you using urllib for?

Python 2.5 - multi-threaded for loop

I've got a piece of code:
for url in get_lines(file):
    visit(url, timeout=timeout)
It gets URLs from a file and visits each of them (via urllib2) in a for loop.
Is it possible to do this in a few threads? For example, 10 visits at the same time.
I've tried:
for url in get_lines(file):
    Thread(target=visit, args=(url,), kwargs={"timeout": timeout}).start()
But it does not seem to work; there is no effect, and the URLs are visited as before.
A simplified version of the visit function:
def visit(url, proxy_addr=None, timeout=30):
    (...)
    request = urllib2.Request(url)
    response = urllib2.urlopen(request)
    return response.read()
To expand on senderle's answer, you can use the Pool class in multiprocessing to do this easily:
from multiprocessing import Pool
pool = Pool(processes=5)
pages = pool.map(visit, get_lines(file))
When the map function returns, "pages" will be a list of the contents of the URLs. You can adjust the number of processes to whatever is suitable for your system.
I suspect that you've run into the Global Interpreter Lock. Basically, threading in python can't achieve concurrency, which seems to be your goal. You need to use multiprocessing instead.
multiprocessing is designed to have a roughly analogous interface to threading, but it has a few quirks. Your visit function as written above should work correctly, I believe, because it's written in a functional style, without side effects.
In multiprocessing, the Process class is the equivalent of the Thread class in threading. It has all the same methods, so it's a drop-in replacement in this case. (Though I suppose you could use pool as JoeZuntz suggests -- but I would test with the basic Process class first, to see if it fixes the problem.)
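A minimal sketch of that drop-in replacement, keeping the loop shape from the question (get_lines, visit, file, and timeout are the names used there):
from multiprocessing import Process

processes = []
for url in get_lines(file):
    p = Process(target=visit, args=(url,), kwargs={"timeout": timeout})
    p.start()
    processes.append(p)

# wait for all visits to finish (on Windows, guard this loop with if __name__ == '__main__')
for p in processes:
    p.join()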
