I am mining data from a website by scraping it in Python. I am using the requests package to send the parameters.
Here is the code snippet in Python:
for param in paramList:
    data = get_url_data(param)
def get_url_data(param):
    post_data = get_post_data(param)
    headers = {}
    headers["Content-Type"] = "text/xml; charset=UTF-8"
    headers["Content-Length"] = len(post_data)
    headers["Connection"] = 'Keep-Alive'
    headers["Cache-Control"] = 'no-cache'
    page = requests.post(url, data=post_data, headers=headers, timeout=10)
    data = parse_page(page.content)
    return data
The variable paramList is a list of more than 1000 elements, and the endpoint url remains the same. Is there a better and faster way to do this?
Thanks
As there is a significant amount of networking I/O involved, threading should improve the overall performance significantly.
You can try using a ThreadPool. Test and tweak the number of threads to find the value that best suits your situation and gives the highest overall performance.
from multiprocessing.pool import ThreadPool
# Remove 'for param in paramList' iteration
def get_url_data(param):
    ... # Rest of code here
if __name__ == '__main__':
    pool = ThreadPool(15)
    results = pool.map(get_url_data, paramList) # Will split the load between the threads nicely
    pool.close()
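One way to pick the thread count empirically is to time the same batch with several pool sizes. The sketch below is only an illustration: fake_request and time_pool are hypothetical names, and the 0.05 second sleep stands in for real network latency, so the numbers it prints say nothing about the actual endpoint.

```python
import time
from multiprocessing.pool import ThreadPool


def fake_request(param):
    """Hypothetical stand-in for get_url_data: sleeps to imitate network latency."""
    time.sleep(0.05)
    return param * 2


def time_pool(pool_size, params):
    """Run all params through a ThreadPool of the given size; return (seconds, results)."""
    start = time.perf_counter()
    with ThreadPool(pool_size) as pool:
        results = pool.map(fake_request, params)
    return time.perf_counter() - start, results


if __name__ == "__main__":
    params = list(range(40))
    for size in (1, 5, 10, 20):
        elapsed, _ = time_pool(size, params)
        print(f"{size:>2} threads: {elapsed:.2f}s")
```

With a real endpoint the curve usually flattens at some point (server limits, bandwidth), which is the pool size worth keeping.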
I need to make 1000 POST requests to the same domain. I was wondering if there is a better and faster way to do this?
It depends. If it is a static asset, or a servlet whose behavior you know, and the same parameters return the same response each time, you can implement an LRU cache or some other caching mechanism. If not, the fact that the 1K POST requests go to the same domain does not help.
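A minimal sketch of the caching idea, assuming the response depends only on the parameter and the parameter is hashable. fetch is a hypothetical stand-in for the real POST call, not part of the original code.

```python
from functools import lru_cache


@lru_cache(maxsize=1024)
def fetch(param):
    """Hypothetical stand-in for the real POST; the body only runs on a cache miss."""
    return f"response for {param}"


if __name__ == "__main__":
    for p in ["a", "b", "a", "a", "b"]:
        fetch(p)
    info = fetch.cache_info()
    print(info.hits, info.misses)  # 3 hits, 2 misses: one real call per distinct parameter
```

This only pays off when parameters repeat; with 1000 distinct parameters every call is a miss.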
There is an answer that uses the ThreadPool interface from multiprocessing, which actually runs 15 threads inside the main process. Does it run on a 15-core machine? A core can only run one thread at a time (except hyper-threaded ones; does it run on 8 hyper-threaded cores?).
The ThreadPool interface lives inside a library with the misleading name multiprocessing, even though Python also has a threading module, which is confusing. Let's benchmark some lower-level code:
import psutil
from multiprocessing.pool import ThreadPool
from time import sleep
def get_url_data(param):
    print(param) # just for convenience
    sleep(1) # lets assume it will take one second each time
if __name__ == '__main__':
    paramList = [i for i in range(100)] # 100 urls
    pool = ThreadPool(psutil.cpu_count()) # each core can run one thread (hyper.. not now)
    pool.map(get_url_data, paramList) # splitting the jobs
    pool.close()
The code above will use the main process with 4 threads in my case, because my laptop has 4 CPUs. Benchmark result:
$ time python3_5_2 test.py
real 0m28.127s
user 0m0.080s
sys 0m0.012s
Let's try spawning processes with multiprocessing:
import psutil
import multiprocessing
from time import sleep
import numpy
def get_url_data(urls):
    for url in urls:
        print(url)
        sleep(1) # lets assume it will take one second each time
if __name__ == "__main__":
    jobs = []
    # Split URLs into chunks as number of CPUs
    chunks = numpy.array_split(range(100), psutil.cpu_count())
    # Pass each chunk into process
    for url_chunk in chunks:
        jobs.append(multiprocessing.Process(target=get_url_data, args=(url_chunk, )))
    # Start the processes
    for j in jobs:
        j.start()
    # Ensure all of the processes have finished
    for j in jobs:
        j.join()
Benchmark result: about 3 seconds less
$ time python3_5_2 test2.py
real 0m25.208s
user 0m0.260s
sys 0m0.216s
If you execute ps -aux | grep "test2.py" you will see 5 processes, because one is the main process, which manages the others.
There are some drawbacks:
You did not explain in depth what your code is doing, but if you are doing work that needs to be synchronized, you need to know that multiprocessing is NOT thread safe.
Spawning extra processes introduces I/O overhead, as data has to be shuffled around between processors.
Assuming the data is restricted to each process, it is possible to gain a significant speedup, but be aware of Amdahl's Law.
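To put a number on the Amdahl's Law caveat: if a fraction p of the work can run in parallel on n workers, the speedup is at most 1 / ((1 - p) + p / n). The helper below is an illustrative sketch (amdahl_speedup is a made-up name), and the 90% figure is an assumption, not a measurement of the code above.

```python
def amdahl_speedup(parallel_fraction, workers):
    """Upper bound on speedup when only part of the work can be parallelized."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


if __name__ == "__main__":
    # Assume 90% of the run time is parallelizable.
    for n in (1, 4, 16, 1000):
        print(n, round(amdahl_speedup(0.9, n), 2))
```

Even with 1000 workers the speedup in this example stays below 10x, because the remaining 10% of serial work dominates.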
If you reveal what your code does afterwards (save to a file? a database? stdout?) it will be easier to give a better answer or direction. A few ideas come to mind, like immutable infrastructure with Bash or Java to handle synchronization; or it may be a memory-bound issue where you need an object pool to process the JSON responses. It might even be a job for fault-tolerant Elixir.
I have a question about running a function in parallel in Python.
I have tried using multiprocessing to reduce the time spent sending and receiving data from an API, but when I execute the code below, it tends to crash my IDE.
def network_request_function(value):
    #this function sends requests using value.
    ...
for i in list:
    p1 = Process(target=network_request_function, args=(i,))
    p1.start()
Can you provide a way to fix my code?
Or are there better alternatives?
You should specify what platform this is running on and what your IDE is. Also, if all network_request_function does is make a network request and await a reply that needs no further CPU-intensive processing, then this should use multithreading instead of multiprocessing. A multithreading pool lets you limit the number of concurrent threads in case your input list is very large, and makes it simpler to get back any return value from network_request_function that you might be interested in. Finally, you should not name a variable list, since that is the name of a built-in class.
For example:
def network_request_function(value):
    #this function sends requests using value and returns the reply
    return reply
if __name__ == '__main__': # Required if we switch to multiprocessing
    # To use multiprocessing:
    #from multiprocessing.pool import Pool as Executor
    # To use multithreading:
    from multiprocessing.pool import ThreadPool as Executor
    # inputs is our list of value arguments used with network_request_function:
    inputs = [] # This variable is set somewhere
    # May need to be a smaller number if we are using multiprocessing and
    # depending on the platform:
    MAX_POOL_SIZE = 200
    pool_size = min(len(inputs), MAX_POOL_SIZE)
    with Executor(pool_size) as pool:
        # Get list of all replies:
        replies = pool.map(network_request_function, inputs)
I am trying to use Python's multiprocessing library to quickly run a function using the 8 processing cores I have on a Linux VM I created. As a test, I am measuring the time in seconds it takes a worker pool with 4 processes to run a function, and the time it takes to run the same function without a worker pool. The times come out about the same; in some cases the worker pool takes much longer than running without it.
Script
import requests
import datetime
import multiprocessing as mp
shared_results = []
def stress_test_url(url):
    print('Starting Stress Test')
    count = 0
    while count <= 200:
        response = requests.get(url)
        shared_results.append(response.status_code)
        count += 1
pool = mp.Pool(processes=4)
now = datetime.datetime.now()
results = pool.apply(stress_test_url, args=(url,))
diff = (datetime.datetime.now() - now).total_seconds()
now = datetime.datetime.now()
results = stress_test_url(url)
diff2 = (datetime.datetime.now() - now).total_seconds()
print(diff)
print(diff2)
Terminal Output
Starting Stress Test
Starting Stress Test
44.316212
41.874116
The apply function of multiprocessing.Pool simply runs a function in a separate process and waits for its result. It takes a little longer than running sequentially, because it needs to pack the job to be processed and ship it to the child process via a pipe.
multiprocessing doesn't make sequential operations faster; it simply allows them to run in parallel if your hardware has more than one core.
Just try this:
urls = ["http://google.com",
        "http://example.com",
        "http://stackoverflow.com",
        "http://python.org"]
results = pool.map(stress_test_url, urls)
You will see that the 4 URLs get visited seemingly at the same time. This means your logic reduces the time necessary to visit N websites from that of N sequential visits to roughly that of N / processes sequential visits.
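The difference between apply and map can be sketched without touching the network. This example uses multiprocessing.pool.ThreadPool, which offers the same apply/map interface as multiprocessing.Pool, purely so it runs anywhere; slow_task, run_with_apply, run_with_map and timed are hypothetical names, and the 0.1 second sleep stands in for one HTTP request.

```python
import time
from multiprocessing.pool import ThreadPool


def slow_task(x):
    """Stand-in for one unit of work such as a single HTTP request."""
    time.sleep(0.1)
    return x * x


def run_with_apply(pool, items):
    # apply() blocks until its single call finishes, so the calls run one by one.
    return [pool.apply(slow_task, args=(x,)) for x in items]


def run_with_map(pool, items):
    # map() hands the items to all workers, so several calls overlap.
    return pool.map(slow_task, items)


def timed(fn, *args):
    """Return (elapsed seconds, result) for one call of fn."""
    start = time.perf_counter()
    result = fn(*args)
    return time.perf_counter() - start, result


if __name__ == "__main__":
    with ThreadPool(processes=4) as pool:
        print("apply: %.2fs" % timed(run_with_apply, pool, range(4))[0])
        print("map:   %.2fs" % timed(run_with_map, pool, range(4))[0])
```

Both variants return the same values; only map lets the four workers overlap their waiting time.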
Lastly, benchmarking a function which performs an HTTP request is a very poor way to measure performance as networks are unreliable. You will hardly get two executions which take the same amount of time no matter whether you use multiprocessing or not.
First question on Stack Overflow, so please bear with me. I am looking to calculate the variance for group ratings (long numpy arrays). Running the program without parallel processing works fine, but given that each process can run independently and there are 32 groups, I am looking to make use of multiprocessing to speed things up. This works OK for small numbers of groups (< 10), but beyond that the program will often just seemingly stop running, with no error messages, at an unspecified number of groups (usually between 20 and 30), although less frequently it will run all the way through. The arrays are quite large (21451 x 11462 user-item ratings), so I am wondering if the problem is caused by not enough memory, although no error messages are printed.
import numpy as np
from functools import partial
import multiprocessing
def variance_parallel(extra_matrices, group_num):
    # do some variation calculation
    # print confirmation that we have entered function, and group number
    return single_group_var
def variance(extra_matrices, num_groups):
    variance_partial = partial(variance_parallel, extra_matrices)
    for g in list(range(num_groups)):
        group_var = pool.map(variance_partial,range(g))
    return(group_var)
num_cores = multiprocessing.cpu_count() - 1
pool = multiprocessing.Pool(processes=num_cores)
variance(extra_matrices, num_groups)
Running the above code shows the program progressively building the number of groups it is checking variance on ([0],[0,1],[0,1,2],...) before eventually printing nothing.
Thanks in advance for any help and apologies if my formatting / question is a bit off!
Multiple processes do not share data
Data sent to processes needs to be copied
Since the arrays are large, the issue very likely has to do with copying those large arrays to the processes. Furthermore, in Python's multiprocessing, data is sent to processes by serialisation, which (a) is CPU intensive and (b) takes extra memory in and of itself.
In short, multiprocessing is not a good fit for your use case. Since numpy is a native code extension (much of which runs without holding the GIL) and is thread safe, it is best to use threading instead of multiprocessing. With threading, the worker threads can share data via their parent process's address space, which does away with having to copy.
That should stop the program from running out of memory.
However, for the threads to work on the same data, each of them needs a reference to it, for example through an attribute of an object such as an instance of a Python class.
Something like the below - untested as the code sample is incomplete.
import numpy as np
from functools import partial
from threading import Thread
from multiprocessing import cpu_count
class Variance(Thread):
    def __init__(self, extra_matrices, group_num):
        Thread.__init__(self)
        self.extra_matrices = extra_matrices
        self.group_num = group_num
        self.output = None
    def run(self):
        # do some variation calculation
        # print confirmation that we have entered function, and group number
        self.output = single_group_var
num_cores = cpu_count() - 1
results = []
for g in list(range(num_groups)):
    workers = [Variance(extra_matrices, range(g))
               for _ in range(num_cores)]
    # Start threads
    for worker in workers:
        worker.start()
    # Wait for completion
    for worker in workers:
        worker.join()
    results.extend([w.output for w in workers])
print(results)
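An alternative sketch of the same thread-based idea, using concurrent.futures.ThreadPoolExecutor so that one task is submitted per group and the shared data is passed by reference. Everything here is illustrative: group_variance and variance are hypothetical, and nested lists with statistics.pvariance stand in for the real numpy arrays and variance calculation, so this only shows the structure. Any actual speedup would depend on the real calculation releasing the GIL, as discussed above.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial
from statistics import pvariance


def group_variance(ratings, group_num):
    """Variance of one group's ratings; `ratings` is shared between threads, not copied."""
    return pvariance(ratings[group_num])


def variance(ratings, num_groups, max_workers=4):
    worker = partial(group_variance, ratings)
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # One task per group; results come back in group order.
        return list(executor.map(worker, range(num_groups)))


if __name__ == "__main__":
    ratings = [[1, 2, 3, 4], [2, 2, 2, 2], [0, 10, 0, 10]]
    print(variance(ratings, len(ratings)))
```

Because the workers are threads, no serialisation of the ratings takes place, which is the memory saving the answer is after.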
I have a python program that's been running for a while, and because of an unanticipated event, I'm now unsure that it will complete within a reasonable amount of time. The data it's collected so far, however, is valuable, and I would like to recover it if possible.
Here is the relevant code
from multiprocessing.dummy import Pool as ThreadPool
def pull_details(url):
    #accesses a given URL
    #returns some data which gets appended to the results list
pool = ThreadPool(25)
results = pool.map(pull_details, urls)
pool.close()
pool.join()
So I either need to access the data that is currently in results or somehow change the source of the code (or somehow manually change the program's control) to kill the loop so it continues to the later part of the program in which the data is exported (not sure if the second way is possible).
It seems as though the first option is also quite tricky, but luckily the IDE (Spyder) I'm using indicates the value of what I assume is the location of the list in the machine's memory (0xB73EDECCL).
Is it possible to create a C program (or another python program) to access this location in memory and read what's there?
Can't you use some sort of mechanism to exchange data between the two processes, like queues or pipes?
Something like the below:
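from functools import partial
from multiprocessing import Queue
from multiprocessing.dummy import Pool as ThreadPool
def pull_details(url, q):
    # access the given URL, then publish the data as soon as it is available
    q.put(useful_data) # useful_data: placeholder for whatever was pulled from url
q = Queue()
pool = ThreadPool(25)
# map_async returns immediately, so the queue can be drained while the workers run
pool.map_async(partial(pull_details, q=q), urls)
pool.close()
results = []
for _ in urls: # one item is expected per URL
    results.append(q.get())
pool.join()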
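A related sketch, for future runs rather than the one already in progress: imap_unordered yields each result as soon as its worker finishes, so a list filled in the loop always holds whatever has completed so far. pull_details here is a hypothetical stand-in that does no real network access, and collect is a made-up helper name.

```python
import time
from multiprocessing.dummy import Pool as ThreadPool


def pull_details(url):
    """Hypothetical stand-in for the real scraper; no network access."""
    time.sleep(0.01)
    return {"url": url, "length": len(url)}


def collect(urls, results, pool_size=25):
    """Append each result to `results` as soon as its worker finishes."""
    with ThreadPool(pool_size) as pool:
        for item in pool.imap_unordered(pull_details, urls):
            results.append(item)


if __name__ == "__main__":
    urls = ["http://example.com/%d" % i for i in range(100)]
    results = []
    try:
        collect(urls, results)
    except KeyboardInterrupt:
        pass  # results gathered before the interrupt are still in the list
    print(len(results))
```

The results arrive in completion order, not input order, so each item should carry whatever is needed to identify it (here, the url).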
I have a problem running multiple processes in Python 3.
My program does the following:
1. Takes entries from an SQLite database and passes them to an input_queue
2. Creates multiple processes that take items off the input_queue, run them through a function and output the result to the output queue.
3. Creates a thread that takes items off the output_queue and prints them (this thread is obviously started before the first 2 steps)
My problem is that currently the 'function' in step 2 is only run as many times as the number of processes set, so for example if you set the number of processes to 8, it only runs 8 times then stops. I assumed it would keep running until it took all items off the input_queue.
Do I need to rewrite the function that takes the entries out of the database (step 1) into another process and then pass its output queue as an input queue for step 2?
Edit:
Here is an example of the code, I used a list of numbers as a substitute for the database entries as it still performs the same way. I have 300 items on the list and I would like it to process all 300 items, but at the moment it just processes 10 (the number of processes I have assigned)
#!/usr/bin/python3
from multiprocessing import Process,Queue
import multiprocessing
from threading import Thread
## This is the class that would be passed to the multi_processing function
class Processor:
    def __init__(self,out_queue):
        self.out_queue = out_queue
    def __call__(self,in_queue):
        data_entry = in_queue.get()
        result = data_entry*2
        self.out_queue.put(result)
#Performs the multiprocessing
def perform_distributed_processing(dbList,threads,processor_factory,output_queue):
    input_queue = Queue()
    # Create the Data processors.
    for i in range(threads):
        processor = processor_factory(output_queue)
        data_proc = Process(target = processor,
                            args = (input_queue,))
        data_proc.start()
    # Push entries to the queue.
    for entry in dbList:
        input_queue.put(entry)
    # Push stop markers to the queue, one for each thread.
    for i in range(threads):
        input_queue.put(None)
    data_proc.join()
    output_queue.put(None)
if __name__ == '__main__':
    output_results = Queue()
    def output_results_reader(queue):
        while True:
            item = queue.get()
            if item is None:
                break
            print(item)
    # Establish results collecting thread.
    results_process = Thread(target = output_results_reader,args = (output_results,))
    results_process.start()
    # Use this as a substitute for the database in the example
    dbList = [i for i in range(300)]
    # Perform multi processing
    perform_distributed_processing(dbList,10,Processor,output_results)
    # Wait for it all to finish.
    results_process.join()
A collection of processes that service an input queue and write to an output queue is pretty much the definition of a process pool.
If you want to know how to build one from scratch, the best way to learn is to look at the source code for multiprocessing.Pool, which is pretty simple Python, and very nicely written. But, as you might expect, you can just use multiprocessing.Pool instead of re-implementing it. The examples in the docs are very nice.
But really, you could make this even simpler by using an executor instead of a pool. It's hard to explain the difference (again, read the docs for both modules), but basically, a future is a "smart" result object, which means instead of a pool with a variety of different ways to run jobs and get results, you just need a dumb thing that doesn't know how to do anything but return futures. (Of course in the most trivial cases, the code looks almost identical either way…)
from concurrent.futures import ProcessPoolExecutor
def Processor(data_entry):
    return data_entry*2
def perform_distributed_processing(dbList, threads, processor_factory):
    with ProcessPoolExecutor(max_workers=threads) as executor:
        yield from executor.map(processor_factory, dbList)
if __name__ == '__main__':
    # Use this as a substitute for the database in the example
    dbList = [i for i in range(300)]
    for result in perform_distributed_processing(dbList, 8, Processor):
        print(result)
Or, if you want to handle them as they come instead of in order:
from concurrent.futures import Future, ProcessPoolExecutor, as_completed
def perform_distributed_processing(dbList, threads, processor_factory):
    with ProcessPoolExecutor(max_workers=threads) as executor:
        fs = (executor.submit(processor_factory, db) for db in dbList)
        yield from map(Future.result, as_completed(fs))
Notice that I also replaced your in-process queue and thread, because it wasn't doing anything but providing a way to interleave "wait for the next result" and "process the most recent result", and yield (or yield from, in this case) does that without all the complexity, overhead, and potential for getting things wrong.
Don't try to rewrite the whole multiprocessing library again. I think you can use any of the multiprocessing.Pool methods depending on your needs. If this is a batch job you can even use the synchronous multiprocessing.Pool.map(); only, instead of pushing to an input queue, you need to write a generator that yields input to the worker processes.
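For completeness, if the hand-rolled queue design is kept anyway: the worker in the question calls in_queue.get() exactly once, so each worker handles a single item, and only the last worker started is joined. A minimal sketch of a worker that loops until it sees the stop marker follows. It uses threading and queue.Queue so that it runs anywhere, and worker and process_all are hypothetical names. The same loop shape should carry over to Process and multiprocessing.Queue, though there the output queue is best drained before joining the workers.

```python
import queue
import threading

STOP = None  # stop marker: one is queued per worker


def worker(in_queue, out_queue):
    # Keep taking items until the stop marker arrives; a single get()
    # would handle exactly one item per worker.
    while True:
        entry = in_queue.get()
        if entry is STOP:
            break
        out_queue.put(entry * 2)


def process_all(entries, num_workers):
    in_queue, out_queue = queue.Queue(), queue.Queue()
    workers = [threading.Thread(target=worker, args=(in_queue, out_queue))
               for _ in range(num_workers)]
    for w in workers:
        w.start()
    for entry in entries:
        in_queue.put(entry)
    for _ in workers:
        in_queue.put(STOP)
    for w in workers:
        w.join()  # join every worker, not only the last one started
    return sorted(out_queue.get() for _ in range(len(entries)))


if __name__ == "__main__":
    print(len(process_all(range(300), 10)))  # 300
```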