multithreading web scraping

multithreading web scraping - python

I wrote a code for getting links from web. Running this code takes about 2:20 min which is a lot since it's only a function in the code. I would like to make it more efficient. I thought about multi-threading but I have trouble understanding it deeply and applying it to this code
def get_manufacturer():
manufacturers = requests.get("https://www.gsmarena.com/")
res = re.findall(r"<li><a href=\"samsung-phones-9.php\">.+\n", manufacturers.text)
manufacturer_links = re.findall(r"<li><a href=\"(.+?)\">", res[0])
final_list = []
for i in range(len(manufacturer_links)):
final_list.append("https://www.gsmarena.com/" + manufacturer_links[i])
# find pages
for i in final_list:
req = requests.get(i)
res2 = re.findall(r"<strong>1</strong>(.+)</div>", req.text)
for k in res2:
if k is not None:
pages = re.findall(r"<a href=\"(.+?)\">.<\/a>", res2[0])
for j in range(len(pages)):
final_list.append("https://www.gsmarena.com/" + pages[j])
return final_list

You can run your for loop in parallel below is an example
import multiprocessing as mul
def calcIntOfnth(i,ppStr,c,znot):
pool = mul.Pool(mul.cpu_count())
results = pool.starmap(calcIntOfnth, [(i,ppStr,c,znot) for i in range(k)]) # other parameters are local to this statement i.e. ppStr,c,znot,k
pool.close()
You'll be needed to re-write one of your for loop as a function and use Pool object or other similiar way to run it in parallel.

Related

How can you speed up repeated API calls?

For a detailed understanding, I have attached a link of file:
Actual file
I have data in the list, that has similar syntax like:
i = [a.b>c.d , e.f.g>h.i.j ]
e.g. Each element of the list has multiple sub-elements separated by "." and ">".
for i in range(0, len(l)):
reac={}
reag={}
t = l[i].split(">")
REAC = t[0]
Reac = REAC.split(".")
for o in range(len(Reac)):
reaco = "https://ai.chemistryinthecloud.com/smilies/" + Reac[o]
respo = requests.get(reaco)
reac[o] ={"Smile":Reac[o],"Details" :respo.json()}
if (len(t) != 1):
REAG = t[1]
Reag = REAG.split(".")
for k in range(len(Reag)):
reagk = "https://ai.chemistryinthecloud.com/smilies/" + Reag[k]
repo = requests.get(reagk)
reag[k] = {"Smile": Reag[k], "Details" :repo.json()}
res = {"Reactants": list(reac.values()), "Reagents": list(reag.values())}
boo.append(res)
else:
res = {"Reactants": list(reac.values()), "Reagents": "No reagents"}
boo.append(res)
We have separated all the elements and for each element, we are calling 3rd party API. That consumes too much time.
Is there any way to reduce this time and optimize for the loop?
It takes around 1 minute to respond. We want to optimize to 5-10 seconds.

Just do this instead. I changed the format of your output because it made things unnecessisarily complicated, and removed a lot of non-pythonic things you were doing. The second time you run should be instantaneous, and the first time should be much faster if you have repeated values across equations or sides of the reaction.
# pip install cachetools diskcache
from cachetools import cached
from diskcache import Cache
BASE_URL = "https://ai.chemistryinthecloud.com/smilies/"
CACHEDIR = "api_cache"
#cached(cache=Cache(CACHEDIR))
def get_similies(value):
return requests.get(BASE_URL + value).json()
def parse_eq(eq):
reac, _, reag = (set(v.split(".")) for v in eq.partition(">"))
return {
"reactants": {i: get_similies(i) for i in reac},
"reagents": {i: get_similies(i) for i in reag}
}
result = [parse_eq(eq) for eq in eqs]

How can i iterate over a large list quickly?

I am trying to iterate over a large list. I want a method that can iterate this list quickly. But it takes much time to iterate. Is there any method to iterate quickly or python is not built to do this.
My code snippet is :-
for i in THREE_INDEX:
if check_balanced(rc, pc):
print('balanced')
else:
rc, pc = equation_suffix(rc, pc, i)
Here THREE_INDEX has a length of 117649. It takes much time to iterate over this list, is there any method to iterate it quicker. But it takes around 4-5 minutes to iterate
equation_suffix functions:
def equation_suffix(rn, pn, suffix_list):
len_rn = len(rn)
react_suffix = suffix_list[: len_rn]
prod_suffix = suffix_list[len_rn:]
for re in enumerate(rn):
rn[re[0]] = add_suffix(re[1], react_suffix[re[0]])
for pe in enumerate(pn):
pn[pe[0]] = add_suffix(pe[1], prod_suffix[pe[0]])
return rn, pn
check_balanced function:
def check_balanced(rl, pl):
total_reactant = []
total_product = []
reactant_name = []
product_name = []
for reactant in rl:
total_reactant.append(separate_num(separate_brackets(reactant)))
for product in pl:
total_product.append(separate_num(separate_brackets(product)))
for react in total_reactant:
for key in react:
val = react.get(key)
val_dict = {key: val}
reactant_name.append(val_dict)
for prod in total_product:
for key in prod:
val = prod.get(key)
val_dict = {key: val}
product_name.append(val_dict)
reactant_name = flatten_dict(reactant_name)
product_name = flatten_dict(product_name)
for elem in enumerate(reactant_name):
val_r = reactant_name.get(elem[1])
val_p = product_name.get(elem[1])
if val_r == val_p:
if elem[0] == len(reactant_name) - 1:
return True
else:
return False

I believe the reason why "iterating" the list take a long time is due to the methods you are calling inside the for loop. I took out the methods just to test the speed of the iteration, it appears that iterating through a list of size 117649 is very fast. Here is my test script:
import time
start_time = time.time()
new_list = [(1, 2, 3) for i in range(117649)]
end_time = time.time()
print(f"Creating the list took: {end_time - start_time}s")
start_time = time.time()
for i in new_list:
pass
end_time = time.time()
print(f"Iterating the list took: {end_time - start_time}s")
Output is:
Creating the list took: 0.005337953567504883s
Iterating the list took: 0.0035648345947265625s
Edit: time() returns second.

General for loops aren't an issue, but using them to build (or rebuild) lists is usually slower than using list comprehensions (or in some cases, map/filter, though those are advanced tools that are often a pessimization).
Your functions could be made significantly simpler this way, and they'd get faster to boot. Example rewrites:
def equation_suffix(rn, pn, suffix_list):
prod_suffix = suffix_list[len(rn):]
# Change `rn =` to `rn[:] = ` if you must modify the caller's list as in your
# original code, not just return the modified list (which would be fine in your original code)
rn = [add_suffix(r, suffix) for r, suffix in zip(rn, suffix_list)] # No need to slice suffix_list; zip'll stop when rn exhausted
pn = [add_suffix(p, suffix) for p, suffix in zip(pn, prod_suffix)]
return rn, pn
def check_balanced(rl, pl):
# These can be generator expressions, since they're iterated once and thrown away anyway
total_reactant = (separate_num(separate_brackets(reactant)) for reactant in rl)
total_product = (separate_num(separate_brackets(product)) for product in pl)
reactant_name = []
product_name = []
# Use .items() to avoid repeated lookups, and concat simple listcomps to reduce calls to append
for react in total_reactant:
reactant_name += [{key: val} for key, val in react.items()]
for prod in total_product:
product_name += [{key: val} for key, val in prod.items()]
# These calls are suspicious, and may indicate optimizations to be had on prior lines
reactant_name = flatten_dict(reactant_name)
product_name = flatten_dict(product_name)
for i, (elem, val_r) in enumerate(reactant_name.items()):
if val_r == product_name.get(elem):
if i == len(reactant_name) - 1:
return True
else:
# I'm a little suspicious of returning False the first time a single
# key's value doesn't match. Either it's wrong, or it indicates an
# opportunity to write short-circuiting code that doesn't have
# to fully construct reactant_name and product_name when much of the time
# there will be an early mismatch
return False
I'll also note that using enumerate without unpacking the result is going to get worse performance, and more cryptic code; in this case (and many others), enumerate isn't needed, as listcomps and genexprs can accomplish the same result without knowing the index, but when it is needed, always unpack, e.g. for i, elem in enumerate(...): then using i and elem separately will always run faster than for packed in enumerate(...): and using packed[0] and packed[1] (and if you have more useful names than i and elem, it'll be much more readable to boot).

Parallel Processing in Python with nested loop

Due to performance issue, i would like to run in parallel my function in python :
import multiprocessing as mp
source_nodes = [10413173, 10414530, 10414530, 10437199]
sink_nodes = [10420346, 10438770, 10438711, 10414530, 10436258]
path =[]
def createpath(source,sink):
for i in source:
for j in sink:
path = path + list(nx.all_simple_paths(Directed_G,i,j))
return path
From my understanding i must give 1 iterable to apply function. but my idea was to do something like :
results = [pool.apply(createpath, args=(source_nodes, sink_nodes))]
And then don't give any iterable object to applyfunction
I managed to get it work, but i don't think it run on parallel.
Do you think i should include the apply function inside the first loop ?

from multiprocessing import Pool
source_nodes = [1,2,3,4,5,6]
sink_nodes = [1,1,1,1,1,1,1,1,1]
def sum_values(parameter_tuple):
source,sink, start, stop = parameter_tuple
out = 0
for i in range(start, stop):
val_i = source[i]
for j in sink:
out += val_i*j
return out
if __name__ == "__main__":
params = (source_nodes, sink_nodes, 0, 6)
print(sum_values(params))
with Pool(2) as p:
print(p.map(sum_values, [
(source_nodes, sink_nodes, 0, 3),
(source_nodes, sink_nodes, 3, 6),
]))
You can try to run this one. This runs parallel with map pattern on pool of 2 threads. In this case your output result is the sum of result of each process from pool.

Multi-threaded Python application slower than single-threaded implementation

I wrote this program to properly learn how to use multi-threading. I want to implement something similar to this in my own program:
import numpy as np
import time
import os
import math
import random
from threading import Thread
def powExp(x, r):
for c in range(x.shape[1]):
x[r][c] = math.pow(100, x[r][c])
def main():
print()
rows = 100
cols = 100
x = np.random.random((rows, cols))
y = x.copy()
start = time.time()
threads = []
for r in range(x.shape[0]):
t = Thread(target = powExp, args = (x, r))
threads.append(t)
t.start()
for t in threads:
t.join()
end = time.time()
print("Multithreaded calculation took {n} seconds!".format(n = end - start))
start = time.time()
for r in range(y.shape[0]):
for c in range(y.shape[1]):
y[r][c] = math.pow(100, y[r][c])
end = time.time()
print("Singlethreaded calculation took {n} seconds!".format(n = end - start))
print()
randRow = random.randint(0, rows - 1)
randCol = random.randint(0, cols - 1)
print("Checking random indices in x and y:")
print("x[{rR}][{rC}]: = {n}".format(rR = randRow, rC = randCol, n = x[randRow][randCol]))
print("y[{rR}][{rC}]: = {n}".format(rR = randRow, rC = randCol, n = y[randRow][randCol]))
print()
for r in range(x.shape[0]):
for c in range(x.shape[1]):
if(x[r][c] != y[r][c]):
print("ERROR NO WORK WAS DONE")
print("x[{r}][{c}]: {n} == y[{r}][{c}]: {ny}".format(
r = r,
c = c,
n = x[r][c],
ny = y[r][c]
))
quit()
assert(np.array_equal(x, y))
if __name__ == main():
main()
As you can see from the code the goal here is to parallelize the operation math.pow(100, x[r][c]) by creating a thread for every column. However this code is extremely slow, a lot slower than single-threaded versions.
Output:
Multithreaded calculation took 0.026447772979736328 seconds!
Singlethreaded calculation took 0.006798267364501953 seconds!
Checking random indices in x and y:
x[58][58]: = 9.792315687115973
y[58][58]: = 9.792315687115973
I searched through stackoverflow and found some info about the GIL forcing python bytecode to be executed on a single core only. However I'm not sure that this is in fact what is limiting my parallelization. I tried rearranging the parallelized for-loop using pools instead of threads. Nothing seems to be working.
Python code performance decreases with threading
EDIT: This thread discusses the same issue. Is it completely impossible to increase performance using multi-threading in python because of the GIL? Is the GIL causing my slowdowns?
EDIT 2 (2017-01-18): So from what I can gather after searching for quite a bit online it seems like python is really bad for parallelism. What I'm trying to do is parellelize a python function used in a neural network implemented in tensorflow...it seems like adding a custom op is the way to go.

The number of issues here is quite... numerous. Too many (system!) threads that do too little work, the GIL, etc. This is what I consider a really good introduction to parallelism in Python:
https://www.youtube.com/watch?v=MCs5OvhV9S4
Live coding is awesome.

Python multiprocessing and shared numpy array

I have a problem, which is similar to this:
import numpy as np
C = np.zeros((100,10))
for i in range(10):
C_sub = get_sub_matrix_C(i, other_args) # shape 10x10
C[i*10:(i+1)*10,:10] = C_sub
So, apparently there is no need to run this as a serial calculation, since each submatrix can be calculated independently.
I would like to use the multiprocessing module and create up to 4 processes for the for loop.
I read some tutorials about multiprocessing, but wasn't able to figure out how to use this to solve my problem.
Thanks for your help

A simple way to parallelize that code would be to use a Pool of processes:
pool = multiprocessing.Pool()
results = pool.starmap(get_sub_matrix_C, ((i, other_args) for i in range(10)))
for i, res in enumerate(results):
C[i*10:(i+1)*10,:10] = res
I've used starmap since the get_sub_matrix_C function has more than one argument (starmap(f, [(x1, ..., xN)]) calls f(x1, ..., xN)).
Note however that serialization/deserialization may take significant time and space, so you may have to use a more low-level solution to avoid that overhead.
It looks like you are running an outdated version of python. You can replace starmap with plain map but then you have to provide a function that takes a single parameter:
def f(args):
return get_sub_matrix_C(*args)
pool = multiprocessing.Pool()
results = pool.map(f, ((i, other_args) for i in range(10)))
for i, res in enumerate(results):
C[i*10:(i+1)*10,:10] = res

The following recipe perhaps can do the job. Feel free to ask.
import numpy as np
import multiprocessing
def processParallel():
def own_process(i, other_args, out_queue):
C_sub = get_sub_matrix_C(i, other_args)
out_queue.put(C_sub)
sub_matrices_list = []
out_queue = multiprocessing.Queue()
other_args = 0
for i in range(10):
p = multiprocessing.Process(
target=own_process,
args=(i, other_args, out_queue))
procs.append(p)
p.start()
for i in range(10):
sub_matrices_list.extend(out_queue.get())
for p in procs:
p.join()
return sub_matrices_list
C = np.zeros((100,10))
result = processParallel()
for i in range(10):
C[i*10:(i+1)*10,:10] = result[i]

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

multithreading web scraping - python

Related

How can you speed up repeated API calls?

How can i iterate over a large list quickly?

Parallel Processing in Python with nested loop

Multi-threaded Python application slower than single-threaded implementation

Python multiprocessing and shared numpy array

Categories

Resources