Threading still takes a very long time - python

I have made a script which constructs a checkout URL for Shopify websites. It does this by appending each unique product 'variant' ID to the checkout URL and then opening that URL in a web browser. To find the variant ID, I need to parse the website's sitemap, which I am currently doing in a separate thread for each product. However, with each thread added, the time it takes increases by quite a lot (nearly one second).
Why is this the case? Shouldn't it take around the same time, since each thread does basically the same thing?
For reference, one thread takes around 2.0s, two threads 2.8s, and three threads around 3.8s.
Here is my code:
import time
import requests
from bs4 import BeautifulSoup
import webbrowser
import threading

sitemap2 = 'https://deadstock.ca/sitemap_products_1.xml'
atc_url = 'https://deadstock.ca/cart/'
# CHANGE SITEMAP TO THE CORRECT ONE (THE SITE YOU ARE SCRAPING)

variant_list = []

def add_to_cart(keywords, size):
    init = time.time()
    # Initialize session
    product_url = ''
    parse_session = requests.Session()
    response = parse_session.get(sitemap2)
    soup = BeautifulSoup(response.content, 'lxml')
    variant_id = 0
    # Find Item
    for urls in soup.find_all('url'):
        for images in urls.find_all('image:image'):
            if all(i in images.find('image:title').text.lower() for i in keywords):
                now = time.time()
                product_name = images.find('image:title').text
                print('FOUND: ' + product_name + ' - ' + str(format(now - init, '.3g')) + 's')
                product_url = urls.find("loc").text

    if product_url != '':
        response1 = parse_session.get(product_url + ".xml")
        soup = BeautifulSoup(response1.content, 'lxml')
        for variants in soup.find_all('variant'):
            if size in variants.find('title').text.lower():
                variant_id = variants.find('id', type='integer').text
                atc_link = str(variant_id) + ':1'
                print(atc_link)
                variant_list.append(atc_link)

    try:
        print("PARSED PRODUCT: " + product_name)
    except UnboundLocalError:
        print("Retrying")
        add_to_cart(keywords, size)

def open_checkout():
    url = 'https://deadstock.ca/cart/'
    for var in variant_list:
        url = url + var + ','
    webbrowser.open_new_tab(url)

# When initializing a new thread, only change the keywords in the args, and make sure you start and join the thread.
# Change sitemap in scraper.py to your websites' sitemap
# If the script finds multiple items, the first item will be opened so please try to be very specific yet accurate.

def main():
    print("Starting Script")
    init = time.time()
    try:
        t1 = threading.Thread(target=add_to_cart, args=(['alltimers', 'relations', 't-shirt', 'white'], 's',))
        t2 = threading.Thread(target=add_to_cart, args=(['alltimers', 'relations', 'maroon'], 's',))
        t3 = threading.Thread(target=add_to_cart, args=(['brain', 'dead', 'melter'], 's',))
        t1.start()
        t2.start()
        t3.start()
        t1.join()
        t2.join()
        t3.join()
        print(variant_list)
        open_checkout()
    except:
        print("Product not found / not yet live. Retrying..")
        main()
    print("Time taken: " + str(time.time() - init))

if __name__ == '__main__':
    main()

Question: ... one thread takes around 2.0s, two threads 2.8s and three threads around 3.8s
Regarding your example code: each thread times itself while all the other threads are running, so the numbers you see reflect the shared cost of all threads, not independent runs.
As asettouf pointed out, there is overhead in creating and scheduling threads, meaning you have to pay for it.
But I would still assume that doing these 3 tasks threaded is faster than doing them one after the other.
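To make that concrete, the comparison that matters is the total wall-clock time of the whole batch, measured once from the outside, against a sequential run of the same calls. Below is a minimal sketch of that measurement; fetch() is a generic stand-in for add_to_cart, and the sitemap URL is just reused as a convenient test target:

import time
import threading
import requests

def fetch(url):
    # stand-in for add_to_cart: one blocking download
    requests.get(url)

urls = ['https://deadstock.ca/sitemap_products_1.xml'] * 3

# Sequential baseline: the three calls simply add up.
start = time.time()
for u in urls:
    fetch(u)
print('sequential:', time.time() - start)

# Threaded: each thread's own timer reports more than a lone run would,
# because the downloads share bandwidth and the GIL, but the total
# wall-clock time for the batch is what actually matters.
start = time.time()
threads = [threading.Thread(target=fetch, args=(u,)) for u in urls]
for t in threads:
    t.start()
for t in threads:
    t.join()
print('threaded:', time.time() - start)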

Related

BFS on wikipedia pages is taking very long - Can someone help me analyze my code's runtime?

I am trying to perform BFS on Wikipedia pages. I believe I am implementing this correctly and in the best possible way run-time wise (keeping it to one thread), but it is taking quite a long time to find a connection between two articles. Here is my implementation:
import time
import requests
from collections import deque
from bs4 import BeautifulSoup

marked = set()
queue = deque()
count = 0

def get_wikipedia_page(wiki_link):
    url = BASE_URL + wiki_link  # BASE_URL is defined elsewhere in the original script
    time.sleep(0.1)
    req = requests.get(url)
    soup = BeautifulSoup(req.text, 'html.parser')
    return soup

def backtrace(parent, start, end):
    boolean = False
    ret = []
    while boolean == False:
        if end in parent:
            ret.append(parent[end])
            end = parent[end]
            if end == start:
                break
    ret.reverse()
    return ret

def bfs(first_wiki_link, second_wiki_link):
    print('Searching starting from ' + first_wiki_link + ' looking for ' + second_wiki_link + '...')
    parent = {}
    queue.append(first_wiki_link)
    start_time = time.time()
    # current_parent = first_wiki_link
    while queue:
        link = queue.pop()
        current_parent = link
        link_list = list(filter(lambda c: c.get('href') != None and c.get('href').startswith('/wiki/') and c.get('href') != '/wiki/Main_Page' and ':' not in c.get('href'), get_wikipedia_page(link).findAll('a')))
        for link in link_list:
            href = link.get('href')
            if not marked.__contains__(href):
                parent[href] = current_parent
                marked.add(href)
                queue.append(href)
                if href == second_wiki_link:
                    ret = backtrace(parent, first_wiki_link, second_wiki_link)
                    print("connection found")
                    end_time = time.time()
                    print("found a connection in: " + str(end_time - start_time) + " seconds.")
                    print("the path is " + str(len(ret)) + " pages long")
                    print(ret)
                    return True
It takes, sometimes, a few minutes before finding a match. Is this expected due to how big wikipedia is? Or am I messing something up here and am performing it in a non optimal way? Or, could it be beautiful soup is running too slow?
You are effectively doing a DFS, not a BFS. A DFS on a large graph like Wikipedia can take a very long time to get around to the close neighbors of a node and will most likely lead to longer times to find a connection, because chances are that those two pages are somehow related and thus close to each other.
This part of the code:
while queue:
    link = queue.pop()  # <---- look at this
    current_parent = link
    link_list = list(filter...
    for link in link_list:
This makes it a DFS, because you are using the deque like a stack. A BFS needs a FIFO (first in, first out) data structure, while pop() gives you LIFO (last in, first out), which is what a DFS does.
This means that the first neighbor of the first page is only looked at after you have looked at each and every other page you have queued up from Wikipedia.
To solve it:
while queue:
    link = queue.popleft()  # <---- look at this
    current_parent = link
    link_list = list(filter...
    for link in link_list:
This will look at the neighbors first and proceed breadth-first, as you intend it to.
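As a quick illustration of the two disciplines, collections.deque supports both; only popleft() gives the FIFO order a BFS needs:

from collections import deque

q = deque([1, 2, 3])
print(q.pop())       # 3 -> LIFO (stack), the order a DFS explores
q = deque([1, 2, 3])
print(q.popleft())   # 1 -> FIFO (queue), the order a BFS explores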

Can't Stop ThreadPoolExecutor

I'm scraping hundreds of URLs, each with a leaderboard of data I want, and the only difference between the URL strings is the 'platform', 'region', and, lastly, the page number. There are only a few platforms and regions, but the page numbers change each day and I don't know how many there are. So that's the first function: I'm just creating lists of URLs to be requested in parallel.
If I use page=1, the result will contain table_rows > 0 in the last function. But around page=500, the requested URL still pings back, only very slowly, and then it shows an error message, no leaderboard found, and the last function shows table_rows == 0, etc. The problem is that I need to get through to the very last page, and I want to do this quickly, hence the ThreadPoolExecutor - but I can't cancel all the threads once PAGE_LIMIT is tripped. I threw in executor.shutdown(cancel_futures=True) just to show what I'm looking for. If nobody can help me, I'll miserably remove the parallelization and scrape slowly, sadly, one URL at a time...
Thanks
from concurrent.futures import ThreadPoolExecutor
from bs4 import BeautifulSoup
import pandas
import requests

PLATFORM = ['xbl', 'psn', 'atvi', 'battlenet']
REGION = ['us', 'ca']
PAGE_LIMIT = True

def leaderboardLister():
    global REGION
    global PLATFORM
    list_url = []
    for region in REGION:
        for platform in PLATFORM:
            for i in range(1, 750):
                list_url.append('https://cod.tracker.gg/warzone/leaderboards/battle-royale/' + platform + '/KdRatio?country=' + region + '&page=' + str(i))
    leaderboardExecutor(list_url, 30)

def leaderboardExecutor(urls, threads):
    global PAGE_LIMIT
    global INTERNET
    if len(urls) > 0:
        with ThreadPoolExecutor(max_workers=threads) as executor:
            while True:
                if PAGE_LIMIT == False:
                    executor.shutdown(cancel_futures=True)
                while INTERNET == False:
                    try:
                        print('bad internet')
                        requests.get("http://google.com")
                        INTERNET = True
                    except:
                        time.sleep(3)
                        print('waited')
                executor.map(scrapeLeaderboardPage, urls)

def scrapeLeaderboardPage(url):
    global PAGE_LIMIT
    checkInternet()
    try:
        page = requests.get(url)
        soup = BeautifulSoup(page.content, features='lxml')
        table_rows = soup.find_all('tr')
        if len(table_rows) == 0:
            PAGE_LIMIT = False
            print(url)
        else:
            pass
        print('success')
    except:
        INTERNET = False

leaderboardLister()
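One way to get the early-exit behaviour described above is to submit the pages individually, keep the returned Future objects, and cancel whatever has not started once an empty page comes back. This is only a sketch under the assumption that the first empty page marks the end; page_has_rows and scrape_until_empty are illustrative names, not from the original code, cancel_futures requires Python 3.9+, and cancel() cannot stop a future that is already running:

from concurrent.futures import ThreadPoolExecutor, as_completed
from bs4 import BeautifulSoup
import requests

def page_has_rows(url):
    # fetch one leaderboard page and report whether its table has any rows
    soup = BeautifulSoup(requests.get(url).content, features='lxml')
    return url, len(soup.find_all('tr')) > 0

def scrape_until_empty(urls, threads=30):
    with ThreadPoolExecutor(max_workers=threads) as executor:
        futures = [executor.submit(page_has_rows, url) for url in urls]
        for future in as_completed(futures):
            url, has_rows = future.result()
            if not has_rows:
                print('empty page, stopping at:', url)
                for f in futures:
                    f.cancel()   # only affects futures that have not started yet
                break
    # leaving the with-block still waits for the requests already in flight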

python making this function faster

I've been struggling for some time now trying to figure out how to make this faster somehow.
The code:
def get_resellers(sellerid, price, userassetid, id):
    data = {"expectedCurrency": 1, "expectedPrice": price, "expectedSellerId": sellerid, "userAssetId": userassetid}
    headers = {"X-CSRF-TOKEN": csrftoken}
    k = requests.post(f"https://economy.roblox.com/v1/purchases/products/{id}", data=data, headers=headers, cookies=cookies)

def check_price(id):
    while True:
        try:
            t0 = time.time()
            soup = BeautifulSoup(requests.get(f"https://www.roblox.com/catalog/{id}").content, 'html.parser')
            data_expected_price, data_expected_seller_id, data_userasset_id = soup.select_one('[data-expected-price]')['data-expected-price'], soup.select_one('[data-expected-seller-id]')['data-expected-seller-id'], soup.select_one('[data-lowest-private-sale-userasset-id]')['data-lowest-private-sale-userasset-id']
            t1 = time.time()
            total = t1 - t0
            print(total)
            if int(data_expected_price) < 0.7 * int(data_expected_price):
                get_resellers(data_expected_seller_id, data_expected_price, data_userasset_id, id)
        except:
            pass
Is there any faster way to do it, to extract the data, or to make the HTTP request, etc.? Anything helps!
Also: it takes about 0.7 seconds to price-check and buy, since it needs to load the site every time. Is there any way to do that faster?
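Two changes that often shave time off a polling loop like this, offered as a sketch rather than a verified fix: reuse one requests.Session so the connection stays open between polls, and parse with lxml, which is typically faster than html.parser (fetch_catalog_page is just an illustrative helper name):

import requests
from bs4 import BeautifulSoup

session = requests.Session()  # keeps the connection alive between polls

def fetch_catalog_page(item_id):
    html = session.get(f"https://www.roblox.com/catalog/{item_id}").content
    return BeautifulSoup(html, 'lxml')  # lxml is typically faster than html.parser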

Python: Pinging a URL multiple times at once for testing

I have a link that I want to test for robustness, for lack of a better word. What I have is code that pings the URL multiple times, sequentially:
# Testing for robustness
for i in range(100000):
    city = 'New York'
    city = '%20'.join(city.split(' '))
    res = requests.get(f'http://example.com/twofishes?query={city}')
    data = res.json()
    geo = data['interpretations'][0]['feature']['geometry']['center']
    print('pinging xtime: %s ' % str(i))
    print(geo['lat'], geo['lng'])
I want to take this code but ping the link, say, 10 or 12 times at once. I don't mind the sequential pinging, but it's not as efficient as pinging multiple times at once. I feel like this is a quick modification, where the for loop comes out and a PULL function goes in?
Here is an example program which should work for this task. Given that I do not want to be blacklisted, I have not actually tested the code to see if it works. Regardless, it should at least be in the ballpark of what you're looking for. If you want to actually have all of the threads execute at the same time, I would look into adding events. Hope this helps.
Code
import threading
import requests
import requests.exceptions as exceptions

def stress_test(s):
    for i in range(100000):
        try:
            city = 'New York'
            city = '%20'.join(city.split(' '))
            res = s.get(f'http://example.com/twofishes?query={city}')
            data = res.json()
            geo = data['interpretations'][0]['feature']['geometry']['center']
            print('pinging xtime: %s ' % str(i))
            print(geo['lat'], geo['lng'])
        except (exceptions.ConnectionError, exceptions.HTTPError, exceptions.Timeout):
            pass

if __name__ == '__main__':
    for i in range(1, 12):
        s = requests.session()
        t = threading.Thread(target=stress_test, args=(s,))
        t.start()
    for th in threading.enumerate():
        if th != threading.current_thread():
            th.join()
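A sketch of the "adding events" idea from the answer: create one threading.Event, have every worker block on it, and set it only after all threads have been created, so the requests start at roughly the same moment (worker and start_signal are illustrative names):

import threading
import requests

start_signal = threading.Event()

def worker(session, url):
    start_signal.wait()   # block until every thread has been created
    session.get(url)

if __name__ == '__main__':
    threads = []
    for _ in range(12):
        s = requests.Session()
        t = threading.Thread(target=worker, args=(s, 'http://example.com/twofishes?query=New%20York'))
        t.start()
        threads.append(t)
    start_signal.set()    # release all workers at once
    for t in threads:
        t.join()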

Multiprocessing doesn't work in Python web-scraping

I'm done with the web-scraping part using BeautifulSoup and have successfully saved the parsed data into CSV files, but I want to speed up the process, so I used multiprocessing. However, there is no difference after I apply multiprocessing in the script. Here is my code:
rootPath = '....'
urlp1 = "https://www.proteinatlas.org/"

try:
    df1 = pd.read_csv(rootPath + "cancer_list1_2(1).csv", header=0)
except Exception as e:
    print("File " + f + " doesn't exist")
    print(str(e))
    sys.exit()

cancer_list = df1.as_matrix().tolist()
# [["bcla_gene","beast+cancer"], ...]

URLs = []
for cancer in cancer_list:
    urlp2 = "/pathology/tissue/" + cancer[1]
    f = cancer[0]
    try:
        df1 = pd.read_csv(rootPath + f + ".csv", header=0)
    except Exception as e:
        print("File " + f + " doesn't exist")
        print(str(e))
        sys.exit()
    ...  # list of urls

def scrape(url, output_path):
    page = urlopen(URL)
    soup = BeautifulSoup(page, 'html.parser')
    item_text = soup.select('#scatter6001 script')[0].text
    table = soup.find_all('table', {'class': 'noborder dark'})
    df1 = pd.read_html(str(table), header=0)
    df1 = pd.DataFrame(df1[0])
    Number = soup.find('th', text="Number of samples").find_next_sibling("td").text
    ...
    # function of scraping

if __name__ == "__main__":
    Parallel(n_jobs=-1)(scrape(url, output_path) for url in URLs)
I've just updated the code, and the problem now is that CPU utilization reaches 100% only at the beginning but soon drops to 1%. I'm quite confused about that.
Without going into any detail in your code: you may benefit from having a look at the joblib module.
Pseudocode:
import joblib

if __name__ == "__main__":
    URLs = ["URL1", "URL2", "URL2", ...]
    Parallel(n_jobs=-1)(scrape(url, output_path) for url in URLs)
Refactoring your code may be necessary, because joblib only works if no code runs outside any def: or the if __name__ == "__main__": branch.
n_jobs=-1 will start a number of processes equal to the number of cores on your machine. For further details, refer to joblib's documentation.
Using this approach together with selenium/geckodriver, it is possible to scrape a pool of 10k URLs in less than an hour, depending on your machine (I usually open 40-50 processes on an octa-core machine with 64 GB of RAM).
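One detail worth double-checking with this approach: joblib's Parallel expects each call to be wrapped in delayed(); written as scrape(url, output_path), the calls run one by one in the parent process while the generator is consumed, so nothing is actually distributed. A minimal corrected sketch, assuming scrape is defined at module level as above and the values are illustrative placeholders:

from joblib import Parallel, delayed

def scrape(url, output_path):
    ...  # the scraping function defined above

if __name__ == "__main__":
    URLs = ["URL1", "URL2", "URL3"]   # illustrative placeholders
    output_path = "output/"           # illustrative placeholder
    Parallel(n_jobs=-1)(delayed(scrape)(url, output_path) for url in URLs)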
