I have a link that I want to test for robustness, for lack of a better word. What I have is code that pings the URL multiple times, sequentially:
# Testing for robustness
import requests

for i in range(100000):
    city = 'New York'
    city = '%20'.join(city.split(' '))
    res = requests.get(f'http://example.com/twofishes?query={city}')
    data = res.json()
    geo = data['interpretations'][0]['feature']['geometry']['center']
    print('pinging xtime: %s' % str(i))
    print(geo['lat'], geo['lng'])
I want to take this code but ping the link, say, 10 or 12 times at once. I don't mind the sequential pinging, but it's not as efficient as pinging multiple times at once. I feel like this is a quick modification, where the for loop comes out and some kind of pool function goes in?
Here is an example program which should work for this task. Given that I do not want to be blacklisted, I have not actually tested the code to see if it works. Regardless, it should at least be in the ballpark of what you're looking for. If you want all of the threads to actually execute at the same time, I would look into adding events (see the sketch after the code). Hope this helps.
Code
import threading
import requests
import requests.exceptions as exceptions

def stress_test(s):
    for i in range(100000):
        try:
            city = 'New York'
            city = '%20'.join(city.split(' '))
            res = s.get(f'http://example.com/twofishes?query={city}')
            data = res.json()
            geo = data['interpretations'][0]['feature']['geometry']['center']
            print('pinging xtime: %s' % str(i))
            print(geo['lat'], geo['lng'])
        except (exceptions.ConnectionError, exceptions.HTTPError, exceptions.Timeout):
            pass

if __name__ == '__main__':
    for i in range(1, 12):
        s = requests.session()
        t = threading.Thread(target=stress_test, args=(s,))
        t.start()

    for th in threading.enumerate():
        if th != threading.current_thread():
            th.join()
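On the events suggestion: here is a minimal, untested sketch of what I have in mind (an assumption on my part, reusing the same placeholder URL). Every worker blocks on a shared threading.Event, so all requests fire at roughly the same moment once the main thread releases them.

import threading
import requests

start_signal = threading.Event()

def worker(session):
    start_signal.wait()  # block until the main thread says go
    res = session.get('http://example.com/twofishes?query=New%20York')
    print(res.status_code)

threads = []
for _ in range(12):
    t = threading.Thread(target=worker, args=(requests.Session(),))
    t.start()
    threads.append(t)

start_signal.set()  # release all workers at (roughly) the same time
for t in threads:
    t.join()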
Related
I have a list of author names (more than 1k) and my Google API quota limit is 20k. I want to pass each author name into the API to get book information. When I tested my code I got a "429 Client Error: Too Many Requests for url..." error. How can I slow down my running time without stopping the application? (I'm using Python in Google Colab.)
import time
import requests

author_List = ["J. K. Rowling", "mark twain", "Emily Dickinson"]

connGoogleAPI(author_List)

def connGoogleAPI(booksData):
    key = "**************************"
    books_list = []
    col = ['Title', 'Authors', 'published Date', 'Description', 'ISBN']
    books_list.append(col)
    res = ""
    err = None
    with requests.Session() as session:
        #err= ""
        for Authors in booksData:
            params = {"q": Authors, "key": key, "maxResults": 1}
            delays = 65  # approximately 1 minute total delay time for any given author
            while True:
                try:
                    pass  # do something
                except Exception as e:
                    if err.status_code == 429:
                        #print("******")
                        if delays <= 0:
                            raise(e)  # we've spent too long delaying
                        time.sleep(1)
                        delays -= 1
                    else:
                        print("-----=")
                        raise(e)  # some other status code
            books_list.append(lookup(res, Authors))
    return books_list
You can import time and then add:
time.sleep(1)
at the end of your for loop, to pause for a second between each iteration.
You could slow down your for loop like this:
First, you need to import time.

import time

delay = 2
for author in author_List:
    newList.append(searchData(author))
    time.sleep(delay)

You can set delay to however many seconds you want each iteration to pause (here it would be 2 seconds).
You probably don't want unconditional delays which will slow down your processing unnecessarily. On the other hand, if you start getting HTTP 429 you can't be sure when or even if the server is going to allow you to continue. So you need a strategy that only introduces delays when/if required but also doesn't get into an infinite loop. Consider this:
import requests
import time

listofauthors = ['Mark Twain', 'Dan Brown', 'William Shakespeare']

with requests.Session() as session:
    for author in listofauthors:
        params = {'q': author}
        delays = 60  # approximately 1 minute total delay time for any given author
        while True:
            try:
                r = session.get('https://www.googleapis.com/books/v1/volumes', params=params)
                r.raise_for_status()
                print(r.json())
                break  # all good :-)
            except Exception as e:
                if r.status_code == 429:
                    if delays <= 0:
                        raise(e)  # we've spent too long delaying
                    time.sleep(1)
                    delays -= 1
                else:
                    raise(e)  # some other status code
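As an alternative sketch (my own suggestion, not part of the answer above): requests can hand the 429 handling to urllib3's Retry, which backs off automatically and honours any Retry-After header the server sends. The retry counts and backoff factor here are assumptions to tune.

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Retry GETs up to 5 times on HTTP 429, increasing the wait between attempts
retry = Retry(total=5, status_forcelist=[429], backoff_factor=2,
              respect_retry_after_header=True)
session = requests.Session()
session.mount('https://', HTTPAdapter(max_retries=retry))

r = session.get('https://www.googleapis.com/books/v1/volumes', params={'q': 'Mark Twain'})
r.raise_for_status()
print(r.json())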
I'm currently trying to learn web scraping and decided to scrape some discord data. Code follows:
import requests
import json

def retrieve_messages(channelid):
    num = 0
    headers = {
        'authorization': 'here we enter the authorization code'
    }
    r = requests.get(
        f'https://discord.com/api/v9/channels/{channelid}/messages?limit=100', headers=headers
    )
    jsonn = json.loads(r.text)
    for value in jsonn:
        print(value['content'], '\n')
        num = num + 1
    print('number of messages we collected is', num)

retrieve_messages('server id goes here')
The problem: when I tried changing the limit here (messages?limit=100), it apparently only accepts numbers between 0 and 100, meaning the maximum number of messages I can get is 100. I tried changing this number to 900, for example, to scrape more messages, but then I get the error TypeError: string indices must be integers.
Any ideas on how I could get, possibly, all the messages in a channel?
Thank you very much for reading!
APIs that return a bunch of records are almost always limited to some number of items.
Otherwise, if a large quantity of items is requested, the API may fail due to being out of memory.
For that purpose, most APIs implement pagination using limit, before and after parameters where:
limit: tells you how many messages to fetch
before: get messages before this message ID
after: get messages after this message ID
Discord API is no exception as the documentation tells us.
Here's how you do it:
First, you will need to query the data multiple times.
For that, you can use a while loop.
Make sure to add a condition that will prevent the loop from running indefinitely - I added a check for whether there are any messages left.
while True:
    # ... requests code
    jsonn = json.loads(r.text)
    if len(jsonn) == 0:
        break
    for value in jsonn:
        print(value['content'], '\n')
        num = num + 1
Define a variable that tracks the last message you fetched, and save the ID of the last message you printed:
def retrieve_messages(channelid):
    last_message_id = None
    while True:
        # ...
        for value in jsonn:
            print(value['content'], '\n')
            last_message_id = value['id']
            num = num + 1
Now on the first run the last_message_id is None, and on subsequent requests it has the last message you printed.
Use that to build your query
while True:
    query_parameters = f'limit={limit}'
    if last_message_id is not None:
        query_parameters += f'&before={last_message_id}'
    r = requests.get(
        f'https://discord.com/api/v9/channels/{channelid}/messages?{query_parameters}', headers=headers
    )
    # ...
Note: Discord gives you the latest messages first, so you have to use the before parameter.
Here's a fully working example of your code
import requests
import json

def retrieve_messages(channelid):
    num = 0
    limit = 10
    headers = {
        'authorization': 'auth header here'
    }
    last_message_id = None
    while True:
        query_parameters = f'limit={limit}'
        if last_message_id is not None:
            query_parameters += f'&before={last_message_id}'
        r = requests.get(
            f'https://discord.com/api/v9/channels/{channelid}/messages?{query_parameters}', headers=headers
        )
        jsonn = json.loads(r.text)
        if len(jsonn) == 0:
            break
        for value in jsonn:
            print(value['content'], '\n')
            last_message_id = value['id']
            num = num + 1
    print('number of messages we collected is', num)

retrieve_messages('server id here')
To answer this question, we must look at the discord API. Googling "discord api get messages" gets us the developer reference for the discord API. The particular endpoint you are using is documented here:
https://discord.com/developers/docs/resources/channel#get-channel-messages
The limit is documented here, along with the around, before, and after parameters. Using one of these parameters (most likely after) we can paginate the results.
In pseudocode, it would look something like this:
offset = 0
limit = 100
all_messages = []
while True:
    r = requests.get(
        f'https://discord.com/api/v9/channels/{channelid}/messages?limit={limit}&after={offset}', headers=headers
    )
    all_messages.append(extract messages from response)
    if (number of responses < limit):
        break  # We have reached the end of all the messages, exit the loop
    else:
        offset += limit
By the way, you will probably want to print(r.text) right after the response comes in so you can see what the response looks like. It will save a lot of confusion.
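For reference, here is a runnable sketch of that pseudocode (my own fleshing-out, not the answer's code). One caveat to hedge: per the linked docs, before/after take a message ID (snowflake) rather than a numeric offset, so this version pages backwards with before, as in the earlier answer; channelid and headers are placeholders.

import requests

def fetch_all_messages(channelid, headers, limit=100):
    all_messages = []
    before = None
    while True:
        params = {'limit': limit}
        if before is not None:
            params['before'] = before
        r = requests.get(
            f'https://discord.com/api/v9/channels/{channelid}/messages',
            headers=headers, params=params)
        batch = r.json()
        if not batch:
            break  # no messages left
        all_messages.extend(batch)
        before = batch[-1]['id']  # oldest message in this batch
        if len(batch) < limit:
            break  # a short page means we reached the end
    return all_messages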
Here is my solution. Feedback is welcome as I'm newish to Python. Kindly provide me w/ credit/good-luck if using this. Thank you =)
import requests

CHANNELID = 'REPLACE_ME'
HEADERS = {'authorization': 'REPLACE_ME'}
LIMIT = 100

all_messages = []
r = requests.get(f'https://discord.com/api/v9/channels/{CHANNELID}/messages?limit={LIMIT}', headers=HEADERS)
all_messages.extend(r.json())
print(f'len(r.json()) is {len(r.json())}', '\n')

while len(r.json()) == LIMIT:
    last_message_id = r.json()[-1].get('id')
    r = requests.get(f'https://discord.com/api/v9/channels/{CHANNELID}/messages?limit={LIMIT}&before={last_message_id}', headers=HEADERS)
    all_messages.extend(r.json())
    print(f'len(r.json()) is {len(r.json())} and last_message_id is {last_message_id} and len(all_messages) is {len(all_messages)}')
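A small hypothetical follow-up (not in the original answer): once the loop finishes, all_messages holds the raw message dicts, so the content can be printed the same way as in the question.

for message in all_messages:
    print(message.get('content'), '\n')
print('number of messages we collected is', len(all_messages))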
I'm scraping hundreds of URLs, each with a leaderboard of data I want, and the only difference between each URL string is a 'platform', 'region', and, lastly, the page number. There are only a few platforms and regions, but the page numbers change each day and I don't know how many there are. So that's the first function; I'm just creating lists of URLs to be requested in parallel.
If I use page=1, then the result will have table_rows > 0 in the last function. But around page=500, the requested URL still pings back, just very slowly, and then it shows an error message (no leaderboard found), the last function shows table_rows == 0, etc. The problem is that I need to get through the very last page, and I want to do this quickly, hence the ThreadPoolExecutor - but I can't cancel all the threads or processes or whatever once PAGE_LIMIT is tripped. I threw in the executor.shutdown(cancel_futures=True) just to show what I'm looking for. If nobody can help me I'll miserably remove the parallelization and scrape slowly, sadly, one URL at a time...
Thanks
from concurrent.futures import ThreadPoolExecutor
from bs4 import BeautifulSoup
import pandas
import requests
import time

PLATFORM = ['xbl', 'psn', 'atvi', 'battlenet']
REGION = ['us', 'ca']
PAGE_LIMIT = True

def leaderboardLister():
    global REGION
    global PLATFORM
    list_url = []
    for region in REGION:
        for platform in PLATFORM:
            for i in range(1, 750):
                list_url.append('https://cod.tracker.gg/warzone/leaderboards/battle-royale/' + platform + '/KdRatio?country=' + region + '&page=' + str(i))
    leaderboardExecutor(list_url, 30)

def leaderboardExecutor(urls, threads):
    global PAGE_LIMIT
    global INTERNET
    if len(urls) > 0:
        with ThreadPoolExecutor(max_workers=threads) as executor:
            while True:
                if PAGE_LIMIT == False:
                    executor.shutdown(cancel_futures=True)
                while INTERNET == False:
                    try:
                        print('bad internet')
                        requests.get("http://google.com")
                        INTERNET = True
                    except:
                        time.sleep(3)
                        print('waited')
                executor.map(scrapeLeaderboardPage, urls)

def scrapeLeaderboardPage(url):
    global PAGE_LIMIT
    checkInternet()
    try:
        page = requests.get(url)
        soup = BeautifulSoup(page.content, features='lxml')
        table_rows = soup.find_all('tr')
        if len(table_rows) == 0:
            PAGE_LIMIT = False
            print(url)
        else:
            pass
        print('success')
    except:
        INTERNET = False

leaderboardLister()
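For what it's worth, here is a minimal sketch (my assumption, not part of the question) of how shutdown(cancel_futures=True) can be combined with submit() and as_completed() to stop queuing new pages once an empty one is seen. Already-running requests still finish, and cancel_futures needs Python 3.9+.

from concurrent.futures import ThreadPoolExecutor, as_completed
from bs4 import BeautifulSoup
import requests

def count_rows(url):
    page = requests.get(url)
    soup = BeautifulSoup(page.content, features='lxml')
    return url, len(soup.find_all('tr'))

def scrape_until_empty(urls, threads=30):
    results = {}
    with ThreadPoolExecutor(max_workers=threads) as executor:
        futures = [executor.submit(count_rows, url) for url in urls]
        for future in as_completed(futures):
            url, rows = future.result()
            if rows == 0:
                executor.shutdown(cancel_futures=True)  # drop the queued pages
                break
            results[url] = rows
    return results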
I have made a script which constructs a checkout URL for Shopify websites. This is done by appending each unique product 'variant' ID to the checkout URL and then opening said URL in a web browser. To find the variant ID, I need to parse the website's sitemap, which I am currently doing in separate threads for each product; however, with each thread added, the time it takes increases by quite a lot (nearly one second).
Why is this the case? Shouldn't it take around the same time, since each thread basically does the same exact thing?
For reference, one thread takes around 2.0s, two threads 2.8s, and three threads around 3.8s.
Here is my code:
import time
import requests
from bs4 import BeautifulSoup
import webbrowser
import threading

sitemap2 = 'https://deadstock.ca/sitemap_products_1.xml'
atc_url = 'https://deadstock.ca/cart/'
# CHANGE SITEMAP TO THE CORRECT ONE (THE SITE YOU ARE SCRAPING)

variant_list = []

def add_to_cart(keywords, size):
    init = time.time()
    # Initialize session
    product_url = ''
    parse_session = requests.Session()
    response = parse_session.get(sitemap2)
    soup = BeautifulSoup(response.content, 'lxml')
    variant_id = 0
    # Find Item
    for urls in soup.find_all('url'):
        for images in urls.find_all('image:image'):
            if all(i in images.find('image:title').text.lower() for i in keywords):
                now = time.time()
                product_name = images.find('image:title').text
                print('FOUND: ' + product_name + ' - ' + str(format(now - init, '.3g')) + 's')
                product_url = urls.find("loc").text
    if product_url != '':
        response1 = parse_session.get(product_url + ".xml")
        soup = BeautifulSoup(response1.content, 'lxml')
        for variants in soup.find_all('variant'):
            if size in variants.find('title').text.lower():
                variant_id = variants.find('id', type='integer').text
                atc_link = str(variant_id) + ':1'
                print(atc_link)
                variant_list.append(atc_link)
    try:
        print("PARSED PRODUCT: " + product_name)
    except UnboundLocalError:
        print("Retrying")
        add_to_cart(keywords, size)

def open_checkout():
    url = 'https://deadstock.ca/cart/'
    for var in variant_list:
        url = url + var + ','
    webbrowser.open_new_tab(url)

# When initializing a new thread, only change the keywords in the args, and make sure you start and join the thread.
# Change sitemap in scraper.py to your website's sitemap
# If the script finds multiple items, the first item will be opened so please try to be very specific yet accurate.

def main():
    print("Starting Script")
    init = time.time()
    try:
        t1 = threading.Thread(target=add_to_cart, args=(['alltimers', 'relations', 't-shirt', 'white'], 's',))
        t2 = threading.Thread(target=add_to_cart, args=(['alltimers', 'relations', 'maroon'], 's',))
        t3 = threading.Thread(target=add_to_cart, args=(['brain', 'dead', 'melter'], 's',))
        t1.start()
        t2.start()
        t3.start()
        t1.join()
        t2.join()
        t3.join()
        print(variant_list)
        open_checkout()
    except:
        print("Product not found / not yet live. Retrying..")
        main()
    print("Time taken: " + str(time.time() - init))

if __name__ == '__main__':
    main()
Question: ... one thread takes around 2.0s, two threads 2.8s and three threads around 3.8s
Regarding your example code, you are counting the sum of all the threads' times.
As #asettouf pointed out, there is an overhead, meaning you have to pay for it.
But I assume that doing these 3 tasks threaded will still be faster than doing them one after the other.
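To see this, one could time the whole run from the outside, sequential versus threaded. A rough sketch (my assumption, reusing add_to_cart from the question) might look like this:

import time
import threading

jobs = [(['alltimers', 'relations', 't-shirt', 'white'], 's'),
        (['alltimers', 'relations', 'maroon'], 's'),
        (['brain', 'dead', 'melter'], 's')]

start = time.time()
for keywords, size in jobs:  # sequential baseline
    add_to_cart(keywords, size)
print('sequential:', time.time() - start)

start = time.time()
threads = [threading.Thread(target=add_to_cart, args=job) for job in jobs]
for t in threads:
    t.start()
for t in threads:
    t.join()
print('threaded:', time.time() - start)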
I'm trying to run this code:
import thread
import urllib2
import re

def VideoHandler(id):
    try:
        cursor = conn.cursor()
        print "Doing {0}".format(id)
        data = urllib2.urlopen("http://myblogfms2.fxp.co.il/video" + str(id) + "/").read()
        title = re.search("<span class=\"style5\"><strong>([\\s\\S]+?)</strong></span>", data).group(1)
        picture = re.search("#4F9EFF;\"><img src=\"(.+?)\" width=\"120\" height=\"90\"", data).group(1)
        link = re.search("flashvars=\"([\\s\\S]+?)\" width=\"612\"", data).group(1)
        id = id
        print "Done with {0}".format(id)
        cursor.execute("insert into videos (`title`, `picture`, `link`, `vid_id`) values('{0}', '{1}', '{2}', {3})".format(title, picture, link, id))
        print "Added {0} to the database".format(id)
    except:
        pass

x = 1
while True:
    if x != 945719:
        currentX = x
        thread.start_new_thread(VideoHandler, (currentX))
    else:
        break
    x += 1
and it says "can't start new thread"
The real reason for the error is most likely that you create way too many threads (more than 100k!!!) and hit an OS-level limit.
Your code can be improved in many ways besides this:
don't use the low level thread module, use the Thread class in the threading module.
join the threads at the end of your code
limit the number of threads you create to something reasonable: to process all elements, create a small number of threads and let each one process a subset of the whole data (this is what I propose below, but you could also adopt a producer-consumer pattern with worker threads getting their data from a queue.Queue instance)
and never, ever have an except: pass statement in your code. Or if you do, don't come crying here if your code does not work and you cannot figure out why. :-)
Here's a proposal:
from threading import Thread
import urllib2
import re

def VideoHandler(id_list):
    for id in id_list:
        try:
            cursor = conn.cursor()
            print "Doing {0}".format(id)
            data = urllib2.urlopen("http://myblogfms2.fxp.co.il/video" + str(id) + "/").read()
            title = re.search("<span class=\"style5\"><strong>([\\s\\S]+?)</strong></span>", data).group(1)
            picture = re.search("#4F9EFF;\"><img src=\"(.+?)\" width=\"120\" height=\"90\"", data).group(1)
            link = re.search("flashvars=\"([\\s\\S]+?)\" width=\"612\"", data).group(1)
            id = id
            print "Done with {0}".format(id)
            cursor.execute("insert into videos (`title`, `picture`, `link`, `vid_id`) values('{0}', '{1}', '{2}', {3})".format(title, picture, link, id))
            print "Added {0} to the database".format(id)
        except:
            import traceback
            traceback.print_exc()

conn = get_some_dbapi_connection()

threads = []
nb_threads = 8
max_id = 945718
for i in range(nb_threads):
    id_range = range(i*max_id//nb_threads, (i+1)*max_id//nb_threads + 1)
    thread = Thread(target=VideoHandler, args=(id_range,))
    threads.append(thread)
    thread.start()

for thread in threads:
    thread.join()  # wait for completion
The OS has a limit on the number of threads, so you can't create more threads than that limit allows.
A thread pool is a good choice for doing this kind of high-concurrency work; a minimal sketch follows.
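A minimal sketch of the thread-pool idea (my own, using Python 3's concurrent.futures; fetch_video is a hypothetical stand-in for the per-id work done in VideoHandler):

from concurrent.futures import ThreadPoolExecutor

def fetch_video(video_id):
    # placeholder for the urllib/regex/database work done for each id
    return video_id

with ThreadPoolExecutor(max_workers=8) as pool:
    # the pool caps concurrency at max_workers instead of spawning ~945k threads
    for result in pool.map(fetch_video, range(1, 945719)):
        pass  # process each result as the workers hand them back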