Speed up Python web scraping

I am writing this Python script to scrape a website and collect information. After I enter the date and the number of games on that date, the script goes to the site, scrapes a specific table, and saves each game to a CSV file. The table has 12 rows, which is why I have hardcoded that in the for loop.
The script works, but I would like to ask you experts for suggestions to optimize and speed it up. I thought using concurrent.futures would speed it up, but it doesn't give an obvious improvement.
It would be great if anyone could help. At the moment, this script takes 20-30 seconds to complete one date.
Thank you very much for your time!
import concurrent.futures
import pandas as pd
from requests_html import HTMLSession
import requests_cache

session = HTMLSession()
requests_cache.install_cache(expire_after=3600)

game_date = input("Please input the date of the game that you want to scrape (in YYYY/MM/DD): ")
game_no = int(input("Please input how many games on that date: "))

def split_list(big_list, chunk_size):
    return [big_list[i:i + chunk_size] for i in range(0, len(big_list), chunk_size)]

def get_game_result(game):
    print(f"Processing game {game}")
    url = f"https://example.com{game_date}&{game}"  # example link
    response = session.get(url)
    # Render the JavaScript on the page (waits 5 seconds before parsing)
    response.html.render(sleep=5, keep_page=True, scrolldown=1)
    row_body = response.html.xpath("/html/body/div[1]/div[3]/div[2]/div[2]/div[2]/div[5]/table/tbody/tr[1]")
    final_list = []
    # Walk rows tr[1] .. tr[12], one XPath query per row
    for i in range(2, 14):
        for item in row_body:
            item_table = item.text.split("\n")
            final_list.append(item_table)
        row_body = response.html.xpath(f"/html/body/div[1]/div[3]/div[2]/div[2]/div[2]/div[5]/table/tbody/tr[{i}]")
    df = pd.DataFrame(final_list)
    df.to_csv(f"game_{game}.csv", index=False, header=False)
    print(f"Finished processing game {game}")

with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
    results = [executor.submit(get_game_result, game) for game in range(1, game_no + 1)]
    for f in concurrent.futures.as_completed(results):
        f.result()
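One direction that might help (a rough sketch only, untested, since the URL above is a placeholder): query all the table rows in a single XPath call instead of twelve separate ones, and note that sleep=5 inside render() is five seconds of pure waiting per game, so each rendered page costs at least that much regardless of threading.

def get_game_result(game):
    url = f"https://example.com{game_date}&{game}"  # placeholder link from the question
    response = session.get(url)
    response.html.render(sleep=2, scrolldown=1)  # shorter fixed wait, if the page loads in time
    # One query for every <tr> in the table instead of one query per row
    rows = response.html.xpath("/html/body/div[1]/div[3]/div[2]/div[2]/div[2]/div[5]/table/tbody/tr")
    final_list = [row.text.split("\n") for row in rows[:12]]  # the 12 rows the original loop covers
    pd.DataFrame(final_list).to_csv(f"game_{game}.csv", index=False, header=False)

Whether a shorter sleep is safe depends on the site; requests_html renders through a headless Chromium instance, which may be part of why threading gives so little improvement here.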

Related

Can't Stop ThreadPoolExecutor

I'm scraping hundreds of URLs, each with a leaderboard of data I want, and the only difference between the URL strings is the 'platform', 'region', and, lastly, the page number. There are only a few platforms and regions, but the page numbers change each day and I don't know how many there are. So that's the first function: I'm just creating lists of URLs to be requested in parallel.
If I use page=1, the result will contain 'table_rows > 0' in the last function. But around page=500, the requested URL still pings back, very slowly, and then shows an error message, "no leaderboard found", so the last function sees 'table_rows == 0', etc. The problem is that I need to get through to the very last page and I want to do this quickly, hence the ThreadPoolExecutor, but I can't cancel all the threads or processes or whatever once PAGE_LIMIT is tripped. I threw in the executor.shutdown(cancel_futures=True) just to show what I'm looking for. If nobody can help me I'll miserably remove the parallelization and scrape slowly, sadly, one URL at a time...
Thanks
from concurrent.futures import ThreadPoolExecutor
from bs4 import BeautifulSoup
import pandas
import requests
import time  # needed for time.sleep below

PLATFORM = ['xbl', 'psn', 'atvi', 'battlenet']
REGION = ['us', 'ca']
PAGE_LIMIT = True

def leaderboardLister():
    global REGION
    global PLATFORM
    list_url = []
    for region in REGION:
        for platform in PLATFORM:
            for i in range(1, 750):
                list_url.append('https://cod.tracker.gg/warzone/leaderboards/battle-royale/' + platform + '/KdRatio?country=' + region + '&page=' + str(i))
    leaderboardExecutor(list_url, 30)

def leaderboardExecutor(urls, threads):
    global PAGE_LIMIT
    global INTERNET
    if len(urls) > 0:
        with ThreadPoolExecutor(max_workers=threads) as executor:
            while True:
                if PAGE_LIMIT == False:
                    executor.shutdown(cancel_futures=True)
                while INTERNET == False:
                    try:
                        print('bad internet')
                        requests.get("http://google.com")
                        INTERNET = True
                    except:
                        time.sleep(3)
                        print('waited')
                executor.map(scrapeLeaderboardPage, urls)

def scrapeLeaderboardPage(url):
    global PAGE_LIMIT
    checkInternet()  # helper not shown in this snippet
    try:
        page = requests.get(url)
        soup = BeautifulSoup(page.content, features='lxml')
        table_rows = soup.find_all('tr')
        if len(table_rows) == 0:
            PAGE_LIMIT = False
            print(url)
        else:
            pass
        print('success')
    except:
        INTERNET = False

leaderboardLister()
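One way to get the early stop the question asks for (a sketch under my own assumptions, not the poster's code, and it needs Python 3.9+ for cancel_futures): submit the pages individually, share a threading.Event, and shut the pool down as soon as one worker sees an empty leaderboard.

from concurrent.futures import ThreadPoolExecutor, as_completed
import threading
import requests
from bs4 import BeautifulSoup

stop = threading.Event()

def scrape_page(url):
    if stop.is_set():                  # a later page already came back empty
        return None
    page = requests.get(url, timeout=10)
    soup = BeautifulSoup(page.content, features='lxml')
    rows = soup.find_all('tr')
    if not rows:                       # empty leaderboard: we are past the last page
        stop.set()
        return None
    return rows

def scrape_all(urls, threads=30):
    results = []
    with ThreadPoolExecutor(max_workers=threads) as executor:
        futures = [executor.submit(scrape_page, u) for u in urls]
        for f in as_completed(futures):
            rows = f.result()
            if rows:
                results.append(rows)
            if stop.is_set():
                # Drop every queued future that has not started yet (Python 3.9+)
                executor.shutdown(cancel_futures=True)
                break
    return results

Submitting the futures one by one (instead of executor.map) is what makes the cancel possible, since map hides the Future objects.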

Get page loading time with Python

I'm trying to get the time it takes a page to fully load (like the "Finish" figure in Chrome's Dev Tools). I wrote something in Python, but I get really low time results, way less than a second, which is not realistic. This is what I have:
from urllib.request import urlopen
from time import time

class Webpage:
    def __init__(self, pageName, pageUrl):
        self.pageName = pageName
        self.pageUrl = pageUrl

class LoadingDetail:
    def __init__(self, webPage, loadingTimes):  # Getting the webpage object and its loading times
        self.webPage = webPage
        self.loadingTimes = loadingTimes

pages = [
    Webpage("test", "URL"),
    Webpage("test2", "URL"),
    Webpage("test3", "URL"),
    Webpage("test4", "URL"),
]

loadingDeatils = []
for page in pages:  # Going through each page in the array.
    pageLoadTimes = []  # Storing the time it took the page to load.
    for x in range(0, 3):  # Number of times we request each page.
        stream = urlopen(page.pageUrl)
        startTime = time()
        streamRead = stream.read()
        endTime = time()
        stream.close()
        timeToLoad = endTime - startTime  # The time it took to read the whole page.
        pageLoadTimes.append(timeToLoad)
    loadDetails = LoadingDetail(page, pageLoadTimes)
    loadingDeatils.append(loadDetails)
I get results like 0.00011. I searched but only found Selenium-based answers, which I can't use.
Is there a way to do this with Python only? Is Python the right tool for this? I saw answers with JS that seem to be exactly what I was looking for.
I tried it and was able to measure the time with the code below; this might help.
Reference: geeksforgeeks.org/timeit-python-examples/
import timeit
mysetup = "from urllib.request import urlopen"
mycode = '''
urlopen('http://www.python.org')
'''
print(timeit.timeit(setup = mysetup, stmt = mycode, number = 1))
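For comparison, a small sketch that times the connection plus the full body read with time.perf_counter; note that neither this nor the timeit version will match Chrome's "Finish" figure, because a browser also fetches images, CSS and JavaScript and then runs it, which urllib never does.

from urllib.request import urlopen
from time import perf_counter

def time_full_fetch(url, repeats=3):
    """Time connection setup plus reading the whole response body."""
    timings = []
    for _ in range(repeats):
        start = perf_counter()      # start before opening the connection
        with urlopen(url) as resp:
            resp.read()             # read the entire body
        timings.append(perf_counter() - start)
    return timings

# Example:
# print(time_full_fetch("http://www.python.org"))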

Python: Pinging a URL multiple times at once for testing

I have a link that I want to test for robustness, for lack of a better word. What I have is code that pings the URL multiple times, sequentially:
import requests  # needed for requests.get below

# Testing for robustness
for i in range(100000):
    city = 'New York'
    city = '%20'.join(city.split(' '))
    res = requests.get(f'http://example.com/twofishes?query={city}')
    data = res.json()
    geo = data['interpretations'][0]['feature']['geometry']['center']
    print('pinging xtime: %s ' % str(i))
    print(geo['lat'], geo['lng'])
I want to take this code but ping the link, say, 10 or 12 times at once. I don't mind the sequential pinging, but it's not as efficient as pinging multiple times at once. I feel like this is a quick modification, where the for loop comes out and a PULL function goes in?
Here is an example program which should work for this task. Given that I do not want to be blacklisted, I have not actually tested the code to see if it works. Regardless, it should at least be in the ballpark of what you're looking for. If you want all of the threads to actually execute at the same time, I would look into adding events. Hope this helps.
Code
import threading
import requests
import requests.exceptions as exceptions

def stress_test(s):
    for i in range(100000):
        try:
            city = 'New York'
            city = '%20'.join(city.split(' '))
            res = s.get(f'http://example.com/twofishes?query={city}')
            data = res.json()
            geo = data['interpretations'][0]['feature']['geometry']['center']
            print('pinging xtime: %s ' % str(i))
            print(geo['lat'], geo['lng'])
        except (exceptions.ConnectionError, exceptions.HTTPError, exceptions.Timeout):
            pass

if __name__ == '__main__':
    for i in range(1, 12):
        s = requests.session()
        t = threading.Thread(target=stress_test, args=(s,))
        t.start()
    for th in threading.enumerate():
        if th != threading.current_thread():
            th.join()
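A rough sketch of the "events" idea mentioned above (my assumption of what that would look like, with a placeholder URL): every worker blocks on a shared threading.Event, so all of the requests are fired as close to simultaneously as possible.

import threading
import requests

go = threading.Event()

def ping_once(url):
    go.wait()                      # block until the main thread says "go"
    res = requests.get(url, timeout=10)
    print(res.status_code)

if __name__ == '__main__':
    url = 'http://example.com/twofishes?query=New%20York'  # placeholder URL
    workers = [threading.Thread(target=ping_once, args=(url,)) for _ in range(10)]
    for w in workers:
        w.start()
    go.set()                       # release all 10 threads at once
    for w in workers:
        w.join()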

Python: changing a variable to rerun the script in an infinite loop

Basically, I'm trying to take a starting variable and run it through my code.
Then add 1 to the variable, rerun the code, and repeat until I stop it. I have written this up, but it's pretty far off from what I'm trying to accomplish. Am I on the right path or completely off?
CODE -
import requests
from bs4 import BeautifulSoup as bs
import re

start = 1
print("starting number: ", start)
i = 0
number = 500
while i < number:
    url = "https://randomsite.com/{0}".format(start + i)
    try:
        print(int(start) + i)
        response1 = requests.get(url)
        name = re.findall(('analyticsKey":"([^"]+)'), response1.text)
        ids = re.findall(('id":"([^"]+)'), response1.text)
        print(name)
        print("")
        print(ids)
        print("")
    except:
        pass
    i += 1
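For the "keep adding 1 until I stop it" part, a minimal sketch using itertools.count (the URL pattern and regexes are copied from the question; randomsite.com is the question's placeholder):

import itertools
import re
import requests

for n in itertools.count(start=1):       # 1, 2, 3, ... until interrupted with Ctrl+C
    url = f"https://randomsite.com/{n}"
    try:
        text = requests.get(url, timeout=10).text
        names = re.findall(r'analyticsKey":"([^"]+)', text)
        ids = re.findall(r'id":"([^"]+)', text)
        print(n, names, ids)
    except requests.RequestException:    # skip pages that fail and keep counting
        pass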

Simple web crawler very slow

I have built a very simple web crawler to crawl ~100 small JSON files at the URL below. The issue is that the crawler takes more than an hour to complete. I find that hard to understand given how small the JSON files are. Am I doing something fundamentally wrong here?
import json
import requests
from lxml import html

def get_senate_vote(vote):
    URL = 'https://www.govtrack.us/data/congress/113/votes/2013/s%d/data.json' % vote
    response = requests.get(URL)
    json_data = json.loads(response.text)
    return json_data

def get_all_votes():
    all_senate_votes = []
    URL = "http://www.govtrack.us/data/congress/113/votes/2013"
    response = requests.get(URL)
    root = html.fromstring(response.content)
    for a in root.xpath('/html/body/pre/a'):
        link = a.xpath('text()')[0].strip()
        if link[0] == 's':
            vote = int(link[1:-1])
            try:
                vote_json = get_senate_vote(vote)
            except:
                return all_senate_votes
            all_senate_votes.append(vote_json)
    return all_senate_votes

vote_data = get_all_votes()
Here is a rather simple code sample where I've measured the time taken for each call. On my system it's taking an average of 2 seconds per request, and there are 582 pages to visit, so around 19 minutes without printing the JSON to the console. In your case network time plus print time may increase it.
#!/usr/bin/python
import requests
import re
import time

def find_votes():
    r = requests.get("https://www.govtrack.us/data/congress/113/votes/2013/")
    data = r.text
    votes = re.findall(r's\d+', data)
    return votes

def crawl_data(votes):
    print("Total pages: " + str(len(votes)))
    for x in votes:
        url = 'https://www.govtrack.us/data/congress/113/votes/2013/' + x + '/data.json'
        t1 = time.time()
        r = requests.get(url)
        json = r.json()
        print(time.time() - t1)

crawl_data(find_votes())
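Before going fully asynchronous, simply reusing one connection already helps; here is a small sketch of the same loop with requests.Session, which keeps the TCP/TLS connection open between the ~600 calls instead of re-establishing it each time.

import time
import requests

def crawl_with_session(votes):
    timings = []
    with requests.Session() as s:        # one pooled connection reused for every vote
        for x in votes:
            url = 'https://www.govtrack.us/data/congress/113/votes/2013/' + x + '/data.json'
            t1 = time.time()
            s.get(url).json()            # fetch and parse, discarding the result here
            timings.append(time.time() - t1)
    return timings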
If you are using Python 3.x and crawling multiple sites, then for even better performance I warmly suggest the aiohttp module, which implements asynchronous principles.
For example:
import aiohttp
import asyncio

sites = ['url_1', 'url_2']
results = []

def save_response(result):
    site_content = result.result()
    results.append(site_content)

async def crawl_site(site):
    async with aiohttp.ClientSession() as session:
        async with session.get(site) as resp:
            resp = await resp.text()
            return resp

tasks = []
for site in sites:
    task = asyncio.ensure_future(crawl_site(site))
    task.add_done_callback(save_response)
    tasks.append(task)

all_tasks = asyncio.gather(*tasks)

loop = asyncio.get_event_loop()
loop.run_until_complete(all_tasks)
loop.close()

print(results)
For more reading, see the aiohttp documentation.
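As a small variant of the same idea (my sketch, not part of the answer above): on Python 3.7+ the asyncio.run / gather pattern is usually simpler, and sharing one ClientSession across all requests lets aiohttp pool the connections.

import asyncio
import aiohttp

async def fetch(session, url):
    async with session.get(url) as resp:
        return await resp.text()

async def crawl(urls):
    # One shared session so connections are pooled across all requests
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, u) for u in urls))

if __name__ == '__main__':
    pages = asyncio.run(crawl(['url_1', 'url_2']))  # placeholder URLs from the answer
    print(len(pages))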
