First of all, I should mention that I have very little experience in programming, and I am having some trouble with the logic and flow of a general web scraper implemented in Python. I assume that I should use callbacks or similar methods to properly control the process of saving pages from a JavaScript e-book reader. My script does work, but not consistently. If someone could advise me on improvements to this script, that would be great. Thank you.
from seleniumwire import webdriver # selenium-wire's webdriver provides driver.requests
from seleniumwire.utils import decode as sdecode
from selenium.webdriver.support.select import Select
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options # [!]
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
import time
import os.path
opts = Options() # [!]
opts.add_experimental_option('w3c', True) # [!]
capabilities = DesiredCapabilities.CHROME.copy()
driver = webdriver.Chrome(chrome_options=opts, desired_capabilities=capabilities)
url = ' here comes url'
driver.get(url)
def get_requests():
l = []
for rx in driver.requests:
endmark = '&scale=2&rotate=0' # the URL suffix to match; its length must be 17
if rx.url[-17:] == endmark:
l.append(rx.url)
return list(set(l))
def savepages(diff):
newpages = 0
for urlitem in diff:
for request in driver.requests:
if request.url==urlitem:
#print(request.url)
ind = urlitem.find('.jp2&id') # ex. 0012.jp2&id
file_path = directory_path + '\\' + file_name + urlitem[ind-4:ind] + '.jpg'
tik = 0
while tik<10: #waiting for the response body data
try:
tik += 1
data = sdecode(request.response.body, request.response.headers.get('Content-Encoding', 'identity'))
except AttributeError: # no data error
time.sleep(2) # wait for 2 sec for the data
continue
#data = data.decode("utf-8",'ignore')
# sometimes I get this error 'UnboundLocalError: local variable 'data' referenced before assignment'
# I assumed that the following condition will help but it doesn't seem to work consistently
if data:
with open(file_path, 'wb') as outfile:
outfile.write(data) # sometimes I get UnboundLocalError
else: print('no data')
# was the file saved or not
if os.path.exists(file_path):
newpages += 1 # something is wrong with the counting logic: pages + newpages should equal the length of li = get_requests(), but I get more
else:
time.sleep(.5)
return newpages
count = 0 # a counter, should terminate the main delay loop
pages = 0 # counting all saved pages; book pages or images are equivalent, one turn should open 2 new pages/images/requests
oldli = [] #compare to the new list after each delay cycle
turns = 0 #count how many turns have been made or how many times we clicked on the button Next Page
li = get_requests() # get all unique requests of the images/pages, some requests might be still loading, but we manually opened the first page and visually confirmed that there are at least 1 or 3 images/requests
if li: # the program STARTS HERE, first try, there are some requests because we manually opened the first page
# THE MAIN CYCLE should stop when the delay is too long and we turned all the pages of the book
while 2*turns+1<len(li) or count<15: # should terminate the whole program when there is no more images coming
count = 0 #reset counter
success = False #reset success; new pages downloaded successfully
# the main delay counter
# what happens if diff is [] and no success
while True:
count += 1
if count > 14:
print('Time out after more than 10 seconds.')
break
li = get_requests() # in addition, I assume that all requests counting from page 1 will be kept
# it is possible that li will not have some of the old requests and oldli will be longer
# well, I need to keep all old requests in a separate list and then append to it
diff = list(set(li)-set(oldli)) # find new requests after the delay
if diff: # there are some new
npages = savepages(diff) # saves new images and returns the number of them
print('newpages ',npages, ' len diff ', len(diff)) # should be equal
if npages >= len(diff)-1: # we allow one request without a body with data ??
pages += npages # something is not right here: the number of pages sometimes exceeds the length of li
success = True # we call it a success
else:
print('Could not save pages. Newpages ', npages, ' len diff ', len(diff))
for pg in diff:
print(pg) # for debugging purposes
break # in this case you break from the delay cycle
else: time.sleep(2) # if no new requests add 2 sec to the waiting time
if success: # we turn pages in case of successful download, this is bad if we need to catch up
while 2*turns+1 < len(li): # if some of old requests are deleted then the program will stop earlier
# it won't wait for the bodies of requests, there is a problem
driver.find_elements(By.CLASS_NAME, "BRicon.book_right.book_flip_next")[0].click()
turns += 1
time.sleep(3) # I got the impression that this doesn't happen
oldli = li
print('pages ',pages,' length of list ',len(li))
break # we break from the delay cycle since success
time.sleep(2) # the main delay timer;; plus no diff timer = total time
else: print('no requests in the list to process')
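As an aside on the waiting problem described above: selenium-wire ships its own blocking wait, driver.wait_for_request(), which may be a simpler way to pause until an image response has actually arrived than the manual tik/sleep polling loop. The sketch below is illustrative only; the URL pattern, timeout and output path are placeholder assumptions, not values taken from the script above.
```
# Minimal sketch (assumed values throughout) of selenium-wire's built-in wait:
# wait_for_request() blocks until a captured request matching the pattern has
# completed, or raises a TimeoutException after `timeout` seconds.
from seleniumwire import webdriver
from seleniumwire.utils import decode as sdecode

driver = webdriver.Chrome()
driver.get('https://example.com/reader')  # placeholder URL

request = driver.wait_for_request(r'scale=2&rotate=0$', timeout=30)
data = sdecode(request.response.body,
               request.response.headers.get('Content-Encoding', 'identity'))
with open('page_0001.jpg', 'wb') as outfile:  # placeholder file name
    outfile.write(data)
```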
I am trying to scrape products listed on https://www.ethicon.com/. My approach is to:
1. scrape product links from the product list page
2. find all variant link pages from the category pages (from Product Specifications in the sample)
3. extract the relevant details from each variant page
I am testing the move from step 2 to step 3 as of now. I am trying with
this code (full code)
def fetch_productlinks(self, url):
self.driver.get(url)
elems = self.driver.find_elements_by_xpath("//a[@href]")
# print(elems[0].get_attribute("href"))
counter = 0
for elem in elems:
elem_url = elem.get_attribute("href")
if re.match(".*/code/.*", elem_url):
# print(elem_url)
if counter <= 1:
self.extract_imagesandmetadata(elem_url, self.driver)
counter += 1
def extract_imagesandmetadata(self, url, driver):
# driver.find_element_by_tag_name('body').send_keys(Keys.COMMAND + 't')
before_window = driver.window_handles[0]
driver.find_element_by_tag_name('body').send_keys(Keys.COMMAND + 't')
driver.get(url)
after_window = driver.window_handles[1]
driver.switch_to.window(after_window)
print("Crawling ..." + url)
html = driver.page_source
if html:
self.soup = BeautifulSoup(html, 'html.parser')
pimg = self.soup.find('img', {'class': 'img-responsive'})
if pimg:
print(pimg["src"])
self.tempdict["pimg"] = self.img_base_path + pimg["src"]
else:
self.tempdict["pimg"] = ""
ptitle = self.soup.find('h1', {'class': 'eprc-title'})
if ptitle:
print(self.sanitize_text(ptitle.text))
self.tempdict["ptitle"] = self.sanitize_text(ptitle.text)
else:
self.tempdict["ptitle"] = ""
self.tempdict["purl"] = url
self.outdict.append(self.tempdict)
driver.switch_to.window(before_window)
and I am getting the error below:
selenium.common.exceptions.StaleElementReferenceException: Message:
stale element reference: element is not attached to the page document
which I believe I am getting because the webdriver is losing the reference after the function call. I am calling fetch_productlinks from the main function.
What can I do to resolve this?
The basics of Selenium are that you locate an element with one of Selenium's locators. One assumes that once you have the element you can interact with it and life is easy. Typically it's not, because any interaction with the page can cause the DOM to re-render, making your element something that no longer exists. That is a simplified explanation, and it can get way worse, but we will stick with this simple concept for now.
You need to accept that the DOM has the potential to re-render at the most inopportune times. Once you accept this, you start to think in terms of how to recover and what the acceptable limits of recovery are.
Below I have some pseudo code to give you an idea of how to recover.
In the pseudo code I have two classes which semi-mimic your troubled code section.
The first class is called PseudoClass1:
What PseudoClass1 does is simply save your data to memory and, if we manage to receive no errors, move that data to its desired location. This assumes memory usage is not a problem.
The second class is called PseudoClass2:
What PseudoClass2 does is simply save your data to a temporary directory and, if we manage to receive no errors, move that data to its desired location. This assumes memory usage is a problem, so we write to a temp location and then move it to the final destination.
import os
import re
import tempfile
from selenium import webdriver
from selenium.common.exceptions import StaleElementReferenceException
# We are assuming the webdriver is right next to this file
CHROME_DRIVER_PATH = os.path.join(os.path.dirname(os.path.abspath(__file__)), 'chromedriver')
class PseudoClass1:
def __init__(self):
self.driver = webdriver.Chrome(executable_path=CHROME_DRIVER_PATH)
self.cache = {}
def save_images_plus_meta_into_memory(self, url):
"""Does some work and saves whatever you want into self.cache"""
def move_images_plus_meta_into_desired_location(self):
"""Take whatever has been safely processed in self.cache and move it to the desired location"""
def fetch_productlinks_saved_in_memory(self, url):
"""If you have no memory constraints save the values into memory"""
self.driver.get(url)
retries = 1
max_retries = 10
while retries <= max_retries:
self.cache = {}
retries += 1
try:
elems = self.driver.find_elements_by_xpath("//a[@href]")
counter = 0
for elem in elems:
elem_url = elem.get_attribute("href")
if re.match(".*/code/.*", elem_url):
if counter <= 1:
self.save_images_plus_meta_into_memory(elem_url)
counter += 1
# If we are here, we did not get a stale element reference
# You can now transfer data to the desired place
self.move_images_plus_meta_into_desired_location()
return # Exit out of the retry loop
except StaleElementReferenceException:
print(f'Retry count: {retries}')
continue
raise Exception('Failed to fetch product links')
class PseudoClass2:
def __init__(self):
self.driver = webdriver.Chrome(executable_path=CHROME_DRIVER_PATH)
def save_images_plus_meta_into_a_temp_dir(self, url, temp_dir):
"""Does some work and saves whatever you want into the temp_dir location"""
def move_images_plus_meta_into_desired_location(self, temp_dir):
"""Take whatever has been safely processed in temp_dir and move it to the desired location"""
def fetch_productlinks_saved_in_a_temp_directory(self, url):
self.driver.get(url)
retries = 1
max_retries = 10
while retries <= max_retries:
retries += 1
try:
# creates a temporary directory to write stuff to
with tempfile.TemporaryDirectory() as temp_dir:
elems = self.driver.find_elements_by_xpath("//a[@href]")
counter = 0
for elem in elems:
elem_url = elem.get_attribute("href")
if re.match(".*/code/.*", elem_url):
if counter <= 1:
self.save_images_plus_meta_into_a_temp_dir(elem_url, temp_dir)
counter += 1
# If we are here, we did not get a stale element reference
# You can now transfer data to the desired place
self.move_images_plus_meta_into_desired_location(temp_dir)
return # Exit out of the retry loop
except StaleElementReferenceException:
print(f'Retry count: {retries}')
continue
raise Exception('Failed to fetch product links')
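The two methods above are deliberately left as stubs. As a rough illustration (not part of the original answer), the PseudoClass2 variant might fill them in along these lines, assuming each page is written to temp_dir as a file and that final_dir is wherever the verified output should live; the names and file layout here are assumptions.
```
import os
import shutil

# Hypothetical fill-ins for PseudoClass2's stubs. File layout, names and the
# final_dir default are assumptions made for illustration only.
def save_page_into_temp_dir(page_bytes: bytes, file_name: str, temp_dir: str) -> None:
    """Write one downloaded page into the temporary directory."""
    with open(os.path.join(temp_dir, file_name), 'wb') as fh:
        fh.write(page_bytes)

def move_temp_dir_into_desired_location(temp_dir: str, final_dir: str = 'output') -> None:
    """Move everything that survived the retry loop into its final home."""
    os.makedirs(final_dir, exist_ok=True)
    for name in os.listdir(temp_dir):
        shutil.move(os.path.join(temp_dir, name), os.path.join(final_dir, name))
```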
I'm scraping hundreds of URLs, each with a leaderboard of data I want, and the only difference between the URL strings is the 'platform', the 'region', and lastly the page number. There are only a few platforms and regions, but the page numbers change each day and I don't know how many there are. So that's the first function: I'm just creating lists of URLs to be requested in parallel.
If I use page=1, then the result will contain table_rows > 0 in the last function. But around page=500, the requested URL still pings back, only very slowly, and then it shows an error message, 'no leaderboard found', and the last function sees table_rows == 0, etc. The problem is that I need to get through to the very last page, and I want to do it quickly, hence the ThreadPoolExecutor, but I can't cancel all the threads or processes or whatever once PAGE_LIMIT is tripped. I threw in the executor.shutdown(cancel_futures=True) just to show the kind of thing I'm looking for. If nobody can help me, I'll miserably remove the parallelization and scrape slowly, sadly, one URL at a time...
Thanks
import time
from concurrent.futures import ThreadPoolExecutor
from bs4 import BeautifulSoup
import pandas
import requests
PLATFORM = ['xbl', 'psn', 'atvi', 'battlenet']
REGION = ['us', 'ca']
PAGE_LIMIT = True
def leaderboardLister():
global REGION
global PLATFORM
list_url = []
for region in REGION:
for platform in PLATFORM:
for i in range(1,750):
list_url.append('https://cod.tracker.gg/warzone/leaderboards/battle-royale/' + platform + '/KdRatio?country=' + region + '&page=' + str(i))
leaderboardExecutor(list_url,30)
def leaderboardExecutor(urls,threads):
global PAGE_LIMIT
global INTERNET
if len(urls) > 0:
with ThreadPoolExecutor(max_workers=threads) as executor:
while True:
if PAGE_LIMIT == False:
executor.shutdown(cancel_futures=True)
while INTERNET == False:
try:
print('bad internet')
requests.get("http://google.com")
INTERNET = True
except:
time.sleep(3)
print('waited')
executor.map(scrapeLeaderboardPage, urls)
def scrapeLeaderboardPage(url):
global PAGE_LIMIT
checkInternet()
try:
page = requests.get(url)
soup = BeautifulSoup(page.content,features = 'lxml')
table_rows = soup.find_all('tr')
if len(table_rows) == 0:
PAGE_LIMIT = False
print(url)
else:
pass
print('success')
except:
INTERNET = False
leaderboardLister()
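One way to get the cancellation asked about above (shown here only as a hedged sketch, since the exact integration depends on the rest of the script) is to submit() each URL individually so you hold on to the Future objects, flip a shared threading.Event when an empty leaderboard page is seen, and cancel whatever has not started yet. The 'empty page' test mirrors the question's code; the Event wiring and the helper names are assumptions.
```
import threading
from concurrent.futures import ThreadPoolExecutor

import requests
from bs4 import BeautifulSoup

stop_event = threading.Event()  # set once a page past the leaderboard's end is seen

def scrape_page(url):
    if stop_event.is_set():          # a queued task that started after the limit was hit
        return
    page = requests.get(url, timeout=30)
    soup = BeautifulSoup(page.content, features='lxml')
    if not soup.find_all('tr'):      # no table rows -> past the last page
        stop_event.set()

def scrape_all(urls, threads=30):
    with ThreadPoolExecutor(max_workers=threads) as executor:
        futures = [executor.submit(scrape_page, url) for url in urls]
        for future in futures:
            if stop_event.is_set():
                for pending in futures:
                    pending.cancel()  # only cancels futures that have not started yet
                break
            future.result()           # surfaces any exception raised while scraping
```
On Python 3.9+, calling executor.shutdown(cancel_futures=True) once the stop flag is set achieves much the same thing, but it likewise only cancels futures that have not started running.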
I am a student working on a scraping project and I am having trouble completing my script because it fills my computer's memory with all of the data it stores.
It currently stores all of my data until the end. My solution would be to break the scrape up into smaller bits and write the data out periodically, so it does not just keep building one big list and write everything out at the end.
In order to do this, I would need to stop my scroll method, scrape the loaded profiles, write out the data that I have collected, and then repeat this process without duplicating my data. It would be appreciated if someone could show me how to do this. Thank you for your help :)
Here's my current code:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from time import sleep
from selenium.common.exceptions import NoSuchElementException
Data = []
driver = webdriver.Chrome()
driver.get("https://directory.bcsp.org/")
count = int(input("Number of Pages to Scrape: "))
body = driver.find_element_by_xpath("//body")
profile_count = driver.find_elements_by_xpath("//div[@align='right']/a")
while len(profile_count) < count: # Get links up to "count"
body.send_keys(Keys.END)
sleep(1)
profile_count = driver.find_elements_by_xpath("//div[@align='right']/a")
for link in profile_count: # Calling up links
temp = link.get_attribute('href') # the profile link to open in a new tab
driver.execute_script("window.open('');") # open new tab
driver.switch_to.window(driver.window_handles[1]) # focus new tab
driver.get(temp)
# scrape code
Name = driver.find_element_by_xpath('/html/body/table/tbody/tr/td/table/tbody/tr/td[5]/div/table[1]/tbody/tr/td[1]/div[2]/div').text
IssuedBy = "Board of Certified Safety Professionals"
CertificationorDesignaationNumber = driver.find_element_by_xpath('/html/body/table/tbody/tr/td/table/tbody/tr/td[5]/div/table[1]/tbody/tr/td[3]/table/tbody/tr[1]/td[3]/div[2]').text
CertfiedorDesignatedSince = driver.find_element_by_xpath('/html/body/table/tbody/tr/td/table/tbody/tr/td[5]/div/table[1]/tbody/tr/td[3]/table/tbody/tr[3]/td[1]/div[2]').text
try:
AccreditedBy = driver.find_element_by_xpath('/html/body/table/tbody/tr/td/table/tbody/tr/td[5]/div/table[1]/tbody/tr/td[3]/table/tbody/tr[5]/td[3]/div[2]/a').text
except NoSuchElementException:
AccreditedBy = "N/A"
try:
Expires = driver.find_element_by_xpath('/html/body/table/tbody/tr/td/table/tbody/tr/td[5]/div/table[1]/tbody/tr/td[3]/table/tbody/tr[5]/td[1]/div[2]').text
except NoSuchElementException:
Expires = "N/A"
info = Name, IssuedBy, CertificationorDesignaationNumber, CertfiedorDesignatedSince, AccreditedBy, Expires + "\n"
Data.extend(info)
driver.close()
driver.switch_to.window(driver.window_handles[0])
with open("Spredsheet.txt", "w") as output:
output.write(','.join(Data))
driver.close()
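For the chunked write-out described above, a rough sketch (the file name, chunk size and field names are assumptions) is to keep a small buffer plus a set of already-visited profile links, flush the buffer to disk in append mode every so often, and clear it; the set stops a later scroll pass from re-scraping the same profiles.
```
import csv

CHUNK = 50  # assumed flush interval
FIELDS = ['Name', 'IssuedBy', 'Number', 'Since', 'AccreditedBy', 'Expires']

def flush(rows, path='Spreadsheet.csv', write_header=False):
    """Append the buffered rows to disk, then empty the buffer."""
    with open(path, 'a', newline='') as fh:
        writer = csv.writer(fh)
        if write_header:
            writer.writerow(FIELDS)
        writer.writerows(rows)
    rows.clear()

seen_links = set()  # profile hrefs scraped in earlier passes
buffer = []         # rows waiting to be written

# Inside the per-profile loop of the script above, the idea would be:
#     if temp in seen_links:
#         continue
#     seen_links.add(temp)
#     buffer.append([Name, IssuedBy, ...])
#     if len(buffer) >= CHUNK:
#         flush(buffer)
# and, once scraping finishes, a final flush(buffer) for the leftovers.
```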
Try the below approach using requests and BeautifulSoup. In the below script I have used the API URL fetched from the website itself.
First it will create the URL for the first iteration, add the headers, and put the data in the .csv file.
On the second and later iterations it will create the URL again with 2 extra params, start_on_page and show_per_page, where start_on_page is incremented by 20 on each iteration and show_per_page is set to 100 to extract 100 records per iteration, and so on until all the data has been dumped into the .csv file.
The script dumps 4 things: number, name, location and profile URL.
On each iteration the data is appended to the .csv file, so your memory issue is resolved by this approach.
Do not forget to set the file_path variable to the directory where you want the .csv file created before running the script.
import requests
from urllib3.exceptions import InsecureRequestWarning
requests.packages.urllib3.disable_warnings(InsecureRequestWarning)
from bs4 import BeautifulSoup as bs
import csv
def scrap_directory_data():
list_of_credentials = []
file_path = ''
file_name = 'credential_list.csv'
count = 0
page_number = 0
page_size = 100
create_url = ''
main_url = 'https://directory.bcsp.org/search_results.php?'
first_iteration_url = 'first_name=&last_name=&city=&state=&country=&certification=&unauthorized=0&retired=0&specialties=&industries='
number_of_records = 0
csv_headers = ['#','Name','Location','Profile URL']
while True:
if count == 0:
create_url = main_url + first_iteration_url
print('-' * 100)
print('1 iteration URL created: ' + create_url)
print('-' * 100)
else:
create_url = main_url + 'start_on_page=' + str(page_number) + '&show_per_page=' + str(page_size) + '&' + first_iteration_url
print('-' * 100)
print('Other then first iteration URL created: ' + create_url)
print('-' * 100)
page = requests.get(create_url,verify=False)
extracted_text = bs(page.text, 'lxml')
result = extracted_text.find_all('tr')
if len(result) > 0:
for idx, data in enumerate(result):
if idx > 0:
number_of_records +=1
name = data.contents[1].text
location = data.contents[3].text
profile_url = data.contents[5].contents[0].attrs['href']
list_of_credentials.append({
'#':number_of_records,
'Name':name,
'Location': location,
'Profile URL': profile_url
})
print(data)
with open(file_path + file_name ,'a+') as cred_CSV:
csvwriter = csv.DictWriter(cred_CSV, delimiter=',',lineterminator='\n',fieldnames=csv_headers)
if idx == 0 and count == 0:
print('Writing CSV header now...')
csvwriter.writeheader()
else:
for item in list_of_credentials:
print('Writing data rows now..')
print(item)
csvwriter.writerow(item)
list_of_credentials = []
count +=1
page_number +=20
scrap_directory_data()
I encounter an 'index out of range' error when I try to get the number of contributors of a GitHub project in a loop. After some iterations (which work perfectly) it just throws that exception, and I have no clue why...
for x in range(100):
r = requests.get('https://github.com/tipsy/profile-summary-for-github')
xpath = '//span[contains(@class, "num") and following-sibling::text()[normalize-space()="contributors"]]/text()'
contributors_number = int(html.fromstring(r.text).xpath(xpath)[0].strip().replace(',', ''))
print(contributors_number) # prints the correct number until the exception
Here's the exception.
----> 4 contributors_number = int(html.fromstring(r.text).xpath(xpath)[0].strip().replace(',', ''))
IndexError: list index out of range
It seems likely that you're getting a 429 Too Many Requests, since you're firing requests one after the other.
You might want to modify your code as such:
import time
for index in range(100):
r = requests.get('https://github.com/tipsy/profile-summary-for-github')
xpath = '//span[contains(@class, "num") and following-sibling::text()[normalize-space()="contributors"]]/text()'
contributors_number = int(html.fromstring(r.text).xpath(xpath)[0].strip().replace(',', ''))
print(contributors_number)
time.sleep(3) # Wait a bit before firing off another request
Better yet would be:
import time
for index in range(100):
r = requests.get('https://github.com/tipsy/profile-summary-for-github')
if r.status_code in [200]: # Check if the request was successful
xpath = '//span[contains(@class, "num") and following-sibling::text()[normalize-space()="contributors"]]/text()'
contributors_number = int(html.fromstring(r.text).xpath(xpath)[0].strip().replace(',', ''))
print(contributors_number)
else:
print("Failed fetching page, status code: " + str(r.status_code))
time.sleep(3) # Wait a bit before firing off another request
Now this works perfectly for me when using the API. It's probably the cleanest way of doing it.
import requests
import json
url = 'https://api.github.com/repos/valentinxxx/nginxconfig.io/commits?&per_page=100'
response = requests.get(url)
commits = json.loads(response.text)
commits_total = len(commits)
page_number = 1
while(len(commits) == 100):
page_number += 1
url = 'https://api.github.com/repos/valentinxxx/nginxconfig.io/commits?&per_page=100'+'&page='+str(page_number)
response = requests.get(url)
commits = json.loads(response.text)
commits_total += len(commits)
GitHub is blocking your repeated requests. Do not scrape sites in quick succession; many website operators actively block clients that make too many requests. As a result, the content that is returned no longer matches your XPath query.
You should be using the REST API that GitHub provides to retrieve project stats like the number of contributors, and you should implement some kind of rate limiting. There is no need to retrieve the same number 100 times, contributor counts do not change that rapidly.
API responses include information on how many requests you can make in a time window, and you can use conditional requests to only incur rate limit costs when the data actually has changed:
import requests
import time
from urllib.parse import parse_qsl, urlparse
owner, repo = 'tipsy', 'profile-summary-for-github'
github_username = '....'
# token = '....' # optional Github basic auth token
stats = 'https://api.github.com/repos/{}/{}/contributors'
with requests.session() as sess:
# GitHub asks that you identify yourself with your username or app name in the User-Agent header
sess.headers['User-Agent'] += ' - {}'.format(github_username)
# Consider logging in! You'll get more quota
# sess.auth = (github_username, token)
# start with the first, move to the last when available, include anonymous
last_page = stats.format(owner, repo) + '?per_page=100&page=1&anon=true'
while True:
r = sess.get(last_page)
if r.status_code == requests.codes.not_found:
print("No such repo")
break
if r.status_code == requests.codes.no_content:
print("No contributors, repository is empty")
break
if r.status_code == requests.codes.accepted:
print("Stats not yet ready, retrying")
elif r.status_code == requests.codes.not_modified:
print("Stats not changed")
elif r.ok:
# success! Check for a last page, get that instead of current
# to get accurate count
link_last = r.links.get('last', {}).get('url')
if link_last and r.url != link_last:
last_page = link_last
else:
# this is the last page, report on count
params = dict(parse_qsl(urlparse(r.url).query))
page_num = int(params.get('page', '1'))
per_page = int(params.get('per_page', '100'))
contributor_count = len(r.json()) + (per_page * (page_num - 1))
print("Contributor count:", contributor_count)
# only get us a fresh response next time
sess.headers['If-None-Match'] = r.headers['ETag']
# pace ourselves following the rate limit
window_remaining = int(r.headers['X-RateLimit-Reset']) - time.time()
rate_remaining = int(r.headers['X-RateLimit-Remaining'])
# sleep long enough to honour the rate limit or at least 100 milliseconds
time.sleep(max(window_remaining / rate_remaining, 0.1))
The above uses a requests session object to handle repeated headers and ensure that you get to reuse connections where possible.
A good library such as github3.py (incidentally written by a requests core contributor) will take care of most of those details for you.
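For completeness, a sketch of the github3.py route; the calls below (github3.login() / github3.GitHub() and Repository.contributors()) are recalled from that library's documented API and should be double-checked against the version you install.
```
import github3

# Anonymous access works but has a much lower rate limit; a token is better.
gh = github3.login(token='....')  # or github3.GitHub() for anonymous access
repo = gh.repository('tipsy', 'profile-summary-for-github')
contributor_count = sum(1 for _ in repo.contributors(anon=True))
print('Contributor count:', contributor_count)
```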
If you do want to persist in scraping the site directly, you take the risk that the site operators block you altogether. Try to take some responsibility by not hammering the site continually.
That means that at the very least, you should honour the Retry-After header that GitHub gives you on 429:
if not r.ok:
print("Received a response other that 200 OK:", r.status_code, r.reason)
retry_after = r.headers.get('Retry-After')
if retry_after is not None:
print("Response included a Retry-After:", retry_after)
time.sleep(int(retry_after))
else:
# parse OK response
I'm trying to build a web crawler using BeautifulSoup and urllib. The crawler is working, but it does not open all the pages on a site. It opens the first link and goes to that link, opens the first link of that page, and so on.
Here's my code:
from bs4 import BeautifulSoup
from urllib.request import urlopen
from urllib.parse import urljoin
import json, sys
sys.setrecursionlimit(10000)
url = input('enter url ')
d = {}
d_2 = {}
l = []
url_base = url
count = 0
def f(url):
global count
global url_base
if count <= 100:
print("count: " + str(count))
print('now looking into: '+url+'\n')
count += 1
l.append(url)
html = urlopen(url).read()
soup = BeautifulSoup(html, "html.parser")
d[count] = soup
tags = soup('a')
for tag in tags:
meow = tag.get('href',None)
if (urljoin(url, meow) in l):
print("Skipping this one: " + urljoin(url,meow))
elif "mailto" in urljoin(url,meow):
print("Skipping this one with a mailer")
elif meow == None:
print("skipping 'None'")
elif meow.startswith('http') == False:
f(urljoin(url, meow))
else:
f(meow)
else:
return
f(url)
print('\n\n\n\n\n')
print('Scraping Completed')
print('\n\n\n\n\n')
The reason you're seeing this behavior lies in where the code recursively calls your function. As soon as the code finds a valid link, the function f gets called again, preventing the rest of the for loop from running until it returns.
What you're doing is a depth first search, but the internet is very deep. You want to do a breadth first search instead.
Probably the easiest way to modify your code to do that is to keep a global list of links to follow: have the for loop append all the scraped links to the end of this list, and then, outside of the for loop, remove the first element of the list and follow that link (see the sketch below).
You may have to change your logic slightly for your max count.
If count reaches 100, no further links will be opened. Therefore I think you should decrease count by one after leaving the for loop. If you do this, count would be something like the current link depth (and 100 would be the maximum link depth).
If the variable count should refer to the number of opened links, then you might want to control the link depth in another way.
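To make the description above concrete, here is a hedged sketch of the breadth-first version: a queue of (url, depth) pairs replaces the recursion, a visited set replaces the global list l, and MAX_PAGES / MAX_DEPTH stand in for the original count logic. Both limits, and the exact filtering rules, are illustrative choices rather than the only way to do it.
```
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen

from bs4 import BeautifulSoup

MAX_PAGES = 100  # cap on pages fetched, like the original count check
MAX_DEPTH = 5    # illustrative cap on link depth

def crawl(start_url):
    visited = set()
    queue = deque([(start_url, 0)])
    while queue and len(visited) < MAX_PAGES:
        url, depth = queue.popleft()           # FIFO -> breadth first
        if url in visited or depth > MAX_DEPTH:
            continue
        print('now looking into:', url)
        visited.add(url)
        soup = BeautifulSoup(urlopen(url).read(), 'html.parser')
        for tag in soup('a'):
            href = tag.get('href')
            if href is None or 'mailto' in href:
                continue
            link = urljoin(url, href)
            if link not in visited:
                queue.append((link, depth + 1))

crawl(input('enter url '))
```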