Already complete scraping scrapes everything on the page. I would like to limit the scraping to only a certain section

Already complete scraping scrapes everything on the page. I would like to limit the scraping to only a certain section - python

I placed the code of a complete and properly functioning scraping that I own. Successfully scrapes all elements on the page.
However, I would like to scrape only a small limited section of the page with the same elements as scraping. This limited section is already scraped correctly along with all elements of the page, but I would like to scrape only it and not "all + it". The link is here
There are 4 tables on the page, but I would like to scrape just one, that is the table called "Programma", ie the html section "event-summary event" or "leagues-static event-summary-leagues ". But of this section only the elements of the last round (Matchday 14). Matchday 14 only. No round 15. So obviously that with each update of the page rounds, the last round is always scraped as well.
So I would need to insert something that makes scraping understand to download only the elements (which it already owns and scrapes) of of that section and the last round.
The code is already complete and works fine, so I'm not looking for code services, but for a little hint to tell me how to limit the scraping to just the section mentioned above. Scraping is in Selenium. I would like to stick with Selenium and my code as it is already functional and complete. Thanks
import selenium
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Firefox()
driver.get("url")
driver.implicitly_wait(12)
#driver.minimize_window()
wait = WebDriverWait(driver, 10)
all_rows = driver.find_elements(By.CSS_SELECTOR, "div[class^='event__round'],div[class^='event__match']")
current_round = '?'
for bundesliga in all_rows:
classes = bundesliga.get_attribute('class')
#print(classes)
if 'event__round' in classes:
#round = row.find_elements(By.CSS_SELECTOR, "[class^='event__round event__round--static']")
#current_round = row.text # full text `Round 20`
current_round = bundesliga.text.split(" ")[-1] # only `20` without `Round`
else:
datetime = bundesliga.find_element(By.CSS_SELECTOR, "[class^='event__time']")
#Divide la data e l'ora
date, time = datetime.text.split(" ")
date = date.rstrip('.') # right-strip to remove `.` at the end of date
team_home = bundesliga.find_element(By.CSS_SELECTOR, "[class^='event__participant event__participant--home']")
team_away = bundesliga.find_element(By.CSS_SELECTOR, "[class^='event__participant event__participant--away']")
score_home = bundesliga.find_element(By.CSS_SELECTOR, "[class^='event__score event__score--home']")
score_away = bundesliga.find_element(By.CSS_SELECTOR, "[class^='event__score event__score--away']")
bundesliga = [current_round, date, time, team_home.text, team_away.text, score_home.text, score_away.text]
bundesliga.append(bundesliga)
print(bundesliga)

I think all you need to do is limit all_rows variable. One way to do this is finding the tab you are looking for with text and then getting the parent elements.
import selenium
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException, NoSuchElementException
driver = webdriver.Firefox()
driver.get("https://www.someurl/some/other/page")
driver.implicitly_wait(12)
#driver.minimize_window()
wait = WebDriverWait(driver, 10)
# all_rows = driver.find_elements(By.CSS_SELECTOR, "div[class^='event__round'],div[class^='event__match']")
############### UPDATE ####################
def parent_element(element):
return element.find_element(By.XPATH, './..')
programma_element = WebDriverWait(driver, 10).until(
EC.visibility_of_element_located((By.XPATH, "//div[text()='Programma']")))
programma_element_p1 = parent_element(programma_element)
programma_element_p2 = parent_element(programma_element_p1)
programma_element_p3 = parent_element(programma_element_p2)
all_rows = programma_element_p3.find_elements(By.CSS_SELECTOR, "div[class^='event__round'],div[class^='event__match']")
filter_rows = []
for row in all_rows:
if "event__match--last" in row.get_attribute('class'):
filter_rows.append(row)
break
else:
filter_rows.append(row)
############### UPDATE ####################
current_round = '?'
for bundesliga in filter_rows:
classes = bundesliga.get_attribute('class')
#print(classes)
if 'event__round' in classes:
#round = row.find_elements(By.CSS_SELECTOR, "[class^='event__round event__round--static']")
#current_round = row.text # full text `Round 20`
current_round = bundesliga.text.split(" ")[-1] # only `20` without `Round`
else:
datetime = bundesliga.find_element(By.CSS_SELECTOR, "[class^='event__time']")
#Divide la data e l'ora
date, time = datetime.text.split(" ")
date = date.rstrip('.') # right-strip to remove `.` at the end of date
team_home = bundesliga.find_element(By.CSS_SELECTOR, "[class^='event__participant event__participant--home']")
team_away = bundesliga.find_element(By.CSS_SELECTOR, "[class^='event__participant event__participant--away']")
# score_home = bundesliga.find_element(By.CSS_SELECTOR, "[class^='event__score event__score--home']")
# score_away = bundesliga.find_element(By.CSS_SELECTOR, "[class^='event__score event__score--away']")
try:
score_home = bundesliga.find_element(By.CSS_SELECTOR, "[class^='event__score event__score--home']")
except (TimeoutException, NoSuchElementException):
MyObject = type('MyObject', (object,), {})
score_home = MyObject()
score_home.text = "-"
try:
score_away = bundesliga.find_element(By.CSS_SELECTOR, "[class^='event__score event__score--away']")
except (TimeoutException, NoSuchElementException):
MyObject = type('MyObject', (object,), {})
score_away = MyObject()
score_away.text = "-"
bundesliga = [current_round, date, time, team_home.text, team_away.text, score_home.text, score_away.text]
bundesliga.append(bundesliga)
print(bundesliga)

Related

Scraped data is not saving to csv file as it keeps returning a blank csv file

My scraper is calling the website and hitting each of the 44 pages and creating a csv file but the csv file is empty. I am returning after each of the functions and saving the data to a csv at the end of the scraper.
Can anyone see what is wrong with my code?
Code:
import pandas,requests,bs4,time
from seleniumwire import webdriver
from webdriver_manager.firefox import GeckoDriverManager
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import datetime
TODAY = datetime.datetime.today().strftime("%Y%m%d")
SAVE_FILENAME = "/Users/180284/jupyter-1.0.0/pssi_jobs-"+TODAY+".csv"
driver = webdriver.Chrome('~/Desktop/chromedriver_mac64')
driver.implicitly_wait(30)
URL_BASE = "https://jobs.pssi.com/us/en/search-resultskeywords=%22food%20safety%20team%20member%22&s=1"
MAX_PAGE = 44
HEADERS = {
'From': 'myemail'
}
def interceptor(request):
del request.headers['From']
request.headers['From'] = HEADERS["From"]
driver.request_interceptor = interceptor
def parse_job_post_div(div_html):
soup = bs4.BeautifulSoup(div_html)
job_ls = soup.findAll("div",{"class":"information"})
job_data = []
for job in job_ls:
job_listing = job.find("div",{"class":"information"}).get_text(separator=", ").strip()
title = job.find("span",{"role":"heading"}).get_text(separator=", ").strip()
job_location = job.find("p",{"class":"job-info"}).get_text(separator=", ").strip()
new_row = {"job_listing":job,"title":title,"job_location":job_location}
job_data.append(new_row)
return job_data
def get_data(wd):
job_postings = driver.find_element(By.CLASS_NAME, "information")
html = job_postings.get_attribute("innerHTML")
parsed = parse_job_post_div(html)
return pandas.DataFrame(parsed)
def process_page(url):
driver.get(url)
master_data = []
i = 0
while True:
df = get_data(driver)
master_data.append(df)
if i == (MAX_PAGE - 1):
break
driver.find_element(By.XPATH, "//span[#class='icon icon-arrow-right']").click()
time.sleep(10)
print(i)
i+=1
return pandas.concat(master_data,ignore_index=True)
data = process_page(URL_BASE)
data.to_csv(SAVE_FILENAME)
`
I have tried the above code.

The first problem I found in your code is that the job_ls is an empty list, i.e. soup.findAll("div",{"class":"information"}) doesn't find anything.
Moreover, job_postings contains only one webelement (i.e. the first job of the list) instead of all 10 jobs shown in the page, that's because you used .find_element instead of .find_elements. As a result of these and other problems, process_page(URL_BASE) returns an empty dataframe.
In this case you can speed up the process and use less code using directly selenium instead of bs4
driver.get(URL_BASE)
driver.implicitly_wait(30)
MAX_PAGE = 4
titles, locations, descriptions = [], [], []
for i in range(MAX_PAGE):
print('current page:',i+1,end='\r')
titles += [title.text for title in driver.find_elements(By.CSS_SELECTOR, '.information > span[role=heading]')]
locations += [loc.text.replace('\n',', ') for loc in driver.find_elements(By.CSS_SELECTOR, '.information > p[class=job-info]')]
descriptions += [title.text for title in driver.find_elements(By.CSS_SELECTOR, '.information > p[data-ph-at-id=jobdescription-text')]
if i < MAX_PAGE-1:
driver.find_element(By.XPATH, "//span[#class='icon icon-arrow-right']").click()
else:
break
df = pandas.DataFrame({'title':titles,'location':locations,'description':descriptions})
df.to_csv(SAVE_FILENAME, index=False)
and df will be something like

Webscraping Multiple Pages in Python with Selenium - loop not working

I'm quite new to python and have written a script using selenium to scrape a website. I've tried everything but can't get the loop to cycle through pages. It currently just repeats the data on the first page 5 times. I want to scrape all the pages for 'BR1' any help would be great, currently the script below only scrapes the first page 5 times.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
with open('rightmove.csv', 'w') as file:
file.write('PropertyCardcontent \n')
PATH = ("/usr/local/bin/chromedriver")
driver = webdriver.Chrome(PATH)
driver.get("https://www.rightmove.co.uk/house-prices.html")
print(driver.title)
elem = driver.find_element(By.NAME, 'searchLocation') # Find the search box
elem.send_keys('BR1' + Keys.RETURN)
try:
content = WebDriverWait(driver, 15).until(
EC.presence_of_element_located((By.ID,'content'))
)
finally:
time.sleep(3)
for p in range(5):
sold = content.find_elements(By.CLASS_NAME, 'sold-prices-content-wrapper ')
for solds in sold:
address = solds.find_elements(By.CLASS_NAME, 'sold-prices-content ')
for addresses in address:
result = addresses.find_elements(By.CLASS_NAME, 'results ')
for results in result:
card = results.find_elements(By.CLASS_NAME,'propertyCard')
for propertyCard in card:
header = propertyCard.find_elements(By.CLASS_NAME,'propertyCard-content')
for propertyCardcontent in header:
road = propertyCardcontent.find_elements(By.CLASS_NAME,'title')
for propertyCardcontent in header:
road = propertyCardcontent.find_elements(By.CLASS_NAME,'subTitle')
for subtitle in road:
bed = subtitle.find_elements(By.CLASS_NAME, 'propertyType')
with open('rightmove.csv', 'a') as file:
for i in range(len(result)):
file.write(header[i].text + '\n')
button = driver.find_element(By.XPATH, '//*[#id="content"]/div[2]/div[2]/div[4]/div[27]/div[3]/div')
button.click()
file.close()
time.sleep(3)
driver.quit()

Since the website link has page number on it, I recommend you put the base url as "https://www.rightmove.co.uk/house-prices/br1.html?page=1", and loop through the pages while changing the last index of the url with methods like format string.
One other thing, you don't need to implement all those for loops, you can simply assign each variable to its specific value since everything you need is inside an html block which is easy to navigate on it.
Update:
I'm sorry for being late, had unexpected stuff(...).
I've made some changes as I use Brave, so make sure you select your browser, Chrome I believe, the chromedriver(ver:102) stays the same (or depending your Chrome version).
I've also got the Price and Date and stored them in a tuple.
Every record is stored in a list[Title, propertyType, tupleof(Price_Date)]
At the end, it creates a csv and stores everything inside with a ";" as delimter.
You can if you prefer split the price and date for later use, up to you.
Note: This looping method only applies to websites in which the number of page is included within the URL. In this case, both the key and number of page is included in the URL.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
import time
import random
import itertools
options = Options()
options.binary_location = r'C:\Program Files\BraveSoftware\Brave-Browser\Application\brave.exe'
driver = webdriver.Chrome(options = options, service = Service("chromedriver.exe"))
key_word = "BR1".lower()
base_url = f"https://www.rightmove.co.uk/house-prices/{key_word}.html?page=1"
driver.get(base_url)
#Number of pages
pages = driver.find_element(By.XPATH, '//span[#class="pagination-label"][2]').text
pages = int(pages.strip('of'))
WebDriverWait(driver, 15).until(
EC.presence_of_element_located((By.CLASS_NAME, 'results '))
)
data = []
pc = 0
for p in range(1,pages+1):
driver.get(f"https://www.rightmove.co.uk/house-prices/{key_word}.html?page={p}")
WebDriverWait(driver, 10).until(
EC.presence_of_element_located((By.XPATH, '//div//div[#class="propertyCard"]'))
)
propertyCards = driver.find_elements(By.XPATH, '//div//div[#class="propertyCard"]')
for propertyCard in propertyCards:
title = propertyCard.find_element(By.CLASS_NAME, 'title').text
propertyType = propertyCard.find_element(By.CLASS_NAME, 'propertyType').text
price_list = propertyCard.find_elements(By.CLASS_NAME, 'price')
date_list = propertyCard.find_elements(By.CLASS_NAME, 'date-sold')
data.append([title,propertyType])
for p, d in itertools.zip_longest(price_list, date_list , fillvalue = None):
try:
price = p.text
date = d.text
data[pc].append((price, date))
except Exception as e:
print(e)
pc+=1
time.sleep(random.randint(1,4))
print(data)
with open('rightmove.csv', 'w') as file:
header = "Title;propertyType;Price_Date\n"
file.write(header)
for record in data:
file.write("{};{};{}\n".format(record[0],record[1],record[2:]))
driver.quit()

You don't have to go down to dom elem by elem, you can just use xpath or class_name (if it's unique, otherwise it's better xpath or css-selector) and get the item you are looking for.
Anyway follow this:
import time
import selenium.webdriver as webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome("/usr/local/bin/chromedriver")
driver.get("https://www.rightmove.co.uk/house-prices.html")
# send query
query = "BR1"
search_bar = driver.find_element(By.XPATH, '//input[#class="searchBox ac_input"]')
search_bar.send_keys(query)
search_bar.send_keys(Keys.ENTER)
# wait to result been loaded
WebDriverWait(driver, 10).until(
EC.presence_of_element_located((By.CLASS_NAME, 'propertyCard'))
)
#get amount of pages
pages = driver.find_element(By.XPATH, '//span[#class="pagination-label"][2]').text
pages = int(pages.replace('of ', ''))
data = []
i = 1
while i <= pages:
WebDriverWait(driver, 10).until(
EC.element_to_be_clickable((By.XPATH, '//div[contains(text(), "Next")]'))
).click()
# wait page load result
WebDriverWait(driver, 10).until(
EC.presence_of_element_located((By.XPATH, '//div//div[#class="propertyCard"]'))
)
propertyCards = driver.find_elements(By.XPATH, '//div//div[#class="propertyCard"]')
# loop over result and store data
for propertyCard in propertyCards:
title = propertyCard.find_element(By.CLASS_NAME, 'title').text
propertyType = propertyCard.find_element(By.CLASS_NAME, 'propertyType').text
data.append((title, propertyType))
time.sleep(1)
i += 1
print("you reach the last page")
#get number of results
printf(data)
driver.close()
I use a list of tuple cause in your example you want store 2 item, if you want store more data you can use a dict and then convert into csv with Dictwriter directly. Enjoy.

Python Selenium: Changing from three loops to one loop repeat the same information

I am extracting google reviews of a resturant. I am interested in extracting reviewer name, rating given by reviewer, and text of the review. I used following code for the extraction:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.action_chains import ActionChains
import time
driver = webdriver.Chrome('')
base_url = 'https://www.google.com/search?tbs=lf:1,lf_ui:9&tbm=lcl&sxsrf=AOaemvJFjYToqQmQGGnZUovsXC1CObNK1g:1633336974491&q=10+famous+restaurants+in+Dunedin&rflfq=1&num=10&sa=X&ved=2ahUKEwiTsqaxrrDzAhXe4zgGHZPODcoQjGp6BAgKEGo&biw=1280&bih=557&dpr=2#lrd=0xa82eac0dc8bdbb4b:0x4fc9070ad0f2ac70,1,,,&rlfi=hd:;si:5749134142351780976,l,CiAxMCBmYW1vdXMgcmVzdGF1cmFudHMgaW4gRHVuZWRpbiJDUjEvZ2VvL3R5cGUvZXN0YWJsaXNobWVudF9wb2kvcG9wdWxhcl93aXRoX3RvdXJpc3Rz2gENCgcI5Q8QChgFEgIIFkiDlJ7y7YCAgAhaMhAAEAEQAhgCGAQiIDEwIGZhbW91cyByZXN0YXVyYW50cyBpbiBkdW5lZGluKgQIAxACkgESaXRhbGlhbl9yZXN0YXVyYW50mgEkQ2hkRFNVaE5NRzluUzBWSlEwRm5TVU56ZW5WaFVsOUJSUkFCqgEMEAEqCCIEZm9vZCgA,y,2qOYUvKQ1C8;mv:[[-45.8349553,170.6616387],[-45.9156414,170.4803685]]'
driver.get(base_url)
WebDriverWait(driver,10).until(EC.element_to_be_clickable((By.XPATH,"//div[./span[text()='Newest']]"))).click()
total_reviews_text =driver.find_element_by_xpath("//div[#class='review-score-container']//div//div//span//span[#class='z5jxId']").text
num_reviews = int (total_reviews_text.split()[0])
all_reviews = WebDriverWait(driver, 20).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, 'div.gws-localreviews__google-review')))
time.sleep(2)
total_reviews = len(all_reviews)
while total_reviews < num_reviews:
driver.execute_script('arguments[0].scrollIntoView(true);', all_reviews[-1])
WebDriverWait(driver, 5, 0.25).until_not(EC.presence_of_element_located((By.CSS_SELECTOR, 'div[class$="activityIndicator"]')))
#all_reviews = driver.find_elements_by_css_selector('div.gws-localreviews__google-review')
time.sleep(5)
all_reviews = WebDriverWait(driver, 5).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, 'div.gws-localreviews__google-review')))
print(total_reviews)
total_reviews +=5
person_info = driver.find_elements_by_xpath("//div[#id='reviewSort']//div[contains(#class,'google-review')]")
rating_info = driver.find_elements_by_xpath("//div[#class='PuaHbe']")
review_text = driver.find_elements_by_xpath("//div[#class='Jtu6Td']")
for person in person_info:
name = person.find_element_by_xpath("./div/div/div/a").text
print(name)
for rating in rating_info:
rating_txt = person.find_element_by_xpath("./g-review-stars/span").get_attribute('aria-label')
print(rating_txt)
for text in review_text:
texts = text.find_element_by_xpath("./span").text
print(texts)
The above code worked as per expectations. I want to make slight change in above code. Instead of using three loops to display name, rating, and review_text. I wanted to extract the same information using one loop. So I made following changes in the above code:
reviews_info = driver.find_elements_by_xpath("//div[#class='jxjCjc']")
for review_info in reviews_info:
name = review_info.find_element_by_xpath("./div/div/a").text
rating = review_info.find_element_by_xpath("//div[#class='PuaHbe']//g-review-stars//span").get_attribute('aria-label')
text = review_info.find_element_by_xpath("//div[#class='Jtu6Td']//span").text
print(name)
print(rating)
print(text)
print()
The problem with a change in code is that it displays the same information (i.e. rating and text) for all reviewers names. I am not sure where am I making the mistake. Any help to fix the issue would be really appreciated.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
driver = webdriver.Chrome()
base_url = 'https://www.google.com/search?tbs=lf:1,lf_ui:9&tbm=lcl&sxsrf=AOaemvJFjYToqQmQGGnZUovsXC1CObNK1g:1633336974491&q=10+famous+restaurants+in+Dunedin&rflfq=1&num=10&sa=X&ved=2ahUKEwiTsqaxrrDzAhXe4zgGHZPODcoQjGp6BAgKEGo&biw=1280&bih=557&dpr=2#lrd=0xa82eac0dc8bdbb4b:0x4fc9070ad0f2ac70,1,,,&rlfi=hd:;si:5749134142351780976,l,CiAxMCBmYW1vdXMgcmVzdGF1cmFudHMgaW4gRHVuZWRpbiJDUjEvZ2VvL3R5cGUvZXN0YWJsaXNobWVudF9wb2kvcG9wdWxhcl93aXRoX3RvdXJpc3Rz2gENCgcI5Q8QChgFEgIIFkiDlJ7y7YCAgAhaMhAAEAEQAhgCGAQiIDEwIGZhbW91cyByZXN0YXVyYW50cyBpbiBkdW5lZGluKgQIAxACkgESaXRhbGlhbl9yZXN0YXVyYW50mgEkQ2hkRFNVaE5NRzluUzBWSlEwRm5TVU56ZW5WaFVsOUJSUkFCqgEMEAEqCCIEZm9vZCgA,y,2qOYUvKQ1C8;mv:[[-45.8349553,170.6616387],[-45.9156414,170.4803685]]'
driver.get(base_url)
WebDriverWait(driver,10).until(EC.element_to_be_clickable((By.XPATH,"//div[./span[text()='Newest']]"))).click()
total_reviews_text =driver.find_element_by_xpath("//div[#class='review-score-container']//div//div//span//span[#class='z5jxId']").text
num_reviews = int (total_reviews_text.split()[0])
print("NUm reviews=", num_reviews)
all_reviews = WebDriverWait(driver, 20).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, 'div.gws-localreviews__google-review')))
time.sleep(2)
total_reviews = len(all_reviews)
print("Total reviews=", total_reviews)
s = "(//div[#id='reviewSort']//div[contains(#class,'google-review')])[0]"
b = '0'
a = 1 # Index of Review button
for i in range(10):
c = str(a)
s = s.replace(b, c) # Updating Xpath's index in every loop so that it can focus on new review everytime.
b = str(a)
a = a + 1
WebDriverWait(driver, 5, 0.25).until_not(EC.presence_of_element_located((By.CSS_SELECTOR, 'div[class$="activityIndicator"]')))
time.sleep(5)
all_reviews = WebDriverWait(driver, 5).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, 'div.gws-localreviews__google-review')))
total_reviews +=1
Info = driver.find_element_by_xpath(s).text
print(Info)
print("<------------------------------------------------------>\n\n")
Output:-
Click Here to See Program Output

Wrong output in the CSV file using xPath expression

I wrote a code to get the following value "Exam Code", "Exam Name" and "Total Question". The issue is that in the put CSV file I am getting the wrong value in the "Exam Code" column. I am getting the same value as "Exam Name". The xPath looks fine to me. I don't know where is the issue happening.
Following is the code:
import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException
import time
option = Options()
option.add_argument("--disable-infobars")
option.add_argument("start-maximized")
option.add_argument("--disable-extensions")
option.add_experimental_option("excludeSwitches", ['enable-automation'])
# Pass the argument 1 to allow and 2 to block
# option.add_experimental_option("prefs", {
# "profile.default_content_setting_values.notifications": 1
# })
driver = webdriver.Chrome(chrome_options=option, executable_path='C:\\Users\\Awais\\Desktop\\web crawling\\chromedriver.exe')
url = ["https://www.marks4sure.com/210-060-exam.html",
"https://www.marks4sure.com/210-065-exam.html",
"https://www.marks4sure.com/200-355-exam.html",
"https://www.marks4sure.com/9A0-127-exam.html",
"https://www.marks4sure.com/300-470-exam.html",]
driver.implicitly_wait(0.5)
na = "N/A"
# text = 'Note: This exam is available on Demand only. You can Pre-Order this Exam and we will arrange this for you.'
links = []
exam_code = []
exam_name = []
total_q = []
for items in range(0, 5):
driver.get(url[items])
# if driver.find_element_by_xpath("//div[contains(#class, 'alert') and contains(#class, 'alert-danger')]") == text:
# continue
items += 1
try:
c_url = driver.current_url
links.append(c_url)
except:
pass
try:
codes = driver.find_element_by_xpath('''//div[contains(#class, 'col-sm-6') and contains(#class, 'exam-row-data') and position() = 2]''')
exam_code.append(codes.text)
except:
exam_code.append(na)
try:
names = driver.find_element_by_xpath('//*[#id="content"]/div/div[1]/div[2]/div[3]/div[2]/a')
exam_name.append(names.text)
except:
exam_name.append(na)
try:
question = driver.find_element_by_xpath('//*[#id="content"]/div/div[1]/div[2]/div[4]/div[2]/strong')
total_q.append(question.text)
except:
total_q.append(na)
continue
all_info = list(zip(links, exam_name, exam_name, total_q))
print(all_info)
df = pd.DataFrame(all_info, columns=["Links", "Exam Code", "Exam Name", "Total Question"])
df.to_csv("data5.csv", index=False)
driver.close()

You are getting the exam name in there twice, and instead of exam codes because that's what you are telling it to do (minor typo here with having exam_name in there twice):
all_info = list(zip(links, exam_name, exam_name, total_q))
change to: all_info = list(zip(links, exam_code, exam_name, total_q))
Few things I'm confused about.
1) Why use Selnium? There is no need for selenium as the data is returned in the initial request in the html source. So I would just use requests as it would speed up the processing.
2) The link and the exam code are already in the url you are iterating through. I would just split or use regex to that string to get the link and the code. You only really need to get the exam name and number of questions then.
With that being said, I adjusted it slightly to just get exam name and number of questions:
import requests
from bs4 import BeautifulSoup
import pandas as pd
urls = ["https://www.marks4sure.com/210-060-exam.html",
"https://www.marks4sure.com/210-065-exam.html",
"https://www.marks4sure.com/200-355-exam.html",
"https://www.marks4sure.com/9A0-127-exam.html",
"https://www.marks4sure.com/300-470-exam.html",]
links = []
exam_code = []
exam_name = []
total_q = []
for url in urls:
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
links.append(url)
exam_code.append(url.rsplit('-exam')[0].split('/')[-1])
exam_row = soup.select('div[class*="exam-row-data"]')
for exam in exam_row:
if exam.text == 'Exam Name: ':
exam_name.append(exam.find_next_sibling("div").text)
continue
if 'Questions' in exam.text and 'Total Questions' not in exam.text:
total_q.append(exam.text.strip())
continue
all_info = list(zip(links, exam_code, exam_name, total_q))
print(all_info)
df = pd.DataFrame(all_info, columns=["Links", "Exam Code", "Exam Name", "Total Question"])
df.to_csv("data5.csv", index=False)

Hi to get the exam code I think it is better to work with regex and get it from URL itself.
Also below code gives me the exam codes correctly except for 4th link which has a different structure as compared to others.
# -*- coding: utf-8 -*-
"""
Created on Fri Mar 6 14:48:00 2020
#author: prakh
"""
import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException
import time
option = Options()
option.add_argument("--disable-infobars")
option.add_argument("start-maximized")
option.add_argument("--disable-extensions")
option.add_experimental_option("excludeSwitches", ['enable-automation'])
# Pass the argument 1 to allow and 2 to block
# option.add_experimental_option("prefs", {
# "profile.default_content_setting_values.notifications": 1
# })
driver = webdriver.Chrome(executable_path='C:/Users/prakh/Documents/PythonScripts/chromedriver.exe')
url = ["https://www.marks4sure.com/210-060-exam.html",
"https://www.marks4sure.com/210-065-exam.html",
"https://www.marks4sure.com/200-355-exam.html",
"https://www.marks4sure.com/9A0-127-exam.html",
"https://www.marks4sure.com/300-470-exam.html",]
driver.implicitly_wait(0.5)
na = "N/A"
# text = 'Note: This exam is available on Demand only. You can Pre-Order this Exam and we will arrange this for you.'
links = []
exam_code = []
exam_name = []
total_q = []
for items in range(0, 5):
driver.get(url[items])
# if driver.find_element_by_xpath("//div[contains(#class, 'alert') and contains(#class, 'alert-danger')]") == text:
# continue
items += 1
try:
c_url = driver.current_url
links.append(c_url)
except:
pass
try:
codes = driver.find_element_by_xpath('//*[#id="content"]/div/div[1]/div[2]/div[2]/div[2]')
exam_code.append(codes.text)
except:
exam_code.append(na)
try:
names = driver.find_element_by_xpath('//*[#id="content"]/div/div[1]/div[2]/div[3]/div[2]/a')
exam_name.append(names.text)
except:
exam_name.append(na)
try:
question = driver.find_element_by_xpath('//*[#id="content"]/div/div[1]/div[2]/div[4]/div[2]/strong')
total_q.append(question.text)
except:
total_q.append(na)
continue
all_info = list(zip(links, exam_code, exam_name, total_q))
print(all_info)
df = pd.DataFrame(all_info, columns=["Links", "Exam Code", "Exam Name", "Total Question"])
df.to_csv("data5.csv", index=False)
driver.close()

You don't need selenium because the source code contains the info you need without using JavaScript.
Also, most pages redirect to marks4sure.com/200-301-exam.html, so you'll get the same results. Only marks4sure.com/300-470-exam.html don't.
import requests
from bs4 import BeautifulSoup
urls = ["https://www.marks4sure.com/210-060-exam.html",
"https://www.marks4sure.com/210-065-exam.html",
"https://www.marks4sure.com/200-355-exam.html",
"https://www.marks4sure.com/9A0-127-exam.html",
"https://www.marks4sure.com/300-470-exam.html",]
with open("output.csv", "w") as f:
f.write("exam_code,exam_name,exam_quest\n")
for url in urls:
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html5lib')
for n, v in enumerate(soup.find_all(class_ = "col-sm-6 exam-row-data")):
if n == 1:
exam_code = v.text.strip()
if n == 3:
exam_name = v.text.strip()
if n == 5:
exam_quest = v.text.strip()
f.write(f"{exam_code},{exam_name},{exam_quest}\n")

StaleElementReferenceException in iteration with Selenium

Hello I'm trying to scrap some info from t he following page:
http://verify.sos.ga.gov/verification/
My code is the following:
import sys
reload(sys)
sys.setdefaultencoding('utf8')
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import Select
import time
import csv
url = 'http://verify.sos.ga.gov/verification/'
def init_Selenium():
global driver
driver = webdriver.Chrome("/Users/rodrigopeniche/Downloads/chromedriver")
driver.get(url)
def select_profession():
select = Select(driver.find_element_by_name('t_web_lookup__profession_name'))
options = select.options
for index in range(1, len(options) - 1):
select = Select(driver.find_element_by_name('t_web_lookup__profession_name'))
select.select_by_index(index)
select_license_type()
def select_license_type():
select = Select(driver.find_element_by_name('t_web_lookup__license_type_name'))
options = select.options
for index in range(1, len(options) - 1):
select = Select(driver.find_element_by_name('t_web_lookup__license_type_name'))
select.select_by_index(index)
search_button = driver.find_element_by_id('sch_button')
driver.execute_script('arguments[0].click();', search_button)
scrap_licenses_results()
def scrap_licenses_results():
table_rows = driver.find_elements_by_tag_name('tr')
for index, row in enumerate(table_rows):
if index < 9:
continue
else:
attributes = row.find_elements_by_xpath('td')
try:
name = attributes[0].text
license_number = attributes[1].text
profession = attributes[2].text
license_type = attributes[3].text
status = attributes[4].text
address = attributes[5].text
license_details_page_link = attributes[0].find_element_by_id('datagrid_results__ctl3_name').get_attribute('href')
driver.get(license_details_page_link)
data_rows = driver.find_elements_by_class_name('rdata')
issued_date = data_rows[len(data_rows) - 3].text
expiration_date = data_rows[len(data_rows) - 2].text
last_renewal_day = data_rows[len(data_rows) - 1].text
print name, license_number, profession, license_type, status, address, issued_date, expiration_date, last_renewal_day
driver.back()
except:
pass
init_Selenium()
select_profession()
When I execute the script it works for the first iteration but fails in the second one. The exact place where the error is raised is in the scrap_licenses_results() function, in the attributes = row.find_elements_by_xpath('td') line.
Any help will be appreciated

The staleElementReferenceException is due to the list of rows gathered before loop iteration. Initially, You created a list of all rows ,named table_rows.
table_rows = driver.find_elements_by_tag_name('tr')
Now in loop, during first iteration, your first row element is fresh and can be found by the driver. At the end of first iteration, you are doing driver.back(), your page changes/refreshes HTML DOM . All the previously gathered references are lost now. All the rows in your table_rows list are now stale. Hence, in 2nd iteration you are facing such exception.
You have to move the find row operation in the loop, so that everytime a fresh reference is found on target application. The psuedocode shall do Something like this.
total_rows = driver.find_elements_by_tag_name('tr').length()
for i in total_rows
driver.find_element_by_xpath('//tr[i]')
.. further code..

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Already complete scraping scrapes everything on the page. I would like to limit the scraping to only a certain section - python

Related

Scraped data is not saving to csv file as it keeps returning a blank csv file

Webscraping Multiple Pages in Python with Selenium - loop not working

Python Selenium: Changing from three loops to one loop repeat the same information

Wrong output in the CSV file using xPath expression

StaleElementReferenceException in iteration with Selenium

Categories

Resources