Python - Selenium/BeautifulSoup PDF & table scraper

I am trying to write a robust web scraper that can be used on a variety of keywords. It is currently set up to find the first PDF on the first page of a Google search; if no PDF is found, it clicks into the first Google search result, grabs the closest HTML table, and stores it in a pandas DataFrame, which will later be written out with DataFrame.to_excel to produce an Excel file. I've asked a few questions like this before, but I believe I now have a decent amount of the components:
a = ["term 1", "term 2 ", "term 3"]
b = ["school 1", "school 2 ", "school 3"]
c = ["program 1", "program 2", "program 3"]
keys = []
for x, y, z in [(x, y, z) for x in a for y in b for z in c]:
    keys.append(z + " " + x + " " + y)
path_to_driver = r"C:\wherever_you_chose_the_download_path_to_be\chromedriver.exe"
download_dir = r"C:\wherever_you_want_to_place_the_downloaded_file\name_of_pdf_holder_file"
chrome_options = Options()
chrome_options.add_experimental_option("prefs", {
    "download.default_directory": download_dir,
    "download.prompt_for_download": False,
})
chrome_options.add_argument("--headless")
chrome_options.add_argument("--ignore-certificate-errors")
chrome_options.add_argument("--incognito")
driver = webdriver.Chrome(path_to_driver, options=chrome_options)
driver.command_executor._commands["send_command"] = ("POST", '/session/$sessionId/chromium/send_command')
params = {'cmd': 'Page.setDownloadBehavior', 'params': {'behavior': 'allow', 'downloadPath': download_dir}}
command_result = driver.execute("send_command", params)
for k, key in enumerate(keys):
    try:
        start = time.time()
        driver.implicitly_wait(10)
        driver.get("https://www.google.com/")
        sleep_between_interactions = 5
        searchbar = driver.find_element_by_name("q")
        searchbar.send_keys(key)
        searchbar.send_keys(Keys.ARROW_DOWN)
        searchbar.send_keys(Keys.RETURN)
        pdf_element = driver.find_elements(By.XPATH, "//a[contains(@href, '.pdf')]")
        key_index_number = str(keys.index(key) + 1)
        key_length = str(len(keys))
        print(key_index_number + " out of " + key_length)
        if len(pdf_element) > 0 and key_length < key_index_number:
            print("pdf found for: " + key)
            pdf_element[0].click()
            time.sleep(sleep_between_interactions)
            print("downloaded " + key_index_number + " out of " + str(len(keys)))
        elif len(pdf_element) == 0 and key_index_number != key_length:
            print("pdf NOT found for " + key)
            print(key + " pdf not downloaded, moving on...")
            try:
                google_search = f"https://www.google.com/search?q={key}"
                driver.get(google_search)
                clicked_link = driver.find_element(By.XPATH, '(//h3)[1]/../../a').click()
                driver.implicitly_wait(10)
                html_source_code = driver.execute_script("return document.body.innerHTML;")
                html_soup: BeautifulSoup = BeautifulSoup(html_source_code, 'html.parser')
                url = '{}'.format(html_soup)
                r = requests.get(url)
                soup = bs(r.content, 'lxml')
                for table in soup.select('.table'):
                    tbl = pd.read_html(str(table))[0]
                    links_column = ['{}'.format(url) + i.select_one('.*')['href'] if i.select_one('.*') is not None else '' for i in table.select('td:nth-of-type(1)')]
                    tbl['Links'] = links_column
                continue
            except:
                print("something happened, probably a JS error which will be dealt with")
                continue
    except IndexError as index_error:
        print("Couldn't find pdf file for " + "\"" + key + "\"" + " due to Index Error, moving on....")
        print(key_index_number + " out of " + str(len(keys)))
        continue
    except NoSuchElementException:
        print("search bar didn't load, iterating next in loop")
        print(" pdf NOT found for " + key)
        print(key + " pdf not downloaded, moving on...")
        continue
    except ElementNotInteractableException:
        print("element either didn't load or doesn't exist")
        driver.get("https://www.google.com/")
        continue
So far this works okay for finding some PDFs, but it isn't capturing any of the tables it comes across.
I tried something like this in the past:
url_search = f"https://www.google.com/search?q={key}"
request = requests.get(url_search)
soup = BeautifulSoup(request.text, "lxml")
first_link = soup.find("div", class_="BNeawe").text
links_list.append(first_link)
but it only returned a list of the titles of each link without actually clicking through.
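I suspect the fix is to climb from that title div up to its parent a tag and take the href instead of the text; here is a rough sketch of what I mean (assuming the "BNeawe" div still sits inside an anchor, which may change as Google's markup changes):
# rough sketch: climb from the result-title div to the surrounding <a> and keep its href
first_title = soup.find("div", class_="BNeawe")
if first_title is not None:
    anchor = first_title.find_parent("a")
    if anchor is not None and anchor.get("href"):
        # Google usually wraps result links, e.g. "/url?q=https://example.com/...&sa=..."
        links_list.append(anchor["href"])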
Currently, I'm also thinking about how to store the pdfs by search term groupings.
Lastly, I've attempted to pass around and use the HTML source in the past, but that also gave me a JavaScriptException, which doesn't seem to be an exception I can handle properly in Python; I may be missing something here, though.
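From what I can tell, selenium exposes this as JavascriptException in selenium.common.exceptions, so the handling I have in mind would look roughly like this (a sketch, assuming that class exists in my selenium version):
from selenium.common.exceptions import JavascriptException

# sketch: catch the JS error from execute_script and skip this key instead of crashing
try:
    html_source_code = driver.execute_script("return document.body.innerHTML;")
except JavascriptException:
    print("JS error while grabbing the page source, moving on...")
    html_source_code = None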
I'm also hoping, if possible, to turn the a, b, c, etc. lists into input() calls that can be used for future UI creation. I'm not sure how doable this is in Python, but it's worth a shot as long as I can get the main part of this code working the way I want.
I'm not sure whether this can be accomplished with XPath or a CSS selector, or how I should approach the problem to avoid these issues.
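To be concrete, the table branch I'm ultimately aiming for would look roughly like this (the output file and sheet names are placeholders, and "closest" table is approximated as the first one):
import pandas as pd

# rough sketch of the table-to-Excel step I'm aiming for
try:
    tables = pd.read_html(driver.page_source)   # one DataFrame per <table> on the clicked-through page
except ValueError:                              # pandas raises ValueError when no <table> is found
    tables = []
if tables:
    closest_table = tables[0]                   # first table as a stand-in for "closest"
    closest_table.to_excel(f"{key}.xlsx", sheet_name="scraped_table", index=False)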

Related

Can I pause a scroll function in selenium, scrape the current data, and then continue scrolling later in the script?

I am a student working on a scraping project, and I am having trouble completing my script because it fills my computer's memory with all of the data it stores.
It currently stores all of my data until the end, so my solution would be to break the scrape into smaller bits and write the data out periodically, rather than building one big list and writing it all out at the end.
In order to do this, I would need to stop my scroll method, scrape the loaded profiles, write out the data that I have collected, and then repeat this process without duplicating my data. It would be appreciated if someone could show me how to do this. Thank you for your help :)
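Roughly, the periodic write-out I have in mind is something like this (the batch size and file name are placeholders):
import csv

BATCH_SIZE = 50   # placeholder batch size

def flush_batch(rows, path="Spredsheet.csv"):
    """Append the current batch of rows to the CSV and return an empty list."""
    with open(path, "a", newline="") as f:
        csv.writer(f).writerows(rows)
    return []

# inside the scraping loop, after building one profile's row:
# Data.append(row)
# if len(Data) >= BATCH_SIZE:
#     Data = flush_batch(Data)   # memory is freed, data is already on disk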
Here's my current code:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from time import sleep
from selenium.common.exceptions import NoSuchElementException

Data = []

driver = webdriver.Chrome()
driver.get("https://directory.bcsp.org/")

count = int(input("Number of Pages to Scrape: "))
body = driver.find_element_by_xpath("//body")
profile_count = driver.find_elements_by_xpath("//div[@align='right']/a")

while len(profile_count) < count:  # Get links up to "count"
    body.send_keys(Keys.END)
    sleep(1)
    profile_count = driver.find_elements_by_xpath("//div[@align='right']/a")

for link in profile_count:  # Calling up links
    temp = link.get_attribute('href')  # temp for
    driver.execute_script("window.open('');")  # open new tab
    driver.switch_to.window(driver.window_handles[1])  # focus new tab
    driver.get(temp)

    # scrape code
    Name = driver.find_element_by_xpath('/html/body/table/tbody/tr/td/table/tbody/tr/td[5]/div/table[1]/tbody/tr/td[1]/div[2]/div').text
    IssuedBy = "Board of Certified Safety Professionals"
    CertificationorDesignaationNumber = driver.find_element_by_xpath('/html/body/table/tbody/tr/td/table/tbody/tr/td[5]/div/table[1]/tbody/tr/td[3]/table/tbody/tr[1]/td[3]/div[2]').text
    CertfiedorDesignatedSince = driver.find_element_by_xpath('/html/body/table/tbody/tr/td/table/tbody/tr/td[5]/div/table[1]/tbody/tr/td[3]/table/tbody/tr[3]/td[1]/div[2]').text
    try:
        AccreditedBy = driver.find_element_by_xpath('/html/body/table/tbody/tr/td/table/tbody/tr/td[5]/div/table[1]/tbody/tr/td[3]/table/tbody/tr[5]/td[3]/div[2]/a').text
    except NoSuchElementException:
        AccreditedBy = "N/A"
    try:
        Expires = driver.find_element_by_xpath('/html/body/table/tbody/tr/td/table/tbody/tr/td[5]/div/table[1]/tbody/tr/td[3]/table/tbody/tr[5]/td[1]/div[2]').text
    except NoSuchElementException:
        Expires = "N/A"

    info = Name, IssuedBy, CertificationorDesignaationNumber, CertfiedorDesignatedSince, AccreditedBy, Expires + "\n"
    Data.extend(info)

    driver.close()
    driver.switch_to.window(driver.window_handles[0])

with open("Spredsheet.txt", "w") as output:
    output.write(','.join(Data))

driver.close()
Try the below approach using requests and BeautifulSoup. In the script below I have used the API URL that the website itself calls under the hood.
On the first iteration it creates the base URL, then writes the headers and the data into the .csv file.
On each subsequent iteration it creates the URL again with two extra parameters, start_on_page and show_per_page, where start_on_page starts at 20 and is incremented by 20 on each iteration, and show_per_page is set to 100 so that 100 records are extracted per iteration, and so on until all of the data has been dumped into the .csv file.
The script dumps four things: number, name, location, and profile URL.
On each iteration the data is appended to the .csv file, so this approach will resolve your memory issue.
Before running the script, do not forget to set the file_path variable to the directory where you want the .csv file to be created.
import requests
from urllib3.exceptions import InsecureRequestWarning
requests.packages.urllib3.disable_warnings(InsecureRequestWarning)
from bs4 import BeautifulSoup as bs
import csv

def scrap_directory_data():
    list_of_credentials = []
    file_path = ''
    file_name = 'credential_list.csv'
    count = 0
    page_number = 0
    page_size = 100
    create_url = ''
    main_url = 'https://directory.bcsp.org/search_results.php?'
    first_iteration_url = 'first_name=&last_name=&city=&state=&country=&certification=&unauthorized=0&retired=0&specialties=&industries='
    number_of_records = 0
    csv_headers = ['#', 'Name', 'Location', 'Profile URL']

    while True:
        if count == 0:
            create_url = main_url + first_iteration_url
            print('-' * 100)
            print('1 iteration URL created: ' + create_url)
            print('-' * 100)
        else:
            create_url = main_url + 'start_on_page=' + str(page_number) + '&show_per_page=' + str(page_size) + '&' + first_iteration_url
            print('-' * 100)
            print('Other then first iteration URL created: ' + create_url)
            print('-' * 100)

        page = requests.get(create_url, verify=False)
        extracted_text = bs(page.text, 'lxml')
        result = extracted_text.find_all('tr')

        if len(result) > 0:
            for idx, data in enumerate(result):
                if idx > 0:
                    number_of_records += 1
                    name = data.contents[1].text
                    location = data.contents[3].text
                    profile_url = data.contents[5].contents[0].attrs['href']
                    list_of_credentials.append({
                        '#': number_of_records,
                        'Name': name,
                        'Location': location,
                        'Profile URL': profile_url
                    })
                print(data)
                with open(file_path + file_name, 'a+') as cred_CSV:
                    csvwriter = csv.DictWriter(cred_CSV, delimiter=',', lineterminator='\n', fieldnames=csv_headers)
                    if idx == 0 and count == 0:
                        print('Writing CSV header now...')
                        csvwriter.writeheader()
                    else:
                        for item in list_of_credentials:
                            print('Writing data rows now..')
                            print(item)
                            csvwriter.writerow(item)
                        list_of_credentials = []
        count += 1
        page_number += 20

scrap_directory_data()

How do I export data from multiple pages into a csv file?

I am working on a scraping project, and am in the final stages. Right now, my code can navigate to the first profile, scrape the data from that profile, print that data, then move on to the next profile, and repeat the process. Now, I want to put the data I collect into a csv file instead of printing it. I am not sure how to do this, so I am looking for guidance/updates to my current code. Thank you for your help!
My current code:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from time import sleep
from selenium.common.exceptions import NoSuchElementException

driver = webdriver.Chrome("/Users/nzalle/Downloads/chromedriver")
driver.get("https://directory.bcsp.org/")

count = int(input("Number of Profiles to Scrape: "))
body = driver.find_element_by_xpath("//body")
profile_count = driver.find_elements_by_xpath("//div[@align='right']/a")

while len(profile_count) < count:  # Get links up to "count"
    body.send_keys(Keys.END)
    sleep(1)
    profile_count = driver.find_elements_by_xpath("//div[@align='right']/a")

for link in profile_count:  # Calling up links
    temp = link.get_attribute('href')  # temp for
    driver.execute_script("window.open('');")  # open new tab
    driver.switch_to.window(driver.window_handles[1])  # focus new tab
    driver.get(temp)

    # scrape code
    Name = driver.find_element_by_xpath('/html/body/table/tbody/tr/td/table/tbody/tr/td[5]/div/table[1]/tbody/tr/td[1]/div[2]/div').text
    IssuedBy = "Board of Certified Safety Professionals"
    CertificationorDesignaationNumber = driver.find_element_by_xpath('/html/body/table/tbody/tr/td/table/tbody/tr/td[5]/div/table[1]/tbody/tr/td[3]/table/tbody/tr[1]/td[3]/div[2]').text
    CertfiedorDesignatedSince = driver.find_element_by_xpath('/html/body/table/tbody/tr/td/table/tbody/tr/td[5]/div/table[1]/tbody/tr/td[3]/table/tbody/tr[3]/td[1]/div[2]').text
    try:
        AccreditedBy = driver.find_element_by_xpath('/html/body/table/tbody/tr/td/table/tbody/tr/td[5]/div/table[1]/tbody/tr/td[3]/table/tbody/tr[5]/td[3]/div[2]/a').text
    except NoSuchElementException:
        AccreditedBy = "N/A"
    try:
        Expires = driver.find_element_by_xpath('/html/body/table/tbody/tr/td/table/tbody/tr/td[5]/div/table[1]/tbody/tr/td[3]/table/tbody/tr[5]/td[1]/div[2]').text
    except NoSuchElementException:
        Expires = "N/A"

    Data = (Name + " , " + IssuedBy + " , " + CertificationorDesignaationNumber + " , " + CertfiedorDesignatedSince + " , " + AccreditedBy + " , " + Expires)
    print(Data)

    driver.close()
    driver.switch_to.window(driver.window_handles[0])

driver.close()
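For reference, the kind of CSV step I'm hoping to end up with would look roughly like this (the file name and header row are just guesses on my part):
import csv
import os

# rough sketch of the CSV export step; file name and header names are placeholders
csv_path = "profiles.csv"
write_header = not os.path.exists(csv_path)

with open(csv_path, "a", newline="") as f:
    writer = csv.writer(f)
    if write_header:
        writer.writerow(["Name", "Issued By", "Certification Number",
                         "Certified Since", "Accredited By", "Expires"])
    # one row per profile, written right after the fields are scraped
    writer.writerow([Name, IssuedBy, CertificationorDesignaationNumber,
                     CertfiedorDesignatedSince, AccreditedBy, Expires])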

Web crawler not able to process more than one webpage

I am trying to extract some information about MTG cards from a webpage with the following program, but I repeatedly retrieve information about the initial page given (InitUrl). The crawler is unable to proceed further. I have started to believe that I am not using the correct URLs, or maybe there is a restriction on using urllib that slipped my attention. Here is the code that I have struggled with for weeks now:
import re
from math import ceil
from urllib.request import urlopen as uReq, Request
from bs4 import BeautifulSoup as soup

InitUrl = "https://mtgsingles.gr/search?q=dragon"
NumOfCrawledPages = 0
URL_Next = ""
NumOfPages = 4  # depth of pages to be retrieved
query = InitUrl.split("?")[1]

for i in range(0, NumOfPages):
    if i == 0:
        Url = InitUrl
    else:
        Url = URL_Next
    print(Url)

    UClient = uReq(Url)  # downloading the url
    page_html = UClient.read()
    UClient.close()

    page_soup = soup(page_html, "html.parser")
    cards = page_soup.findAll("div", {"class": ["iso-item", "item-row-view"]})

    for card in cards:
        card_name = card.div.div.strong.span.contents[3].contents[0].replace("\xa0 ", "")
        if len(card.div.contents) > 3:
            cardP_T = card.div.contents[3].contents[1].text.replace("\n", "").strip()
        else:
            cardP_T = "Does not exist"
        cardType = card.contents[3].text
        print(card_name + "\n" + cardP_T + "\n" + cardType + "\n")

    try:
        URL_Next = InitUrl + "&page=" + str(i + 2)
        print("The next URL is: " + URL_Next + "\n")
    except IndexError:
        print("Crawling process completed! No more infomation to retrieve!")
    else:
        NumOfCrawledPages += 1
        Url = URL_Next
    finally:
        print("Moving to page : " + str(NumOfCrawledPages + 1) + "\n")
One of the reasons your code fails is that you don't use cookies. The site seems to require them to allow paging.
A clean and simple way of extracting the data you're interested in would be like this:
import requests
from bs4 import BeautifulSoup
# the site actually uses this url under the hood for paging - check out Google Dev Tools
paging_url = "https://mtgsingles.gr/search?ajax=products-listing&lang=en&page={}&q=dragon"
return_list = []
# the page-scroll will only work when we support cookies
# so we fetch the page in a session
session = requests.Session()
session.get("https://mtgsingles.gr/")
All pages have a next button except the last one, so we can use this to loop until the next button goes away. When it does - meaning the last page has been reached - the button is replaced with an 'li' tag with the class 'next hidden', which only exists on the last page.
Now we're ready to start looping:
page = 1            # set count for start page
keep_paging = True  # use flag to end loop when last page is reached

while keep_paging:
    print("[*] Extracting data for page {}".format(page))
    r = session.get(paging_url.format(page))
    soup = BeautifulSoup(r.text, "html.parser")
    items = soup.select('.iso-item.item-row-view.clearfix')
    for item in items:
        name = item.find('div', class_='col-md-10').get_text().strip().split('\xa0')[0]
        toughness_element = item.find('div', class_='card-power-toughness')
        try:
            toughness = toughness_element.get_text().strip()
        except:
            toughness = None
        cardtype = item.find('div', class_='cardtype').get_text()
        card_dict = {
            "name": name,
            "toughness": toughness,
            "cardtype": cardtype
        }
        return_list.append(card_dict)

    if soup.select('li.next.hidden'):  # this element only exists if the last page is reached
        keep_paging = False
        print("[*] Scraper is done. Quitting...")
    else:
        page += 1

# do stuff with your list of dicts - e.g. load it into pandas and save it to a spreadsheet
This will keep paging until no more pages exist, no matter how many subpages the site has.
My point in the comment above was merely that if you encounter an exception in your code, your page count will never increase. That's probably not what you want, which is why I recommended that you learn a little more about the behaviour of the whole try-except-else-finally construct.
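To illustrate, here is a toy example of how the else/finally branches behave (not your actual code):
# toy example: an exception in the try block skips the else branch (so a page
# counter incremented there never advances), while finally always runs
for i in range(3):
    try:
        value = 1 / i  # raises ZeroDivisionError on the first pass
    except ZeroDivisionError:
        print("exception raised - else is skipped")
    else:
        print("no exception - else runs, counter would be incremented")
    finally:
        print("finally runs either way\n")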
I am also baffled by the request giving the same reply and ignoring the page parameter. As a quick-and-dirty solution, I can offer you to first set the page size to a number high enough to get all of the items you want (this parameter works, for some reason...):
import re
from math import ceil
import requests
from bs4 import BeautifulSoup as soup

InitUrl = Url = "https://mtgsingles.gr/search"
NumOfCrawledPages = 0
URL_Next = ""
NumOfPages = 2  # depth of pages to be retrieved
query = "dragon"
cardSet = set()

for i in range(1, NumOfPages):
    page_html = requests.get(InitUrl, params={"page": i, "q": query, "page-size": 999})
    print(page_html.url)
    page_soup = soup(page_html.text, "html.parser")
    cards = page_soup.findAll("div", {"class": ["iso-item", "item-row-view"]})

    for card in cards:
        card_name = card.div.div.strong.span.contents[3].contents[0].replace("\xa0 ", "")
        if len(card.div.contents) > 3:
            cardP_T = card.div.contents[3].contents[1].text.replace("\n", "").strip()
        else:
            cardP_T = "Does not exist"
        cardType = card.contents[3].text
        cardString = card_name + "\n" + cardP_T + "\n" + cardType + "\n"
        cardSet.add(cardString)
        print(cardString)

    NumOfCrawledPages += 1
    print("Moving to page : " + str(NumOfCrawledPages + 1) + " with " + str(len(cards)) + "(cards)\n")

Selenium Python: clicking an href by the text inside

I am trying to switch countries programmatically on this site for some automation testing. The prices are different in each country, so I am writing a little tool to help me decide where to buy from.
First, I get all the currencies into a list by doing this:
def get_all_countries():
    one = WebDriverWait(driver1, 10).until(EC.element_to_be_clickable((By.CLASS_NAME, "selected-currency")))
    one.click()
    el = WebDriverWait(driver1, 10).until(EC.visibility_of_element_located((By.CLASS_NAME, "site-selector-list")))
    list_return = []
    a_tags = el.find_elements_by_tag_name('a')
    for a in a_tags:
        list_return.append(a.text)
    return list_return
For example, it returns: ['United Kingdom', 'United States', 'France', 'Deutschland', 'España', 'Australia', 'Россия']. Then I iterate through the list, each time calling this function:
def set_country(text):
    is_change_currency_displayed = driver1.find_element_by_id("siteSelectorList").is_displayed()
    if not is_change_currency_displayed:  # get_all_countries function leaves dropdown open. Check if it is open before clicking it.
        one = WebDriverWait(driver1, 10).until(EC.element_to_be_clickable((By.CLASS_NAME, "selected-currency")))
        one.click()
    div = WebDriverWait(driver1, 10).until(EC.visibility_of_element_located((By.CLASS_NAME, "site-selector-list")))
    a_tags = div.find_elements_by_tag_name('a')
    for a in a_tags:
        try:
            if a.text == text:
                driver1.get(a.get_attribute("href"))
        except StaleElementReferenceException:
            set_country(text)
When comparing a.text to text, I got a StaleElementReferenceException. I read online that it means the object has changed since I saved it, and that a simple solution is to call the function again. However, I don't like this solution or this code very much; I think it is not efficient and takes too much time. Any ideas?
EDIT:
def main(url):
    driver1.get(url)
    to_return_string = ''
    one = WebDriverWait(driver1, 10).until(EC.element_to_be_clickable((By.CLASS_NAME, "selected-currency")))
    one.click()
    el = WebDriverWait(driver1, 10).until(EC.visibility_of_element_located((By.CLASS_NAME, "site-selector-list")))
    a_tags = el.find_elements_by_tag_name('a')
    for a in a_tags:
        atext = a.text
        ahref = a.get_attribute('href')
        try:
            is_change_currency_displayed = driver1.find_element_by_id("siteSelectorList").is_displayed()
            if not is_change_currency_displayed:  # get_all_countries function leaves dropdown open.
                one = WebDriverWait(driver1, 10).until(EC.element_to_be_clickable((By.CLASS_NAME, "selected-currency")))
                one.click()
            driver1.get(ahref)
            current_price = WebDriverWait(driver1, 10).until(
                EC.visibility_of_element_located((By.CSS_SELECTOR, ".current-price")))
            to_return_string += ("In " + atext + " : " + current_price.text + ' \n')
            print("In", atext, ":", current_price.text)
        except TimeoutException:
            print("In", atext, ":", "Timed out waiting for page to load")
            to_return_string += ("In " + atext + " : " + " Timed out waiting for page to load" + ' \n')
    return to_return_string

main('http://us.asos.com/asos//prd/7011279')
If I understand the problem statement correctly, adding a break statement solves the problem:
def set_country(text):
    is_change_currency_displayed = driver1.find_element_by_id("siteSelectorList").is_displayed()
    if not is_change_currency_displayed:  # get_all_countries function leaves dropdown open. Check if it is open before clicking it.
        one = WebDriverWait(driver1, 10).until(EC.element_to_be_clickable((By.CLASS_NAME, "selected-currency")))
        one.click()
    div = WebDriverWait(driver1, 10).until(EC.visibility_of_element_located((By.CLASS_NAME, "site-selector-list")))
    a_tags = div.find_elements_by_tag_name('a')
    for a in a_tags:
        try:
            if a.text == text:
                driver1.get(a.get_attribute("href"))
                break
        except StaleElementReferenceException:
            set_country(text)
The DOM is updated once driver.get is called, so references to the old page (i.e., a_tags) won't work.
Instead, you should break out of the loop as soon as the given country page has been retrieved with driver.get, i.e. once the condition is satisfied. That way you set the country you want, and there is no need to keep iterating and re-checking the if condition, which is what results in the StaleElementReferenceException.
If your stale element is the a tag and not the div, you can iterate over the number of a tags and re-fetch each element through the div:
for i in range(len(div.find_elements_by_tag_name('a'))):
    if div.find_elements_by_tag_name('a')[i].text == text:
        driver1.get(div.find_elements_by_tag_name('a')[i].get_attribute("href"))
That way you get the most recent element from the DOM each time.
If your stale element is the div, then you'll need to verify that the dropdown isn't disappearing after your one.click(), by hovering over it or in some other way.
Another approach would be to change your a.text to have a wait:
wait = WebDriverWait(driver, 10, poll_frequency=1, ignored_exceptions=[StaleElementReferenceException])
a = wait.until(EC.text_to_be_present_in_element((By.YourBy)))

A way to match some fields from different pages using selenium webdriver in python

I am trying to access different pages, insert a name / email into some fields, and then press a button to submit those fields.
Now, I have kind of found a way to match the email / name fields on all pages using webdriver, even if the pages differ in their HTML structure. I am using the following pieces of code:
import logging
from selenium.common.exceptions import ErrorInResponseException, \
    WebDriverException
from selenium.webdriver.common.keys import Keys
from pyvirtualdisplay import Display
from selenium import webdriver
import lxml.html
import urlparse
import time
import re


def subscribe(email, name):
    display = Display(visible=0, size=(800, 600))
    dom = lxml.html.parse('http://muncheye.com')
    url = dom.docinfo.URL
    driver = webdriver.Chrome()
    failed_urls = []
    i = 0
    to_visit_urls = dom.xpath('//div[@id="right-column"]//a/@href')
    print(len(to_visit_urls))
    """
    Visit each url. Check to be alive. Search form.
    """
    for link in to_visit_urls:
        not_found = False
        name_required = True
        email_required = True
        button_required = True
        dom1 = lxml.html.parse(urlparse.urljoin(url, link))
        submit_url = dom1.xpath(
            '//div[@class="product_info"]//table//tr[7]//td[2]//a/@href')[0]
        if re.match('https?://(?:www\.|(?!www))[^\s\.]+\.[^\s]{2,}|www\.['
                    '^\s]+\.[^\s]{2,}', submit_url):
            time.sleep(10)
            try:
                driver.get(submit_url)
                try:
                    name_box = driver.find_element_by_xpath(
                        "//input[@*[contains(translate(., "
                        "'ABCDEFGHIJKLMNOPQRSTUVWXYZ', "
                        "'abcdefghijklmnopqrstuvwxyz'), 'name')]]")
                    name_box.click()
                    name_box.clear()
                    name_box.send_keys(email)
                except Exception:
                    not_found = True
                try:
                    email_box = driver.find_element_by_xpath(
                        "//input[@*[contains(translate(., "
                        "'ABCDEFGHIJKLMNOPQRSTUVWXYZ', "
                        "'abcdefghijklmnopqrstuvwxyz'), 'email')]]")
                    email_box.click()
                    email_box.clear()
                    email_box.send_keys(email)
                except Exception:
                    not_found = True
                if not_found:
                    i += 1
                    print "here" + " = " + str(i) + " link = " + str(submit_url)
                    for element in driver.find_elements_by_xpath(
                            "//input[@type='text']"):
                        if name_required:
                            try:
                                name_box = element.find_element_by_xpath(
                                    ".[@*[contains(translate(., "
                                    "'ABCDEFGHIJKLMNOPQRSTUVWXYZ', "
                                    "'abcdefghijklmnopqrstuvwxyz'), 'name')]]")
                                name_box.click()
                                name_box.clear()
                                name_box.send_keys(name)
                                name_required = False
                                continue
                            except Exception:
                                pass
                        if email_required:
                            try:
                                email_box = element.find_element_by_xpath(
                                    ".[@*[contains(translate(., "
                                    "'ABCDEFGHIJKLMNOPQRSTUVWXYZ', "
                                    "'abcdefghijklmnopqrstuvwxyz'), 'email')]]")
                                email_box.click()
                                email_box.clear()
                                email_box.send_keys(email)
                                email_required = False
                                break
                            except Exception:
                                pass
                        if (not name_required) and (not email_required) and (
                                not button_required):
                            break
                for element1 in driver.find_elements_by_xpath(
                        "//*[@type[translate(., 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', "
                        "'abcdefghijklmnopqrstuvwxyz') = 'submit']]["
                        "preceding::*[@name[translate(., "
                        "'ABCDEFGHIJKLMNOPQRSTUVWXYZ', "
                        "'abcdefghijklmnopqrstuvwxyz') ='email' or translate("
                        "., 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', "
                        "'abcdefghijklmnopqrstuvwxyz') ='name']]]"):
                    if button_required:
                        try:
                            button = element1.find_element_by_xpath(
                                "//*[@type[translate(., "
                                "'ABCDEFGHIJKLMNOPQRSTUVWXYZ', "
                                "'abcdefghijklmnopqrstuvwxyz') = 'submit']]["
                                "preceding::*[@name[translate(., "
                                "'ABCDEFGHIJKLMNOPQRSTUVWXYZ', "
                                "'abcdefghijklmnopqrstuvwxyz') ='email' or "
                                "translate(., 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', "
                                "'abcdefghijklmnopqrstuvwxyz') "
                                "='name']]]").click()
                            element1.click()
                            element1.send_keys(Keys.ENTER)
                            element1 = False
                            continue
                        except Exception:
                            try:
                                element1.find_element_by_xpath(
                                    "//*[@name='email' or "
                                    "@name='name']//following::*["
                                    "@type='submit']/a").click()
                                element1.click()
                                element1.send_keys(Keys.ENTER)
                                button_required = False
                            except Exception:
                                pass
            except WebDriverException:
                logging.exception('Chrome crashed')
                driver.close()
                driver = webdriver.Chrome()
                to_visit_urls.append(link)
            except Exception as e:
                logging.exception("Fail here:{0}".format(submit_url))
                failed_urls.append(submit_url)
                pass  # this 'pass' is here because when the script passed
                      # from link 33, it gives me fail on all of them
            time.sleep(5)
            print button_required
    return failed_urls


print subscribe('hfbfsdfsdf@freeletter.me', 'hfbfsdfsdf@freeletter.me')
Now, I don't know whether the problem is with the page source or with webdriver / XPath, but when the script tries to submit those fields I don't think the button is actually found on the page, because I only get 5 or 6 emails out of 100 available links.
Now, the question: can anyone give me a better XPath expression that will be able to press the button and fill the name / email fields even if the pages differ from one another?
The more robust way to do this is to look for each pair of elements in some specific way based on each page. Do this until you actually find the elements, and then do stuff with them. I don't know what the HTML looks like, so I can't show you actual code.
