I am trying to send several items from a CSV file to a web form using Python so I don't have to type it all in by hand, especially when I update the sheet later. I tried using the answer to this question; the page comes up and seems to "submit", but I'm told the import failed.
My Code
from bs4 import BeautifulSoup
from requests import get
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
import pandas as pd
# using Pandas to read the csv file
source_information = pd.read_csv('C:/chrome_driver/test_csv.csv', header=None, skiprows=[0])
print(source_information)
# setting the URL for BeautifulSoup to operate in
url = "https://www.roboform.com/filling-test-all-fields"
my_web_form = get(url).content
soup = BeautifulSoup(my_web_form, 'html.parser')
# creating a procedure to fill the form
def fulfill_form(first, email):
    # Setting parameters for selenium to work
    path = r'C:/chrome_driver/chromedriver.exe'
    options = webdriver.ChromeOptions()
    options.add_argument("--start-maximized")
    driver = webdriver.Chrome(path, options=options)
    driver.get(url)
    # use Chrome Dev Tools to find the names or IDs for the fields in the form
    input_first = driver.find_element_by_name('02frstname')
    input_email = driver.find_element_by_name('24emailadr')
    submit = driver.find_element_by_name('Reset')
    # input the values and hold a bit for the next action
    input_first.send_keys(first)
    time.sleep(1)
    input_email.send_keys(email)
    time.sleep(5)
    submit.click()
    time.sleep(7)
# creating a list to hold any entries should they result in error
failed_attempts = []
# creating a loop to do the procedure and append failed cases to the list
for customer in source_information:
    try:
        fulfill_form(str(source_information[0]), str(source_information[1]))
    except:
        failed_attempts.append(source_information[0])
        pass
if len(failed_attempts) > 0:
    print("{} cases have failed".format(len(failed_attempts)))
print("Procedure concluded")
This tells me that "2 cases have failed"
I checked the output of my "source_information" and it shows the following
        0                 1
0   Corey    corey@test.com
1  Breana  breana@hello.org
Where am I going wrong?
Maybe:
submit = driver.find_element_by_name('Reset')
Should be...
submit = driver.find_element_by_xpath("//input[@type='reset' and @value='Reset']")
Based on the page source (the button doesn't have a name attribute)...
<input type="reset" value="Reset">
...and note the type reset vs the value Reset.
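An equivalent CSS selector should also work if you prefer it; a hedged alternative, assuming that's the only reset input on the page:
submit = driver.find_element_by_css_selector("input[type='reset'][value='Reset']")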
Then you have source_information as a DataFrame, so you probably want to change...
# creating a loop to do the procedure and append failed cases to the list
for customer in source_information:
    try:
        fulfill_form(str(source_information[0]), str(source_information[1]))
    except:
        failed_attempts.append(source_information[0])
        pass
To something like...
# creating a loop to do the procedure and append failed cases to the list
for customer in source_information.iterrows():
    try:
        fulfill_form(customer[1][0], customer[1][1])
    except:
        failed_attempts.append(customer[1][0])
        pass
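As a side note, pandas' itertuples() is a slightly cleaner alternative to iterrows() here; a rough equivalent under the same assumption that column 0 is the name and column 1 is the email:
for customer in source_information.itertuples(index=False):
    try:
        fulfill_form(str(customer[0]), str(customer[1]))
    except:
        failed_attempts.append(customer[0])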
I'd also suggest changing all your time.sleep(5) and time.sleep(7) to 1 or 2 so it runs a little quicker.
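If you want to go further than shortening the sleeps, an explicit wait inside fulfill_form avoids fixed delays altogether; a minimal sketch (not tested against this page, reusing the field names and the reset-button selector from above):
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

wait = WebDriverWait(driver, 10)
# wait for each field to be present instead of sleeping a fixed amount
input_first = wait.until(EC.presence_of_element_located((By.NAME, '02frstname')))
input_first.send_keys(first)
input_email = wait.until(EC.presence_of_element_located((By.NAME, '24emailadr')))
input_email.send_keys(email)
submit = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "input[type='reset']")))
submit.click()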
Obviously this is all from looking at the code without running your data and seeing what happens.
Additional:
I reread the question and you do have example test data from the failures. Running this with the changes shown above works.
I am using this code to scrape emails from Google search results. However, it only scrapes the first 10 results, despite having 100 search results loaded.
Ideally, I would like for it to scrape all search results.
Is there a reason for this?
from selenium import webdriver
import time
import re
import pandas as pd
PATH = 'C:\Program Files (x86)\chromedriver.exe'
l=list()
o={}
target_url = "https://www.google.com/search?q=solicitors+wales+%27email%27+%40&rlz=1C1CHBD_en-GBIT1013IT1013&sxsrf=AJOqlzWC1oRbVtWcmcIgC4-3ZnGkQ8sP_A%3A1675764565222&ei=VSPiY6WeDYyXrwStyaTwAQ&ved=0ahUKEwjlnIy9lYP9AhWMy4sKHa0kCR4Q4dUDCA8&uact=5&oq=solicitors+wales+%27email%27+%40&gs_lcp=Cgxnd3Mtd2l6LXNlcnAQAzIFCAAQogQyBwgAEB4QogQyBQgAEKIESgQIQRgASgQIRhgAUABYAGD4AmgAcAF4AIABc4gBc5IBAzAuMZgBAKABAcABAQ&sclient=gws-wiz-serp"
driver=webdriver.Chrome(PATH)
driver.get(target_url)
email_pattern = r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,4}"
html = driver.page_source
emails = re.findall(email_pattern, html)
time.sleep(10)
df = pd.DataFrame(emails, columns=['Email Addresses'])
df.to_excel('email_addresses_.xlsx',index=False)
# print(emails)
driver.close()
The code is working as expected: it scrapes 10 results, which is the default for a Google search. You can use methods like find_element_by_xpath to find the next button and click it.
This operation needs to be repeated in a loop until enough results have been collected. Refer to this for more details: selenium locating elements
For how to use the Selenium commands, you can probably look them up on the web. I found one similar question which can provide some reference.
Following up on Bijendra's answer,
you could update the code as below:
from selenium import webdriver
from selenium.webdriver.common.by import By
import time
import re
import pandas as pd
PATH = 'C:\Program Files (x86)\chromedriver.exe'
l=list()
o={}
target_url = "https://www.google.com/search?q=solicitors+wales+%27email%27+%40&rlz=1C1CHBD_en-GBIT1013IT1013&sxsrf=AJOqlzWC1oRbVtWcmcIgC4-3ZnGkQ8sP_A%3A1675764565222&ei=VSPiY6WeDYyXrwStyaTwAQ&ved=0ahUKEwjlnIy9lYP9AhWMy4sKHa0kCR4Q4dUDCA8&uact=5&oq=solicitors+wales+%27email%27+%40&gs_lcp=Cgxnd3Mtd2l6LXNlcnAQAzIFCAAQogQyBwgAEB4QogQyBQgAEKIESgQIQRgASgQIRhgAUABYAGD4AmgAcAF4AIABc4gBc5IBAzAuMZgBAKABAcABAQ&sclient=gws-wiz-serp"
driver=webdriver.Chrome(PATH)
driver.get(target_url)
emails = []
email_pattern = r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,4}"
for i in range(2):
    html = driver.page_source
    for e in re.findall(email_pattern, html):
        emails.append(e)
    a_attr = driver.find_element(By.ID,"pnnext")
    a_attr.click()
    time.sleep(2)
df = pd.DataFrame(emails, columns=['Email Addresses'])
df.to_csv('email_addresses_.csv',index=False)
driver.close()
You could either change the range value passed to the for loop, or replace the for loop entirely with a while loop, so instead of
for i in range(2):
You could do:
while len(emails) < 100:
Make sure to manage the timing of the page navigation: wait for the next page to load before extracting the available emails and then clicking the next button on the search results page.
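For example, a rough sketch of that while-loop variant with explicit waits, assuming the results page keeps the pnnext id for its next button and reusing the variables from the code above:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

wait = WebDriverWait(driver, 10)
while len(emails) < 100:
    for e in re.findall(email_pattern, driver.page_source):
        if e not in emails:
            emails.append(e)  # skip duplicates across pages
    try:
        next_btn = wait.until(EC.element_to_be_clickable((By.ID, "pnnext")))
    except Exception:
        break  # no further result pages
    next_btn.click()
    wait.until(EC.staleness_of(next_btn))  # let the next page replace the old one before re-reading page_source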
Make sure to refer to the docs to get a clear idea of how to achieve what you want. Happy hacking!
Selenium loads its own clean browser profile, so your Google setting for 100 results per page has to be set in the code; the default is 10 results, which is what you're getting. You will have better luck using query parameters and adding the one for the number of results to the end of your URL.
If you need further information on query parameters to achieve this, it's the second method described at the link below:
tldevtech.com/how-to-show-100-results-per-page-in-google-search
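For instance, assuming Google still honours the num query parameter, the target URL could be built like this (query shortened for readability):
target_url = "https://www.google.com/search?q=solicitors+wales+%27email%27+%40&num=100"
driver.get(target_url)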
I'm trying to scrape data from a paginated table. The table can only be accessed by logging in to a user account. I've decided to approach this using Selenium to log in. I then hope to be able to read the table into a pandas DataFrame, using BeautifulSoup as a go-between.
Here is my code:
from selenium import webdriver
import time
import pandas as pd
from bs4 import BeautifulSoup
url = "https://www.example.com/userarea"
driver = webdriver.Chrome()
time.sleep(6)
driver.get(url)
time.sleep(6)
username = driver.find_element_by_id("user")
username.clear()
username.send_keys("xyz@email.com")
password = driver.find_element_by_id("password")
password.clear()
password.send_keys('password')
driver.find_element_by_xpath('//button[]').click()
driver.find_element_by_xpath('//button[text()="Log in"]').click()
time.sleep(6)
driver.find_element_by_xpath('//span[text()="Text"]').click()
driver.find_element_by_xpath('//span[text()="Text"]').click()
html = driver.page_source
soup = BeautifulSoup(html,'html.parser')
try:
    tables = soup.find_all('th')
    print(tables) #Returns an empty list
    df = pd.read_html(str(tables))
    df.head()
except:
    driver.close()
driver.close()
Unfortunately, this is only printing an empty list. I've tried using lxml too but no joy.
Using the inspection tools it does seem that there aren't any table tags, so I tried to find all <th> tags instead (which definitely are present). Again no joy. I've not yet tried to work through the individual pages. I only mention the pagination in case it offers a clue to the issue.
Any idea what I'm missing?
Thank you to those who offered suggestions. In the end furas' suggestion was the right one: the script was running too quickly. I paused Python for 6 seconds after clicking through to the page with the table on it. The table seems to be populated by JavaScript, and I can actually see the values pop into place now as the script works through the pagination.
import time
#Navigate to page, then let it load using:
time.sleep(6)
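Once the values have popped into place, the table can be rebuilt from the th and td cells; a rough sketch along the lines of the original BeautifulSoup approach, assuming the rendered markup uses ordinary tr/th/td tags:
from bs4 import BeautifulSoup
import pandas as pd

soup = BeautifulSoup(driver.page_source, 'html.parser')
headers = [th.get_text(strip=True) for th in soup.find_all('th')]
rows = []
for tr in soup.find_all('tr'):
    cells = [td.get_text(strip=True) for td in tr.find_all('td')]
    if cells:
        rows.append(cells)
df = pd.DataFrame(rows)
# only attach the headers if they line up with the number of columns found
if headers and len(headers) == df.shape[1]:
    df.columns = headers
print(df.head())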
Would appreciate some help with Selenium.
I'm trying to fill in a Google Form for several entries: I need to input the first row of a df, then click "Submit", open a new form, and run again for the second row, and so on to the n-th row.
I got stuck with a NoSuchFrameException: Unable to locate frame with index error after the first entry. I read in the Selenium docs that one can locate a window's frames in the console, but it gives me nothing (F12 --> find frame (any combination tried) --> no matches). There seems to be no such thing in the Google Form (or my search is just wrong).
I haven't found anything else on the issue, so I tried frame(0) - no luck.
Any tips would be appreciated. The whole code is below
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary
import time
import pandas as pd
options = Options()
options.binary_location = FirefoxBinary(r"C:\Program Files\Mozilla Firefox\firefox.exe")
driver = webdriver.Firefox(executable_path=r'C:\WebDriver\bin\geckodriver.exe', firefox_options=options)
driver.implicitly_wait(10)
reg = pd.read_csv(r'C:\Users\User\Desktop\Form.csv', header=0, delimiter=';', sep=r'\s*;\s*')
reg_2 = reg.values.tolist()
driver.get('https://docs.google.com/forms/d/e/1FAIpQLSd9FQ33H5SMHelf9O1jjHl7FtLTtaTdFuC4dUFv-educaFiJA/viewform?vc=0&c=0&w=1&flr=0&gxids=7628')
try:
    for row in reg_2:
        element_count = 0
        for element in range(len(row)):
            first = driver.find_element_by_xpath("/html/body/div/div[2]/form/div[2]/div/div[2]/div[2]/div/div/div[2]/div/div[1]/div/div[1]/input")
            last = driver.find_element_by_xpath("/html/body/div/div[2]/form/div[2]/div/div[2]/div[1]/div/div/div[2]/div/div[1]/div/div[1]/input")
            mail = driver.find_element_by_xpath("/html/body/div/div[2]/form/div[2]/div/div[2]/div[3]/div/div/div[2]/div/div[1]/div/div[1]/input")
            last.send_keys(row[0])
            first.send_keys(row[1])
            mail.send_keys(row[2])
            submit = driver.find_element_by_xpath('//*[@id="mG61Hd"]/div[2]/div/div[3]/div[1]/div/div/span/span')
            submit.click()
            time.sleep(3)
            element_count +=1
            driver.switch_to.frame(0)
            driver.switch_to.default_content()
finally:
    driver.quit()
driver.switch_to.frame(0)
driver.switch_to.default_content()
Remove these two lines of code - the Google Form doesn't have any iframe in it.
If you want to submit again, click the "Submit another response" link:
driver.find_element_by_xpath('//a[contains(text(),"Submit another response")]').click()
Well, in fact, after deleting the old form and starting anew, the thing worked in the end. I had to change several other elements, and also added password and password_confirm fields:
while len(reg_2) > element_count:
    try:
        for row in reg_2:
            first = driver.find_element_by_xpath("/html/body/div/div[2]/form/div[2]/div/div[2]/div[2]/div/div/div[2]/div/div[1]/div/div[1]/input")
            last = driver.find_element_by_xpath("/html/body/div/div[2]/form/div[2]/div/div[2]/div[1]/div/div/div[2]/div/div[1]/div/div[1]/input")
            mail = driver.find_element_by_xpath("/html/body/div/div[2]/form/div[2]/div/div[2]/div[3]/div/div/div[2]/div/div[1]/div/div[1]/input")
            password = driver.find_element_by_xpath('/html/body/div/div[2]/form/div[2]/div/div[2]/div[4]/div/div/div[2]/div/div[1]/div/div[1]/input')
            password_confirm = driver.find_element_by_xpath('/html/body/div/div[2]/form/div[2]/div/div[2]/div[5]/div/div/div[2]/div/div[1]/div/div[1]/input')
            last.send_keys(row[0])
            first.send_keys(row[1])
            mail.send_keys(row[2])
            password.send_keys(row[3])
            password_confirm.send_keys(row[3])
            submit = driver.find_element_by_xpath('//*[@id="mG61Hd"]/div[2]/div/div[3]/div[1]/div/div/span/span')
            submit.click()
            time.sleep(3)
            element_count +=1
            #driver.find_element_by_xpath('/html/body/div[1]/div[2]/div[1]/div/div[4]/a').click()
            driver.find_element_by_css_selector('.freebirdFormviewerViewResponseLinksContainer > a:nth-child(1)').click()
    finally:
        driver.quit()
I've been following along this guide to web scraping LinkedIn and google searches. There have been some changes in the HTML of google's search results since the guide was created so I've had to tinker with the code a bit. I'm at the point where I need to grab the links from the search results but have run into an issue where the program doesn't return anything even after implementing a code fix from this post due to an error. I'm not sure what I'm doing wrong here.
import Parameters
from time import sleep
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from parsel import Selector
import csv
# defining new variable passing two parameters
writer = csv.writer(open(Parameters.file_name, 'w'))
# writerow() method to the write to the file object
writer.writerow(['Name', 'Job Title', 'Company', 'College', 'Location', 'URL'])
# specifies the path to the chromedriver.exe
driver = webdriver.Chrome('/Users/.../Python Scripts/chromedriver')
driver.get('https://www.linkedin.com')
sleep(0.5)
# locate email form by_class_name then send_keys() to simulate key strokes
username = driver.find_element_by_id('session_key')
username.send_keys(Parameters.linkedin_username)
sleep(0.5)
password = driver.find_element_by_id('session_password')
password.send_keys(Parameters.linkedin_password)
sleep(0.5)
sign_in_button = driver.find_element_by_class_name('sign-in-form__submit-button')
sign_in_button.click()
sleep(3)
driver.get('https://www.google.com')
sleep(3)
search_query = driver.find_element_by_name('q')
search_query.send_keys(Parameters.search_query)
sleep(0.5)
search_query.send_keys(Keys.RETURN)
sleep(3)
################# HERE IS WHERE THE ISSUE LIES ######################
#linkedin_urls = driver.find_elements_by_class_name('iUh30')
linkedin_urls = driver.find_elements_by_css_selector("yuRUbf > a")
for url_prep in linkedin_urls:
    url_prep.get_attribute('href')
#linkedin_urls = [url.text for url in linkedin_urls]
sleep(0.5)
print('Supposed to be URLs')
print(linkedin_urls)
The search parameter is
search_query = 'site:linkedin.com/in/ AND "python developer" AND "London"'
Results in an empty list:
Snippet of the HTML section I want to grab:
EDIT: This is the output if I go by .find_elements_by_class_name or by Sector97's 1st edits.
Found an alternative solution that might make it a bit easier to achieve what you're after. Credit to A.Pond at
https://stackoverflow.com/a/62050505
Use the google search api to get the links from the results.
You may need to install the library first
pip install google
You can then use the api to quickly extract an arbitrary number of links:
from googlesearch import search
links = []
query = 'site:linkedin.com/in AND "python developer" AND "London"'
for j in search(query, tld = 'com', start = 0, stop = 100, pause = 4):
    links.append(j)
I got the first 100 results but you can play around with the parameters to get more or less as you need.
You can see more about this api here:
https://www.geeksforgeeks.org/performing-google-search-using-python-code/
I think I found the error in your code.
Instead of using
linkedin_urls = driver.find_elements_by_css_selector("yuRUbf > a")
Try this instead:
web_elements = driver.find_elements_by_class_name("yuRUbf")
That gets you the parent elements. You can then extract the url text using a simple list comprehension:
linkedin_urls = [elem.find_element_by_css_selector('a').get_attribute('href') for elem in web_elements]
I'm using selenium and BeautifulSoup to scrape data from a website (http://www.grownjkids.gov/ParentsFamilies/ProviderSearch) with a next button, which I'm clicking in a loop. I was struggling with StaleElementReferenceException previously but overcame this by looping to refind the element on the page. However, I ran into a new problem - it's able to click all the way to the end now. But when I check the csv file it's written to, even though the majority of the data looks good, there's often duplicate rows in batches of 5 (which is the number of results that each page shows).
Pictoral example of what I mean: https://www.dropbox.com/s/ecsew52a25ihym7/Screen%20Shot%202019-02-13%20at%2011.06.41%20AM.png?dl=0
I have a hunch this is due to my program re-extracting the current data on the page every time I attempt to find the next button. I was confused why this would happen, since from my understanding, the actual scraping part happens only after you break out of the inner while loop which attempts to find the next button and into the larger one. (Let me know if I'm not understanding this correctly as I'm comparatively new to this stuff.)
Additionally, the data I output after every run of my program is different (which makes sense considering the error, since in the past the StaleElementReferenceExceptions were occurring at sporadic locations; if it duplicates results every time this exception occurs, it would make sense for duplications to occur sporadically as well). Even worse, a different batch of results ends up being skipped each time I run the program - I cross-compared results from 2 different outputs and there were some results that were present in one and not the other.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.common.exceptions import NoSuchElementException, StaleElementReferenceException
from bs4 import BeautifulSoup
import csv
chrome_options = Options()
chrome_options.add_argument('--disable-gpu')
chrome_options.add_argument("--headless")
url = "http://www.grownjkids.gov/ParentsFamilies/ProviderSearch"
driver = webdriver.Chrome('###location###')
driver.implicitly_wait(10)
driver.get(url)
#clears text box
driver.find_element_by_class_name("form-control").clear()
#clicks on search button without putting in any parameters, getting all the results
search_button = driver.find_element_by_id("searchButton")
search_button.click()
df_list = []
headers = ["Rating", "Distance", "Program Type", "County", "License", "Program Name", "Address", "Phone", "Latitude", "Longitude"]
while True:
    #keeps on clicking next button to fetch each group of 5 results
    try:
        nextButton = driver.find_element_by_class_name("next")
        nextButton.send_keys('\n')
    except NoSuchElementException:
        break
    except StaleElementReferenceException:
        attempts = 0
        while (attempts < 100):
            try:
                nextButton = driver.find_element_by_class_name("next")
                if nextButton:
                    nextButton.send_keys('\n')
                break
            except NoSuchElementException:
                break
            except StaleElementReferenceException:
                attempts += 1
    #finds table of center data on the page
    table = driver.find_element_by_id("results")
    html_source = table.get_attribute('innerHTML')
    soup = BeautifulSoup(html_source, "lxml")
    #iterates through centers, extracting the data
    for center in soup.find_all("div", {"class": "col-sm-7 fields"}):
        mini_list = []
        #all fields except latlong
        for row in center.find_all("div", {"class": "field"}):
            material = row.find("div", {"class": "value"})
            if material is not None:
                mini_list.append(material.getText().encode("utf8").strip())
        #parses latlong from link
        for link in center.find_all('a', href = True):
            content = link['href']
            latlong = content[34:-1].split(',')
            mini_list.append(latlong[0])
            mini_list.append(latlong[1])
        df_list.append(mini_list)
#writes content into csv
with open ('output_file.csv', "wb") as f:
    writer = csv.writer(f)
    writer.writerow(headers)
    writer.writerows(row for row in df_list if row)
Anything would help! If there's other recommendations you have about the way I've used selenium/BeautifulSoup/python in order to improve my programming for the future, I would appreciate it.
Thanks so much!
I would use Selenium to grab the results count, then do an API call to get the actual results. If the result count is greater than the limit for the pageSize argument of the API query string, you can loop in batches and increment the currentPage argument until you have reached the total count; or, as I do below, simply request all results in one go. Then extract what you want from the JSON.
import requests
import json
from bs4 import BeautifulSoup as bs
from selenium import webdriver
initUrl = 'http://www.grownjkids.gov/ParentsFamilies/ProviderSearch'
driver = webdriver.Chrome()
driver.get(initUrl)
numResults = driver.find_element_by_css_selector('#totalCount').text
driver.quit()
newURL = 'http://www.grownjkids.gov/Services/GetProviders?latitude=40.2171&longitude=-74.7429&distance=10&county=&toddlers=false&preschool=false&infants=false&rating=&programTypes=&pageSize=' + numResults + '&currentPage=0'
data = requests.get(newURL).json()
You have a collection of dictionaries to iterate over in the response:
An example of writing out some values:
if len(data) > 0:
    for item in data:
        print(item['Name'], '\n', item['Address'])
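If you want the results in a file, a minimal sketch with pandas; the 'Name' and 'Address' keys are just the ones shown above, so inspect the JSON for the full set:
import pandas as pd

df = pd.DataFrame(data)  # one row per provider dictionary in the JSON response
df[['Name', 'Address']].to_csv('providers.csv', index=False)  # column names assumed from the example above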
If you are worried about lat and long values you can grab them from one of the script tags when using selenium:
The alternate URL I use for the XHR jQuery GET can be found by using dev tools (F12) on the page, refreshing the page with F5, and inspecting the jQuery requests made in the Network tab:
You should read the HTML contents inside every iteration of the while loop. Example below:
while counter < page_number_limit:
    counter = counter + 1
    new_data = driver.page_source
    page_contents = BeautifulSoup(new_data, 'lxml')