I am trying to make a web scraper that fetches all the results of every Google search, but it keeps outputting a "web element reference not seen before" error. I assume it's due to the code trying to find the element before the URL loads, but I am not sure how to fix it.
from selenium import webdriver

#number of pages
max_page = 5
#number of digits (ie: 2 is 1 digit, 10 is 2 digits)
max_dig = 1
#Open up firefox browser
driver = webdriver.Firefox()
#inputs search into google
question = input("\n What would you like to google today, but replace every space with a '+' (ie: search+this)\n\n")
search = []
#get multiple pages
for i in range(0, max_page + 1):
    #inserts page number into google search
    page_num = (max_dig - len(str(i))) * "0" + str(i)
    #inserts search input and cycles through pages
    url = "https://www.google.com/search?q="+ question +"&ei=LV-uXYrpNoj0rAGC8KSYCg&start="+ page_num +"0&sa=N&ved=0ahUKEwjKs8ie367lAhUIOisKHQI4CaM4ChDy0wMIiQE&biw=1356&bih=946"
    #finds element in every search page
    search += driver.find_elements_by_class_name('LC20lb')
    driver.get(url)

#print results
search_items = len(search)
for a in range(search_items):
    #print the page number
    print(type(search[a].text))
Traceback (most recent call last):
File "screwdriver.py", line 32, in <module>
print(type(search[b].text))
selenium.common.exceptions.NoSuchElementException: Message: Web element reference not seen before: 6187cf00-39c8-c14b-a2de-b1d24e965b65
The problem is that Selenium doesn't keep the HTML you found, but rather a reference to an element on the current page. When you load a new page with get(), the reference tries to find the element on the new page and can't find it. You should get the text (and any other information) from each item before you load the new page.
from selenium import webdriver

max_page = 5
driver = webdriver.Firefox()
question = input("\n What would you like to google today, but replace every space with a '+' (ie: search+this)\n\n")
search = []

for i in range(max_page + 1):
    page_num = str(i)
    url = "https://www.google.com/search?q="+ question +"&ei=LV-uXYrpNoj0rAGC8KSYCg&start="+ page_num +"0&sa=N&ved=0ahUKEwjKs8ie367lAhUIOisKHQI4CaM4ChDy0wMIiQE&biw=1356&bih=946"
    items = driver.find_elements_by_class_name('LC20lb')
    for item in items:
        search.append(item.text)
    driver.get(url)

for item in search:
    print(item)
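If you also want to guard against reading results before the newly loaded page is ready (the other concern in the question), a minimal sketch using WebDriverWait after each get() could look like the following. It drops the extra ei/ved tracking parameters from the URL (which I'm assuming aren't needed) and reuses the 'LC20lb' class from the question; the example value of question is just a placeholder, and Google may change its markup at any time.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

max_page = 5
question = "search+this"   # same '+'-separated format as in the question
driver = webdriver.Firefox()
search = []

for i in range(max_page + 1):
    #load the page first, then wait for the result titles to be present
    driver.get("https://www.google.com/search?q=" + question + "&start=" + str(i) + "0")
    results = WebDriverWait(driver, 10).until(
        EC.presence_of_all_elements_located((By.CLASS_NAME, "LC20lb")))
    #read the text immediately, while the references are still valid
    for r in results:
        search.append(r.text)

for item in search:
    print(item)
driver.quit()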
I need to scrape all the Google reviews. There are 90,564 reviews on my page, but the code I wrote can scrape only the top 9 reviews; the other reviews are not scraped.
The code is given below:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
# specify the url of the business page on Google
url = 'https://www.google.com/maps/place/ISKCON+temple+Bangalore/@13.0098328,77.5510964,15z/data=!4m7!3m6!1s0x0:0x7a7fb24a41a6b2b3!8m2!3d13.0098328!4d77.5510964!9m1!1b1'
# create an instance of the Chrome driver
driver = webdriver.Chrome()
# navigate to the specified url
driver.get(url)
# Wait for the reviews to load
wait = WebDriverWait(driver, 20) # increased the waiting time
review_elements = wait.until(EC.presence_of_all_elements_located((By.CLASS_NAME, 'wiI7pd')))
# extract the text of each review
reviews = [element.text for element in review_elements]
# print the reviews
print(reviews)
# close the browser
driver.quit()
What should I edit/modify in the code to extract all the reviews?
Here is working code for you, to run after launching the URL:
import time
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.action_chains import ActionChains

totalRev = "div div.fontBodySmall"
username = ".d4r55"
reviews = "wiI7pd"

wait = WebDriverWait(driver, 20)
totalRevCount = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, totalRev))).get_attribute("textContent").split(' ')[0].replace(',','').replace('.','')
print("totalRevCount - ", totalRevCount)
wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, totalRev))).click()

mydict = {}
found = 0
while found < int(totalRevCount):
    review_elements = wait.until(EC.presence_of_all_elements_located((By.CLASS_NAME, reviews)))
    reviewer_names = wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, username)))
    found = len(mydict)
    for rev, name in zip(review_elements, reviewer_names):
        mydict[name.text] = rev.text
        if len(rev.text) == 0:
            found = int(totalRevCount) + 1
            break
    for i in range(8):
        ActionChains(driver).key_down(Keys.ARROW_DOWN).perform()
    print("found - ", found)
    print(mydict)
    time.sleep(2)
Explanation -
Get the locators for the user name and the review, since we are going to create a key-value pair from them; this is useful for producing a duplicate-free result.
You first need to get the total number of reviews/ratings present for the given location.
Get the username and review for the "visible" part of the webpage and store them in the dictionary.
Scroll down the page and wait a few seconds.
Get the username and review again and add them to the dictionary; only new ones will be added.
As soon as a review with no text (only a rating) is found, the loop ends and you have your results.
NOTE - If you want all reviews irrespective of whether review text is present or not, you can remove the "if" block.
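For instance, the inner loop without that "if" block would simply be the following, with the while condition left as the only stopping criterion:

for rev, name in zip(review_elements, reviewer_names):
    mydict[name.text] = rev.text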
I think you'll need to scroll down first, and then get all the reviews.
scroll_value = 230
driver.execute_script( 'window.scrollBy( 0, '+str(scroll_value)+ ' )' ) # to scroll by value
# to get the current scroll value on the y axis
scroll_Y = driver.execute_script( 'return window.scrollY' )
That might be because the elements don't get loaded otherwise.
Since there are over 90,000 reviews, you might consider scrolling down a little, getting the reviews, and repeating.
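A minimal sketch of that scroll-collect-repeat idea, continuing from the question's code after driver.get(url). The 'wiI7pd' review class comes from the question; the selector for the scrollable reviews pane is an assumption on my part and needs to be checked against the live page, since Google's class names change often.

import time
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

wait = WebDriverWait(driver, 20)
# hypothetical selector - inspect the page and replace it with the real scrollable reviews container
pane = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "div.m6QErb")))

seen = set()
idle_rounds = 0
while idle_rounds < 5:   # stop after a few scrolls that load nothing new
    before = len(seen)
    for el in driver.find_elements(By.CLASS_NAME, "wiI7pd"):
        seen.add(el.text)   # read the text right away, while the reference is valid
    # scroll the inner pane (not the window) to trigger loading of more reviews
    driver.execute_script("arguments[0].scrollTop = arguments[0].scrollHeight", pane)
    time.sleep(2)
    idle_rounds = idle_rounds + 1 if len(seen) == before else 0

print(len(seen))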
Resource: https://stackoverflow.com/a/74508235/20443541
I am attempting to scrape data across multiple pages (36) of a website to gather the document number and the revision number for each available document and save them to two different lists. If I run the code block below for each individual page, it works perfectly. However, when I added the while loop to loop through all 36 pages, it loops, but only the data from the first page is saved.
#sam.gov website
url = 'https://sam.gov/search/?index=sca&page=1&sort=-modifiedDate&pageSize=25&sfm%5Bstatus%5D%5Bis_active%5D=true&sfm%5BwdPreviouslyPerformedWrapper%5D%5BpreviouslyPeformed%5D=prevPerfNo%2F'
#webdriver
driver = webdriver.Chrome(options = options_, executable_path = r'C:/Users/439528/Python Scripts/Spyder/chromedriver.exe' )
driver.get(url)
#get rid of pop up window
WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, '#sds-dialog-0 > button > usa-icon > i-bs > svg'))).click()
#list of revision numbers
revision_num = []
#empty list for all the WD links
WD_num = []
substring = '2015'
current_page = 0
while True:
    current_page += 1
    if current_page == 36:
        #find all elements on the page named "field name". For each one, get the text. If the text is 'Revision Number',
        #then get the 'sibling' element, which is the actual revision number, and append its text to the revision_num list.
        elements = driver.find_elements_by_class_name('sds-field__name')
        wd_links = driver.find_elements_by_class_name('usa-link')
        for i in elements:
            element = i.text
            if element == 'Revision Number':
                revision_numbers = i.find_elements_by_xpath("./following-sibling::div")
                for x in revision_numbers:
                    a = x.text
                    revision_num.append(a)
        #finding all links that have the partial text 2015 and putting the wd text into the WD_num list
        for link in wd_links:
            wd = link.text
            if substring in wd:
                WD_num.append(wd)
        print('Last Page Complete!')
        break
    else:
        #find all elements on the page named "field name". For each one, get the text. If the text is 'Revision Number',
        #then get the 'sibling' element, which is the actual revision number, and append its text to the revision_num list.
        elements = driver.find_elements_by_class_name('sds-field__name')
        wd_links = driver.find_elements_by_class_name('usa-link')
        for i in elements:
            element = i.text
            if element == 'Revision Number':
                revision_numbers = i.find_elements_by_xpath("./following-sibling::div")
                for x in revision_numbers:
                    a = x.text
                    revision_num.append(a)
        #finding all links that have the partial text 2015 and putting the wd text into the WD_num list
        for link in wd_links:
            wd = link.text
            if substring in wd:
                WD_num.append(wd)
        #click on next page
        click_icon = WebDriverWait(driver, 5, 0.25).until(EC.visibility_of_element_located([By.ID, 'bottomPagination-nextPage']))
        click_icon.click()
        WebDriverWait(driver, 5).until(EC.presence_of_element_located((By.ID, 'main-container')))
Things I've tried:
I added the WebDriverWait calls to slow the script down so the page can load and/or the elements become clickable/locatable.
I declared the empty lists outside the loop so they don't get overwritten on each iteration.
I have edited the while loop multiple times to either count up to 36 (while current_page < 37) or move the counter to the top or bottom of the loop.
Any ideas? TIA.
EDIT: added screenshot of 'field name'
I have refactored your code and made things much simpler.
driver = webdriver.Chrome(options = options_, executable_path = r'C:/Users/439528/Python Scripts/Spyder/chromedriver.exe' )
revision_num = []
WD_num = []

for page in range(1, 37):
    url = 'https://sam.gov/search/?index=sca&page={}&sort=-modifiedDate&pageSize=25&sfm%5Bstatus%5D%5Bis_active%5D=true&sfm%5BwdPreviouslyPerformedWrapper%5D%5BpreviouslyPeformed%5D=prevPerfNo%2F'.format(page)
    driver.get(url)
    if page == 1:
        WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, '#sds-dialog-0 > button > usa-icon > i-bs > svg'))).click()
    wd_link_elements = WebDriverWait(driver, 10).until(EC.visibility_of_all_elements_located((By.XPATH, "//a[contains(@class,'usa-link') and contains(.,'2015')]")))
    revision_elements = WebDriverWait(driver, 10).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@class='sds-field__name' and text()='Revision Number']/following-sibling::div")))
    for revision_element in revision_elements:
        revision_num.append(revision_element.text)
    for wd_link in wd_link_elements:
        WD_num.append(wd_link.text)

print(revision_num)
print(WD_num)
If you know there are only 36 pages to iterate, you can pass the page value directly in the URL.
Wait for the elements to be visible using WebDriverWait.
Construct your XPaths so that they identify the elements uniquely, without needing the if/else.
I am trying to scrape Instagram by hashtag (in this case "dog") using Selenium:
scroll to load images
get the links of the posts for the loaded images
But I realized that most of the links are repeated (see the last 3 lines of the code). I don't know what the problem is. I even tried many libraries for Instagram scraping, but all of them either give errors or don't search by hashtag.
I am trying to scrape Instagram to get image data for my deep learning classifier model.
I would also like to know if there are better methods for Instagram scraping.
import time
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.action_chains import ActionChains as AC

driver = webdriver.Edge("msedgedriver.exe")
driver.get("https://www.instagram.com")
tag = "dog"
numberOfScrolls = 70

### Login Section ###
time.sleep(3)
username_field = driver.find_element_by_xpath('//*[@id="loginForm"]/div/div[1]/div/label/input')
username_field.send_keys("myusername")
password_field = driver.find_element_by_xpath('//*[@id="loginForm"]/div/div[2]/div/label/input')
password_field.send_keys("mypassword")
time.sleep(1)
driver.find_element_by_xpath('//*[@id="loginForm"]/div/div[3]').click()
time.sleep(5)

### Scraping Section ###
link = "https://www.instagram.com/explore/tags/" + tag
driver.get(link)
time.sleep(5)

Links = []
for i in range(numberOfScrolls):
    AC(driver).send_keys(Keys.END).perform()  # scrolls to the bottom of the page
    time.sleep(1)
    for x in range(1, 8):
        try:
            row = driver.find_element_by_xpath(
                '//*[@id="react-root"]/section/main/article/div[2]/div/div[' + str(i) + ']')
            row = row.find_elements_by_tag_name("a")
            for element in row:
                if element.get_attribute("href") is not None:
                    print(element.get_attribute("href"))
                    Links.append(element.get_attribute("href"))
        except:
            continue

print(len(Links))
Links = list(set(Links))
print(len(Links))
I found what my mistake was:
row = driver.find_element_by_xpath('//*[@id="react-root"]/section/main/article/div[2]/div/div[' + str(i) + ']')
Specifically, in the str(i) part it should be x instead of i; that's why most of the links were repeated.
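In other words, the corrected lookup inside the inner loop is:

row = driver.find_element_by_xpath(
    '//*[@id="react-root"]/section/main/article/div[2]/div/div[' + str(x) + ']')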
I've been trying to get all the hrefs on a news site's home page. In the end, I want to create something that gives me the n most-used words from all the news articles. To do that, I figured I needed the hrefs first, so I could then click through them one after another.
With a lot of help from another user of this platform, this is the code I've got right now:
from bs4 import BeautifulSoup
from selenium import webdriver
url = 'https://ad.nl'
# launch firefox with your url above
# note that you could change this to some other webdriver (e.g. Chrome)
driver = webdriver.Chrome()
driver.get(url)
# click the "accept cookies" button
btn = driver.find_element_by_name('action')
btn.click()
# grab the html. It'll wait here until the page is finished loading
html = driver.page_source
# parse the html soup
soup = BeautifulSoup(html.lower(), "html.parser")
articles = soup.findAll("article")
for i in articles:
    article = driver.find_element_by_class_name('ankeiler')
    hrefs = article.find_element_by_css_selector('a').get_attribute('href')
    print(hrefs)
driver.quit()
It gives me the first href, I think, but it won't iterate over the next ones; it just gives me the first href as many times as it has to iterate. Does anyone know how to make it go to the next href instead of being stuck on the first one?
PS. if anyone has some suggestions on how to further do my little project, feel free to share them as I have a lot of things yet to learn about Python and programming in general.
Instead of using beautiful soup, how about this?
articles = driver.find_elements_by_css_selector('article')
for i in articles:
    href = i.find_element_by_css_selector('a').get_attribute('href')
    print(href)
To improve on my previous answer, I have written a full solution to your problem:
from selenium import webdriver

url = 'https://ad.nl'

#Set up selenium driver
driver = webdriver.Chrome()
driver.get(url)

#Click the accept cookies button
btn = driver.find_element_by_name('action')
btn.click()

#Get the links of all articles
article_elements = driver.find_elements_by_xpath('//a[@class="ankeiler__link"]')
links = [link.get_attribute('href') for link in article_elements]

#Create a dictionary for every word in the articles
words = dict()

#Iterate through every article
for link in links:
    #Get the article
    driver.get(link)
    #Get the elements that are the body of the article
    article_elements = driver.find_elements_by_xpath('//*[@class="article__paragraph"]')
    #Initialise an empty string
    article_text = ''
    #Add all the text from the elements to the one string
    for element in article_elements:
        article_text += element.text + " "
    #Convert all characters to lower case
    article_text = article_text.lower()
    #Remove all punctuation other than spaces
    for char in article_text:
        if ord(char) > 122 or ord(char) < 97:
            if ord(char) != 32:
                article_text = article_text.replace(char, "")
    #Split the article into words
    for word in article_text.split(" "):
        #If the word is already in the dictionary, update the count
        if word in words:
            words[word] += 1
        #Otherwise make a new entry
        else:
            words[word] = 1

#Print the final dictionary (very large, so maybe sort for the most occurring words and display the top 10)
#print(words)

#Sort words by most used
most_used = sorted(words.items(), key=lambda x: x[1], reverse=True)

#Print the top 10 most used words
print("TOP 10 MOST USED: ")
for i in range(10):
    print(most_used[i])

driver.quit()
Works fine for me, let me know if you get any errors.
To get all hrefs in an article you can do:
hrefs = article.find_elements_by_xpath('.//a')
#OR: article.find_elements_by_css_selector('a')
for href in hrefs:
    print(href.get_attribute('href'))
To progress with the project though, maybe the below would help:
hrefs = article.find_elements_by_xpath('.//a')
links = [href.get_attribute("href") for href in hrefs]
for link in links:
    driver.get(link)
    #Add all words in the article to a dictionary, with the key being the word and
    #the value being the number of times it occurs
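A minimal sketch of that counting step, using collections.Counter; it assumes the links list and driver from the snippet above, and reuses the 'article__paragraph' class from the other answer for the article body (that class name may differ per site):

from collections import Counter

word_counts = Counter()
for link in links:
    driver.get(link)
    #collect the article body text from its paragraph elements
    paragraphs = driver.find_elements_by_xpath('//*[@class="article__paragraph"]')
    text = " ".join(p.text for p in paragraphs).lower()
    #keep only letters and spaces, then count the words
    cleaned = "".join(c if c.isalpha() or c.isspace() else " " for c in text)
    word_counts.update(cleaned.split())

#the n most used words, e.g. the top 10
print(word_counts.most_common(10))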
I'm writing a crawler using Selenium, Python and PhantomJS to use Google's reverse image search. So far I've successfully been able to upload an image and crawl the search results on the first page. However, when I try to click through the search results navigation, I get a StaleElementReferenceError. I have read about it in many posts, but I still could not implement the solution. Here is the code that breaks:
ele7 = browser.find_element_by_id("nav")
ele5 = ele7.find_elements_by_class_name("fl")
count = 0
for elem in ele5:
    if count <= 2:
        print str(elem.get_attribute("href"))
        elem.click()
        browser.implicitly_wait(20)
        ele6 = browser.find_elements_by_class_name("rc")
        for result in ele6:
            f = result.find_elements_by_class_name("r")
            for line in f:
                link = line.find_elements_by_tag_name("a")[0].get_attribute("href")
                links.append(link)
                parsed_uri = urlparse(link)
                domains.append('{uri.scheme}://{uri.netloc}/'.format(uri=parsed_uri))
    count += 1
The code breaks at print str(elem.get_attribute("href")). How can I solve this?
Thanks in advance.
Clicking a link will cause the browser to go to another page and make the references to elements on the old page (ele5, elem) invalid.
Modify the code so that it does not reference invalid elements.
For example, you can get urls before you visit other pages:
ele7 = browser.find_element_by_id("nav")
ele5 = ele7.find_elements_by_class_name("fl")
urls = [elem.get_attribute('href') for elem in ele5]  # <-----
browser.implicitly_wait(20)
for url in urls[:2]:  # <------
    print url
    browser.get(url)  # <------ used `browser.get` instead of `click`;
                      #         using `element.click` will cause the error.
    ele6 = browser.find_elements_by_class_name("rc")
    for result in ele6:
        f = result.find_elements_by_class_name("r")
        for line in f:
            link = line.find_elements_by_tag_name("a")[0].get_attribute("href")
            links.append(link)
            parsed_uri = urlparse(link)
            domains.append('{uri.scheme}://{uri.netloc}/'.format(uri=parsed_uri))