I'm writing a crawler with Selenium, Python and PhantomJS that uses Google's reverse image search. So far I've been able to upload an image and crawl the search results on the first page. However, when I try to click through the search results navigation, I get a StaleElementReferenceException. I have read about it in many posts, but I still haven't been able to implement the solution. Here is the code that breaks:
ele7 = browser.find_element_by_id("nav")
ele5 = ele7.find_elements_by_class_name("fl")
count = 0
for elem in ele5:
    if count <= 2:
        print str(elem.get_attribute("href"))
        elem.click()
        browser.implicitly_wait(20)
        ele6 = browser.find_elements_by_class_name("rc")
        for result in ele6:
            f = result.find_elements_by_class_name("r")
            for line in f:
                link = line.find_elements_by_tag_name("a")[0].get_attribute("href")
                links.append(link)
                parsed_uri = urlparse(link)
                domains.append('{uri.scheme}://{uri.netloc}/'.format(uri=parsed_uri))
    count += 1
The code breaks at print str(elem.get_attribute("href")). How can I solve this?
Thanks in advance.
Clicking a link causes the browser to navigate to another page, which makes the references to elements on the old page (ele5, elem) invalid.
Modify the code so that it does not reference invalid elements.
For example, you can collect the URLs before you visit the other pages:
ele7 = browser.find_element_by_id("nav")
ele5 = ele7.find_elements_by_class_name("fl")
urls = [elem.get_attribute('href') for elem in ele5]  # <-----
browser.implicitly_wait(20)

for url in urls[:2]:  # <------
    print url
    browser.get(url)  # <------ use `browser.get` instead of `click`;
                      #         using `element.click` would cause the error.
    ele6 = browser.find_elements_by_class_name("rc")
    for result in ele6:
        f = result.find_elements_by_class_name("r")
        for line in f:
            link = line.find_elements_by_tag_name("a")[0].get_attribute("href")
            links.append(link)
            parsed_uri = urlparse(link)
            domains.append('{uri.scheme}://{uri.netloc}/'.format(uri=parsed_uri))
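An alternative, if you do want to keep clicking, is to re-locate the navigation links on each iteration instead of holding on to references from the previous page. A minimal sketch, assuming the same "nav"/"fl" structure, the existing browser, and the links list (note it clicks the i-th pagination link on whatever page is currently shown):

# Sketch: re-find the pagination links after every navigation so each
# reference is fresh (assumes the same "nav"/"fl" structure as above).
for i in range(3):
    nav = browser.find_element_by_id("nav")
    page_links = nav.find_elements_by_class_name("fl")
    if i >= len(page_links):
        break
    page_links[i].click()  # freshly located element, so it is not stale
    browser.implicitly_wait(20)
    for result in browser.find_elements_by_class_name("rc"):
        for line in result.find_elements_by_class_name("r"):
            links.append(line.find_elements_by_tag_name("a")[0].get_attribute("href"))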
I want to make a recommendation system for webtoons, so I am collecting webtoon data. I have written code to scrape the URLs of the toons on the Kakao Webtoon page.
def extract_from_page(page_link):
    links = []
    driver = webdriver.Chrome()
    driver.get(page_link)
    elems = driver.find_elements_by_css_selector(".h-full.relative")
    for elem in elems:
        link = elem.get_attribute('href')
        if link:
            links.append({'id': int(link.split('/')[-1]), 'link': link})
    print(len(links))
    return links
This code works on the weekly pages (https://webtoon.kakao.com/original-webtoon, https://webtoon.kakao.com/original-novel).
However, on the page that shows finished toons (https://webtoon.kakao.com/original-webtoon?tab=complete), it only picks up 13 URLs, for the 13 webtoons at the top of the page.
I found a similar post (web scraping gives only first 4 elements on a page) and added scrolling, but nothing changed.
I would appreciate it if you could tell me the cause and solution.
Try something like below.
driver.get("https://webtoon.kakao.com/original-webtoon?tab=complete")
wait = WebDriverWait(driver,30)
j = 1
for i in range(5):
# Wait for the elements to load/appear
wait.until(EC.presence_of_all_elements_located((By.XPATH, "//a[contains(#href,'content')]")))
# Get all the elements which contains href value
links = driver.find_elements(By.XPATH,"//a[contains(#href,'content')]")
# Iterate to print the links
for link in links:
print(f"{j} : {link.get_attribute('href')}")
j += 1
# Scroll to the last element of the list links
driver.execute_script("arguments[0].scrollIntoView(true);",links[len(links)-1])
Output:
1 : https://webtoon.kakao.com/content/%EB%B0%A4%EC%9D%98-%ED%96%A5/1532
2 : https://webtoon.kakao.com/content/%EB%B8%8C%EB%A0%88%EC%9D%B4%EC%BB%A42/596
3 : https://webtoon.kakao.com/content/%ED%86%A0%EC%9D%B4-%EC%BD%A4%ED%94%8C%EB%A0%89%EC%8A%A4/1683
...
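Note that each pass re-finds links that were already printed, so the output contains duplicates. If you only want each URL once, here is a small sketch of the same loop collecting into a set (assuming the same driver and wait setup as above):

# Sketch: collect unique hrefs across the scroll passes instead of
# printing duplicates (assumes the driver/wait setup shown above).
seen = set()
for i in range(5):
    wait.until(EC.presence_of_all_elements_located((By.XPATH, "//a[contains(@href,'content')]")))
    links = driver.find_elements(By.XPATH, "//a[contains(@href,'content')]")
    for link in links:
        seen.add(link.get_attribute('href'))
    driver.execute_script("arguments[0].scrollIntoView(true);", links[-1])
print(f"unique links: {len(seen)}")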
I am trying to scrape Instagram posts on an account, but whenever I tell it to scroll down, the previous links disappear and new ones show up, never all in the same position, and it only ever captures 29 out of 1100 posts.
while(count<10):
    for i in range(1,2):
        #.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        self.browser.execute_script('window.scrollTo(0,document.body.scrollHeight)')
        print('.', end="", flush=True)
        time.sleep(2)
    elements = self.browser.find_elements_by_xpath("//div[@class='v1Nh3 kIKUG _bz0w']")
    hrefElements = self.browser.find_elements_by_xpath("//div[@class='v1Nh3 kIKUG _bz0w']/a")
    elements_link = [x.get_attribute("href") for x in hrefElements]
    i = 1
    unique = 1
    text_file = open("Passed.txt", "r")
    lines = text_file.readlines()
    text_file.close()
    for elements in elements_link:
        print(str(i)+'.', end="", flush=True)
        found = self.found(elements, lines)
        if found==True:
            pass
        else:
            with open('Passed.txt','a') as f:
                f.write(elements+'\n')
            unique+=1
        i+=1
    count+=1
print('-----------------------------------------------')
print('No. of unique Posts Captured : '+ str(unique))
print('-----------------------------------------------')
This is my code for loading the posts, capturing the links from them, and saving them to another file so I won't have to rerun it every time.
The found function:
def found(self, key, lines):
    for i in lines:
        if i == key + '\n':
            return True
    return False
I am trying to capture 1100 posts
You should first find the links and save them, then scroll the page down and get the links that show up after scrolling. That way you also save the links that disappear as the page scrolls. Here is an example:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as ec

wait = WebDriverWait(self.browser, 10)
links = []
number_of_posts = 1100
while True:
    hrefElements = wait.until(ec.visibility_of_all_elements_located((By.XPATH, "//div[@class='v1Nh3 kIKUG _bz0w']/a")))
    elements_link = [x.get_attribute("href") for x in hrefElements]
    for link in elements_link:
        if link not in links:
            links.append(link)
    self.browser.execute_script('window.scrollTo(0,document.body.scrollHeight)')
    self.browser.implicitly_wait(5)
    if len(links) >= number_of_posts:
        break

links = links[:number_of_posts]
with open('Passed.txt', 'a') as f:
    for link in links:
        f.write(link + '\n')
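If Passed.txt already contains links from previous runs, one option (a rough sketch, not tied to the exact class above) is to load it into a set once and only write the new links:

# Sketch: skip links already saved in Passed.txt from earlier runs
# (assumes the `links` list collected above; Passed.txt may not exist yet).
import os

saved = set()
if os.path.exists('Passed.txt'):
    with open('Passed.txt') as f:
        saved = {line.strip() for line in f}

with open('Passed.txt', 'a') as f:
    for link in links:
        if link not in saved:
            f.write(link + '\n')
            saved.add(link)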
I've been trying to get all the hrefs from a news site's home page. In the end, I want to create something that gives me the n most-used words from all the news articles. To do that, I figured I needed the hrefs first, so I can then visit them one after another.
With a lot of help from another user of this platform, this is the code I've got right now:
from bs4 import BeautifulSoup
from selenium import webdriver
url = 'https://ad.nl'
# launch firefox with your url above
# note that you could change this to some other webdriver (e.g. Chrome)
driver = webdriver.Chrome()
driver.get(url)
# click the "accept cookies" button
btn = driver.find_element_by_name('action')
btn.click()
# grab the html. It'll wait here until the page is finished loading
html = driver.page_source
# parse the html soup
soup = BeautifulSoup(html.lower(), "html.parser")
articles = soup.findAll("article")
for i in articles:
    article = driver.find_element_by_class_name('ankeiler')
    hrefs = article.find_element_by_css_selector('a').get_attribute('href')
    print(hrefs)
driver.quit()
It gives me the first href I think, but it won't iterate over the next ones. It just gives me the first href as many times as it has to iterate. Does anyone know how I make it go to the next href instead of being stuck on the first one?
PS. if anyone has some suggestions on how to further do my little project, feel free to share them as I have a lot of things yet to learn about Python and programming in general.
Instead of using beautiful soup, how about this?
articles = driver.find_elements_by_css_selector('article')
for i in articles:
    href = i.find_element_by_css_selector('a').get_attribute('href')
    print(href)
To improve on my previous answer, I have written a full solution to your problem:
from selenium import webdriver

url = 'https://ad.nl'

#Set up the Selenium driver
driver = webdriver.Chrome()
driver.get(url)

#Click the accept cookies button
btn = driver.find_element_by_name('action')
btn.click()

#Get the links of all articles
article_elements = driver.find_elements_by_xpath('//a[@class="ankeiler__link"]')
links = [link.get_attribute('href') for link in article_elements]

#Create a dictionary for every word in the articles
words = dict()

#Iterate through every article
for link in links:
    #Get the article
    driver.get(link)

    #Get the elements that are the body of the article
    article_elements = driver.find_elements_by_xpath('//*[@class="article__paragraph"]')

    #Initialise an empty string
    article_text = ''

    #Add all the text from the elements to the one string
    for element in article_elements:
        article_text += element.text + " "

    #Convert all characters to lower case
    article_text = article_text.lower()

    #Remove all punctuation other than spaces
    for char in article_text:
        if ord(char) > 122 or ord(char) < 97:
            if ord(char) != 32:
                article_text = article_text.replace(char, "")

    #Split the article into words
    for word in article_text.split(" "):
        #If the word is already in the dictionary, update the count
        if word in words:
            words[word] += 1
        #Otherwise make a new entry
        else:
            words[word] = 1

#Print the final dictionary (very large, so maybe sort for most occurring words and display the top 10)
#print(words)

#Sort words by most used
most_used = sorted(words.items(), key=lambda x: x[1], reverse=True)

#Print the top 10 most used words
print("TOP 10 MOST USED: ")
for i in range(10):
    print(most_used[i])

driver.quit()
Works fine for me, let me know if you get any errors.
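As a side note, the word counting can also be written with collections.Counter; a minimal sketch, assuming the same links list and the "article__paragraph" class used above:

# Sketch: the same counting done with collections.Counter
# (assumes the `links` list and "article__paragraph" class from the code above).
import re
from collections import Counter

word_counts = Counter()
for link in links:
    driver.get(link)
    article_text = ' '.join(
        el.text for el in driver.find_elements_by_xpath('//*[@class="article__paragraph"]')
    ).lower()
    # Keep only runs of a-z letters, which drops punctuation and digits
    word_counts.update(re.findall(r'[a-z]+', article_text))

print("TOP 10 MOST USED:")
for word, count in word_counts.most_common(10):
    print(word, count)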
To get all hrefs in an article you can do:
hrefs = article.find_elements_by_xpath('.//a')  # or article.find_elements_by_css_selector('a')
for href in hrefs:
    print(href.get_attribute('href'))

To progress with the project though, maybe the below would help:

hrefs = article.find_elements_by_xpath('.//a')
links = [href.get_attribute("href") for href in hrefs]
for link in links:
    driver.get(link)
    #Add all words in the article to a dictionary, with the key being the word and
    #the value being the number of times it occurs
I am trying to make a web scraper that fetches all the results of every Google search page, but it keeps outputting a "web element reference not seen before" error. I assume it's because the code tries to find the elements before the URL loads, but I am not too sure how to fix it.
from selenium import webdriver

#number of pages
max_page = 5
#number of digits (ie: 2 is 1 digit, 10 is 2 digits)
max_dig = 1

#Open up firefox browser
driver = webdriver.Firefox()

#inputs search into google
question = input("\n What would you like to google today, but replace every space with a '+' (ie: search+this)\n\n")

search = []

#get multiple pages
for i in range(0, max_page + 1):
    #inserts page number into google search
    page_num = (max_dig - len(str(i))) * "0" + str(i)
    #inserts search input and cycles through pages
    url = "https://www.google.com/search?q="+ question +"&ei=LV-uXYrpNoj0rAGC8KSYCg&start="+ page_num +"0&sa=N&ved=0ahUKEwjKs8ie367lAhUIOisKHQI4CaM4ChDy0wMIiQE&biw=1356&bih=946"
    #finds element in every search page
    search += (driver.find_elements_by_class_name('LC20lb'))
    driver.get(url)

#print results
search_items = len(search)
for a in range(search_items):
    #print the page number
    print(type(search[a].text))
Traceback (most recent call last):
File "screwdriver.py", line 32, in <module>
print(type(search[b].text))
selenium.common.exceptions.NoSuchElementException: Message: Web element reference not seen before: 6187cf00-39c8-c14b-a2de-b1d24e965b65
The problem is that Selenium doesn't keep the HTML you found, but rather a reference to an element on the current page. When you load a new page with get(), the reference tries to find the element on the new page and can't. You should get the text (and any other information) from the items before you load the new page.
from selenium import webdriver

max_page = 5
driver = webdriver.Firefox()

question = input("\n What would you like to google today, but replace every space with a '+' (ie: search+this)\n\n")

search = []
for i in range(max_page+1):
    page_num = str(i)
    url = "https://www.google.com/search?q="+ question +"&ei=LV-uXYrpNoj0rAGC8KSYCg&start="+ page_num +"0&sa=N&ved=0ahUKEwjKs8ie367lAhUIOisKHQI4CaM4ChDy0wMIiQE&biw=1356&bih=946"

    items = driver.find_elements_by_class_name('LC20lb')
    for item in items:
        search.append(item.text)

    driver.get(url)

for item in search:
    print(item)
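Note that with this ordering the first find_elements call runs before any search page has loaded, and the last page that gets loaded is never read. If that matters, here is a sketch of the other ordering (load the page first, then read the text); the long tracking parameters from the original URL are dropped here for brevity:

# Sketch: load each results page first, then read the title text,
# so every page (including the last one) is scraped.
search = []
for i in range(max_page + 1):
    url = "https://www.google.com/search?q=" + question + "&start=" + str(i) + "0"
    driver.get(url)
    for item in driver.find_elements_by_class_name('LC20lb'):
        search.append(item.text)

for item in search:
    print(item)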
I am new to programming and need some help with my web-crawler.
At the moment, I have my code opening up every web-page in the list. However, I wish to extract information from each one it loads. This is what I have.
from selenium import webdriver
import csv

driver = webdriver.Firefox()

links_code = driver.find_elements_by_xpath('//a[@class="in-match"]')
first_two = links_code[0:2]
first_two_links = []

for i in first_two:
    link = i.get_attribute("href")
    first_two_links.append(link)

for i in first_two_links:
    driver.get(i)
This loops through the first two pages but scrapes no info. So I tried adding to the for-loop as follows
odds = []
for i in first_two_links:
    driver.get(i)
    driver.find_element_by_xpath('//span[@class="table-main__detail-odds--hasarchive"]')
    odds.append(odd)
However, this runs into an error.
Any help much appreciated.
You are not actually appending anything! You need to assign a variable to
driver.find_element_by_xpath('//span[@class="table-main__detail-odds--hasarchive"]')
and then append it to the list!
from selenium import webdriver;
import csv;

driver = webdriver.Firefox();

links_code : list = driver.find_elements_by_xpath('//a[@class="in-match"]');
first_two : list = links_code[0:2];
first_two_links : list = [];
i : int;

for i in first_two:
    link = i.get_attribute("href");
    first_two_links.append(link);

for i in first_two_links:
    driver.get(i);

odds : list = [];
i : int;

for i in first_two_links:
    driver.get(i);
    o = driver.find_element_by_xpath('//span[@class="table-main__detail-odds--hasarchive"]');
    odds.append(o);
First, after you start the driver you need to go to a website...
Second, in the second for loop, you are trying to append the wrong object... use i, not odd, or make odd = driver.find_element_by_xpath('//span[@class="table-main__detail-odds--hasarchive"]')
If you can provide the URL or the HTML we can help more!
Try this (I have used Google as an example; you will need to change the code...):
from selenium import webdriver

driver = webdriver.Firefox()
driver.get("https://www.google.com")

links_code = driver.find_elements_by_xpath('//a')
first_two = links_code[0:2]
first_two_links = []

for i in first_two:
    link = i.get_attribute("href")
    first_two_links.append(link)
    print(link)

odds = []
for i in first_two_links:
    driver.get(i)
    odd = driver.page_source
    print(odd)
    # driver.find_element_by_xpath('//span[@class="table-main__detail-odds--hasarchive"]')
    odds.append(odd)
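If that span does exist on the target pages, here is a small sketch of reading just its text instead of the whole page source (assuming the "table-main__detail-odds--hasarchive" class from the question is correct):

# Sketch: read only the odds text rather than the full page source
# (assumes the class name from the question and the first_two_links list above).
odds = []
for link in first_two_links:
    driver.get(link)
    elements = driver.find_elements_by_xpath('//span[@class="table-main__detail-odds--hasarchive"]')
    odds.extend(el.text for el in elements)
print(odds)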