I have a problem with this particular website: link to website
I'm trying to create a script that can go through all entries, but with the condition that it has "memory", so it can continue from the page it last stopped on. That means I need to know the current page number AND a direct URL to that page.
Here is what I have so far:
current_page_el = driver.find_element_by_xpath("//ul[contains(@class, 'pagination')]/li[@class='disabled']/a")
current_page = int(current_page_el.text)
current_page_url = current_page_el.get_attribute("href")
That code results in:
current_page_url = 'javascript:void(0);'
Is there a way to get the current URL from sites like this? Also, when you click through to the next page, the URL just stays the same as the one I posted at the beginning.
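For what it's worth, one workaround when every page shares the same URL is to persist the page number itself and click forward to it on the next run. A minimal sketch, assuming the pagination markup matches the XPath above and that clicking the item right after the disabled one advances a page (the state file name is just a placeholder):

import os
import time

STATE_FILE = "last_page.txt"  # hypothetical file used to remember progress

def read_current_page(driver):
    # The disabled <li> in the pagination bar holds the current page number.
    el = driver.find_element_by_xpath(
        "//ul[contains(@class, 'pagination')]/li[@class='disabled']/a")
    return int(el.text)

def save_progress(driver):
    with open(STATE_FILE, "w") as f:
        f.write(str(read_current_page(driver)))

def resume(driver):
    # Every page shares the same URL, so resume by clicking forward
    # until the saved page number is reached again.
    if not os.path.exists(STATE_FILE):
        return
    with open(STATE_FILE) as f:
        target = int(f.read())
    while read_current_page(driver) < target:
        driver.find_element_by_xpath(
            "//ul[contains(@class, 'pagination')]"
            "/li[@class='disabled']/following-sibling::li[1]/a").click()
        time.sleep(1)  # crude pause; an explicit wait would be more robust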
Context
This is a repost of Get a page with Selenium but wait for element value to not be empty, which was closed without any valid reason so far as I can tell.
The linked answers in the closure reasoning both rely on knowing what the expected text value will be: each answer explicitly shows the expected text hardcoded into the WebDriverWait call. Furthermore, neither of the linked answers even remotely touches upon the final part of my question:
[whether the expected conditions] come before or after the page Get
"Duplicate" Questions
How to extract data from the following html?
Assert if text within an element contains specific partial text
Original Question
I'm grabbing a web page using Selenium, but I need to wait for a certain value to load. I don't know what the value will be, only what element it will be present in.
It seems that using the expected condition text_to_be_present_in_element_value or text_to_be_present_in_element is the most likely way forward, but I'm having difficulty finding any actual documentation on how to use these and I don't know if they come before or after the page Get:
webdriver.get(url)
Rephrase
How do I get a page using Selenium but wait for an unknown text value to populate an element's text or value before continuing?
I'm sure my answer is not the best one, but here is part of my own code, which helped me with a question similar to yours.
In my case I had trouble with the loading time of the DOM. Sometimes it took 5 seconds, sometimes 1 second, and so on.
url = 'http://www.somesite.com'
browser.get(url)
Because in my case browser.implicitly_wait(7) was not enough, I made a simple for loop to check whether the content has loaded.
# some code...
for try_html in range(7):
    # Make up to 7 tries to check whether the element has loaded.
    # Note: implicitly_wait only affects find_element* calls, so the
    # short sleep is what actually spaces the tries out.
    browser.implicitly_wait(7)
    time.sleep(1)
    html = browser.page_source
    soup = BeautifulSoup(html, 'lxml')
    raw_data = soup.find_all('script', type='application/ld+json')
    # If 'sku' is not found in the page source yet, go for another
    # loop; otherwise break out of the tries and scrape the page.
    if 'sku' not in html:
        continue
    else:
        scrape(raw_data)
        break
It's not perfect, but you can try it.
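For the original question itself (waiting for an unknown value), an explicit wait with a home-made condition avoids hardcoding the expected text. Here is a minimal sketch, where the By.ID locator "price", the placeholder URL, and the Firefox driver are assumptions, and the wait goes after the get:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait

def non_empty_text(locator):
    """Custom wait condition: returns the element once its text is non-empty."""
    def _predicate(driver):
        element = driver.find_element(*locator)
        return element if element.text.strip() else False
    return _predicate

url = 'http://www.somesite.com'  # placeholder
driver = webdriver.Firefox()
driver.get(url)  # get first, then wait

# Wait up to 10 seconds for the element to hold some text,
# without knowing in advance what that text will be.
element = WebDriverWait(driver, 10).until(non_empty_text((By.ID, "price")))
print(element.text)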
From this website:
https://search2.ucl.ac.uk/s/search.html?query=max&collection=website-meta&profile=_directory&tab=directory&f.Profile+Type%7Cg=Student&start_rank=1
I need to scrape the next pages (2, 3, ...) using Selenium or lxml.
I can only scrape the first page.
You can try this:
nextNumberIsThere = True
i = 1
while nextNumberIsThere:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    profileDetails = driver.find_elements_by_xpath("//ul[@class='profile-details']/li")
    for element in profileDetails:
        print(element.text)
    next = driver.find_elements_by_xpath("//a[text()='" + str(i) + "']")
    i += 1
    if len(next) > 0:
        next[0].click()
    else:
        nextNumberIsThere = False
The above code will iterate and fetch the data until there are no numbers left.
If you want to fetch the name, department, and email separately, then try the code below:
nextNumberIsThere = True
i = 1
while nextNumberIsThere:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    profileDetails = driver.find_elements_by_xpath("//ul[@class='profile-details']")
    for element in profileDetails:
        name = element.find_element_by_xpath("./li[@class='fn']")
        department = element.find_element_by_xpath("./li[@class='org']")
        email = element.find_element_by_xpath("./li[@class='email']")
        print(name.text)
        print(department.text)
        print(email.text)
        print("------------------------------")
    next = driver.find_elements_by_xpath("//a[text()='" + str(i) + "']")
    i += 1
    if len(next) > 0:
        next[0].click()
    else:
        nextNumberIsThere = False
I hope it helps...
Change start_rank in the URL. For example:
https://search2.ucl.ac.uk/s/search.html?query=max&collection=website-meta&profile=_directory&tab=directory&f.Profile+Type%7Cg=Student&start_rank=11
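Since the question also mentions lxml, here is a minimal sketch of paging by start_rank without Selenium; the page size of 10 results and the profile-details XPath (borrowed from the answer above) are assumptions about the page's markup:

import requests
from lxml import html

BASE = ("https://search2.ucl.ac.uk/s/search.html?query=max&collection=website-meta"
        "&profile=_directory&tab=directory&f.Profile+Type%7Cg=Student&start_rank={rank}")

rank = 1
while True:
    page = requests.get(BASE.format(rank=rank))
    tree = html.fromstring(page.content)
    rows = tree.xpath("//ul[@class='profile-details']/li")
    if not rows:        # no results on this page: stop paging
        break
    for row in rows:
        print(row.text_content().strip())
    rank += 10          # assumed page size of 10 results per page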
The usual solution to this kind of problem is not to use a loop that iterates through "all the pages" (because you don't know how many there are up-front), but rather have some kind of queue, where scraping one page optionally adds subsequent pages to the queue, to be scraped later.
In your specific example, during the scraping of each page you could look for the link to "next page" and, if it's there, add the next page's URL to the queue, so it will be scraped following the current page; once you hit a page with no "next page" link, the queue will empty and scraping will stop.
A more complex example might include scraping a category page and adding each of its sub-categories as a subsequent page to the scraping queue, each of which might in turn add multiple item pages to the queue, etc.
Take a look at scraping frameworks like Scrapy, which build this kind of functionality into their design. You might find some of its other features useful as well, e.g. its ability to find elements on the page using either XPath or CSS selectors.
The first example on the Scrapy homepage shows exactly the kind of functionality you're trying to implement:
class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['https://blog.scrapinghub.com']

    def parse(self, response):
        for title in response.css('.post-header>h2'):
            yield {'title': title.css('a ::text').get()}

        for next_page in response.css('a.next-posts-link'):
            yield response.follow(next_page, self.parse)
One important note about Scrapy: it doesn't use Selenium (at least not out-of-the-box), but rather downloads the page source and parses it. This means that it doesn't run JavaScript, which might be an issue for you if the website you're scraping is client-generated. In that case, you could look into solutions that combine Scrapy and Selenium (quick googling shows a bunch of them, as well as StackOverflow answers regarding this problem), or you could stick to your Selenium scraping code and implement a queuing mechanism yourself, without Scrapy.
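If you stay with Selenium instead, the queue can be as simple as a deque of URLs. A minimal sketch, where driver is your existing WebDriver, extract_items is a hypothetical placeholder for your own scraping code, and the "Next" link text is an assumption about the site's pagination:

from collections import deque

start_url = ("https://search2.ucl.ac.uk/s/search.html?query=max&collection=website-meta"
             "&profile=_directory&tab=directory&f.Profile+Type%7Cg=Student&start_rank=1")
queue = deque([start_url])
seen = set()

while queue:
    url = queue.popleft()
    if url in seen:
        continue
    seen.add(url)
    driver.get(url)
    extract_items(driver)  # hypothetical: scrape whatever you need from this page
    # If the page links to a next page, queue it to be scraped later.
    for link in driver.find_elements_by_partial_link_text("Next"):
        href = link.get_attribute("href")
        if href and href not in seen:
            queue.append(href)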
I'm running into the infamous StaleElementReferenceException error with Selenium. I've checked previous questions on the subject, and the common solution is to add an implicit wait, an explicit wait, or time.sleep to give the website time to load. I've tried this, but I am still experiencing the error. Can anyone tell me what the issue is?
Here is my code:
links = driver.find_elements_by_css_selector("a.overline-productName")

time.sleep(2)

# finds pricing data of links on page
link_count = 0
for element in links:
    links[link_count].click()
    cents = driver.find_element_by_css_selector("span.cents")
    dollar = driver.find_element_by_css_selector("span.dollar")
    text_price = dollar.text + "." + cents.text
    price = float(text_price)
    print(price)
    print(link_count)
    driver.execute_script("window.history.go(-1)")
    link_count = link_count + 1
    time.sleep(5)
What am I missing?
You're storing your links in a list. The second you follow a link to another page, that set of links is stale. So the next iteration in your loop will attempt to click a stale link from the list.
Even if you go back in history as you do later, that original element reference is gone.
Your best bet is to loop through based on index, and find the links each time you return to the page.
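A minimal sketch of that index-based approach, reusing the selectors and driver from the question (the sleep lengths are arbitrary):

import time

link_count = len(driver.find_elements_by_css_selector("a.overline-productName"))
for index in range(link_count):
    # Re-find the links on every pass so the reference is never stale.
    links = driver.find_elements_by_css_selector("a.overline-productName")
    links[index].click()

    cents = driver.find_element_by_css_selector("span.cents")
    dollar = driver.find_element_by_css_selector("span.dollar")
    print(float(dollar.text + "." + cents.text))
    print(index)

    driver.back()  # same effect as window.history.go(-1)
    time.sleep(2)  # crude pause; an explicit wait on the link list is more robust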
Recently I tried scraping; this time I wanted to go from page to page until I reach the final destination I want. Here's my code:
sub_categories = browser.find_elements_by_class_name("ty-menu__submenu-link")
for sub_category in sub_categories:
    sub_category = str(sub_category.get_attribute("href"))
    if(sub_category is not 'http://www.lsbags.co.uk/all-bags/view-all-handbags-en/' and sub_category is not "None"):
        browser.get(sub_category)
        print("Entered: " + sub_category)
        product_titles = browser.find_elements_by_class_name("product-title")
        for product_title in product_titles:
            final_link = product_title.get_attribute("href")
            if(str(final_link) is not "None"):
                browser.get(str(final_link))
                print("Entered: " + str(final_link))
                #DO STUFF
I already tried the wait and the wrapper (try/except) solutions from here, but I do not get why it's happening. I have an idea of why it is happening: the browser gets lost when it finishes one item, right?
I don't know how I should express this idea. In my mind I imagine it like this:
TIMELINE:
*PAGE 1 is within a loop; ALL THE URLS WITHIN IT ARE PROCESSED ONE BY ONE
*The first URL of PAGE 1 is caught, so browser.get turns the page to PAGE 2
*PAGE 2 has the final list of links I want to evaluate, so another loop here
to get each URL, and within that URL #DO STUFF
*After #DO STUFF, get to the second URL of PAGE 2 and #DO STUFF again.
*Let's assume PAGE 2 has only two URLs, so it finishes looping and goes back to PAGE 1
*The second URL of PAGE 1 is caught...
and so on... I think I have expressed this idea at some point in my code; I don't know which part is not working and thus throwing the exception.
Any help is appreciated, please help. Thanks!
The problem is that, after you navigate to the next page but before it has finished loading, Selenium finds the elements you are waiting for, but these are the elements of the page you are coming from. Once the next page loads, those elements are no longer attached to the DOM; they have been replaced by the elements of the new page. Selenium then tries to interact with elements of the former page that are no longer attached to the DOM, which raises a StaleElementReferenceException.
After you press the link for the next page, you have to wait until the next page is completely loaded before you start your loop again.
So you have to find something on the page, other than the elements you are going to interact with, that tells you the next page has loaded.
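One way to put that into practice with an explicit wait (a sketch based on the class names in the question, not on the site's real markup): collect the hrefs as plain strings up front so no stale element is reused, and wait for the new page's own elements before touching them.

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Collect plain-string URLs first, so no WebElement is reused after navigation.
sub_category_urls = [el.get_attribute("href")
                     for el in browser.find_elements_by_class_name("ty-menu__submenu-link")]

for url in sub_category_urls:
    if not url or url == 'http://www.lsbags.co.uk/all-bags/view-all-handbags-en/':
        continue
    browser.get(url)
    print("Entered: " + url)
    # Wait until the new page exposes the elements we actually want,
    # instead of touching leftovers from the previous page.
    WebDriverWait(browser, 10).until(
        EC.presence_of_all_elements_located((By.CLASS_NAME, "product-title")))
    product_urls = [el.get_attribute("href")
                    for el in browser.find_elements_by_class_name("product-title")]
    for product_url in product_urls:
        if product_url:
            browser.get(product_url)
            # DO STUFF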
import urllib
import re
import os

search = (raw_input('[!]Search: '))
site = "http://www.exploit-db.com/list.php?description="+search+"&author=&platform=&type=&port=&osvdb=&cve="
print site
source = urllib.urlopen(site).read()
founds = re.findall("href='/exploits/\d+", source)
print "\n[+]Search", len(founds), "Results\n"
if len(founds) >= 1:
    for found in founds:
        found = found.replace("href='", "")
        print "http://www.exploit-db.com" + found
else:
    print "\nCouldnt find anything with your search\n"
When I search the exploit-db.com site I only come up with 25 results. How can I make it go to the other pages or get past 25 results?
It's easy to check by just visiting the site and looking at the URLs as you manually page: put page=1& right after the ? in the URL to look at the second page of results, page=2& to look at the third page, and so forth.
How is this a Python question? It's a (very elementary!) "screen scraping" question.
Apparently the exploit-db.com site doesn't allow extending the page size. You therefore need to "manually" page through the result list by repeating the urllib.urlopen() call for subsequent pages. The URL is the same as the one initially used, plus the &page=n parameter. Note that this n value appears to be 0-based (i.e. &page=1 gives the second page).
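A minimal sketch of that paging loop, extending the question's Python 2 script (the cap of 5 pages is an arbitrary example; the loop also stops early once a page returns no matches):

import urllib
import re

search = raw_input('[!]Search: ')
base = ("http://www.exploit-db.com/list.php?description=" + search +
        "&author=&platform=&type=&port=&osvdb=&cve=")

results = []
for page in range(5):              # &page= is 0-based: 0 is the first page
    source = urllib.urlopen(base + "&page=" + str(page)).read()
    founds = re.findall(r"href='/exploits/\d+", source)
    if not founds:                 # no matches on this page: stop paging
        break
    results.extend("http://www.exploit-db.com" + f.replace("href='", "")
                   for f in founds)

print "\n[+]Search", len(results), "Results\n"
for url in results:
    print url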