Python web scraping recursively (next page) - python

from this website:
https://search2.ucl.ac.uk/s/search.html?query=max&collection=website-meta&profile=_directory&tab=directory&f.Profile+Type%7Cg=Student&start_rank=1
I need to scrape the following pages (2, 3, ...) using Selenium or lxml.
So far I can only scrape the first page.

You can try this:
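# Selenium 3 API; in Selenium 4 use driver.find_elements(By.XPATH, ...)
nextNumberIsThere = True
i=1
while nextNumberIsThere:
    # scroll to the bottom so the pagination links are in view
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
    profileDetails = driver.find_elements_by_xpath("//ul[@class='profile-details']/li")
    for element in profileDetails:
        print(element.text)
    # look for the pagination link whose text is the page number i
    next_links = driver.find_elements_by_xpath("//a[text()='"+str(i)+"']")
    i+=1
    if len(next_links) > 0:
        next_links[0].click()
    else:
        nextNumberIsThere = False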
The above code will iterate and fetch the data until there are no numbers left.
If you want to fetch the name, department and email separately, then try the code below:
nextNumberIsThere = True
i=1
while nextNumberIsThere:
    # scroll to the bottom so the pagination links are in view
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
    profileDetails = driver.find_elements_by_xpath("//ul[@class='profile-details']")
    for element in profileDetails:
        name = element.find_element_by_xpath("./li[@class='fn']")
        # find_element (singular), so that .text works below
        department = element.find_element_by_xpath("./li[@class='org']")
        email = element.find_element_by_xpath("./li[@class='email']")
        print(name.text)
        print(department.text)
        print(email.text)
        print("------------------------------")
    # look for the pagination link whose text is the page number i
    next_links = driver.find_elements_by_xpath("//a[text()='"+str(i)+"']")
    i+=1
    if len(next_links) > 0:
        next_links[0].click()
    else:
        nextNumberIsThere = False
I hope it helps...

Change start_rank in the url. For example:
https://search2.ucl.ac.uk/s/search.html?query=max&collection=website-meta&profile=_directory&tab=directory&f.Profile+Type%7Cg=Student&start_rank=11
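As a rough sketch of that idea, the snippet below builds the URL of each results page by rewriting start_rank. The page size of 10 is an assumption inferred from the two URLs above (start_rank=1 and start_rank=11), and page_url is a hypothetical helper name, not something the site or any library provides.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

BASE_URL = (
    "https://search2.ucl.ac.uk/s/search.html?query=max&collection=website-meta"
    "&profile=_directory&tab=directory&f.Profile+Type%7Cg=Student&start_rank=1"
)
PAGE_SIZE = 10  # assumption: 10 results per page


def page_url(base_url, page, page_size=PAGE_SIZE):
    """Return base_url with start_rank set for the given 1-based page number."""
    parts = urlsplit(base_url)
    # keep every query parameter except the old start_rank
    params = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
              if k != "start_rank"]
    params.append(("start_rank", str((page - 1) * page_size + 1)))
    return urlunsplit(parts._replace(query=urlencode(params)))


for page in range(1, 4):
    print(page_url(BASE_URL, page))
```

Each generated URL can then be passed to driver.get() or to whatever downloader you use. You still need a stop condition, for example a page that returns no results.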

The usual solution to this kind of problem is not to use a loop that iterates through "all the pages" (because you don't know how many there are up-front), but rather have some kind of queue, where scraping one page optionally adds subsequent pages to the queue, to be scraped later.
In your specific example, during the scraping of each page you could look for the link to "next page" and, if it's there, add the next page's URL to the queue, so it will be scraped following the current page; once you hit a page with no "next page" link, the queue will empty and scraping will stop.
A more complex example might include scraping a category page and adding each of its sub-categories as a subsequent page to the scraping queue, each of which might in turn add multiple item pages to the queue, etc.
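The queue idea can be sketched in a few lines of plain Python. Everything here is illustrative: FAKE_SITE and fetch are hypothetical stand-ins for real downloading and parsing, and the page names are made up.

```python
from collections import deque

# Hypothetical stand-in for a real site: each page has items and an optional "next" link.
FAKE_SITE = {
    "/page1": {"items": ["a", "b"], "next": "/page2"},
    "/page2": {"items": ["c"], "next": "/page3"},
    "/page3": {"items": ["d"], "next": None},
}


def fetch(url):
    """Placeholder for downloading and parsing a page."""
    return FAKE_SITE[url]


def crawl(start_url):
    queue = deque([start_url])
    seen = {start_url}          # never queue the same URL twice
    results = []
    while queue:                # stops once no page has added a follow-up
        url = queue.popleft()
        page = fetch(url)
        results.extend(page["items"])
        next_url = page["next"]
        if next_url and next_url not in seen:
            seen.add(next_url)
            queue.append(next_url)
    return results


print(crawl("/page1"))
```

The same loop covers the category example: a page may simply append several URLs to the queue instead of one.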
Take a look at scraping frameworks like Scrapy which include this kind of functionality easily in their design. You might find some of its other features useful as well, e.g. its ability to find elements on the page using either XPath or CSS selectors.
The first example on the Scrapy homepage shows exactly the kind of functionality you're trying to implement:
import scrapy

class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['https://blog.scrapinghub.com']

    def parse(self, response):
        # yield one item per post title on the current page
        for title in response.css('.post-header>h2'):
            yield {'title': title.css('a ::text').get()}
        # queue the "next page" link, if any, to be parsed by this same method
        for next_page in response.css('a.next-posts-link'):
            yield response.follow(next_page, self.parse)
One important note about Scrapy: it doesn't use Selenium (at least not out-of-the-box), but rather downloads the page source and parses it. This means that it doesn't run JavaScript, which might be an issue for you if the website you're scraping is client-generated. In that case, you could look into solutions that combine Scrapy and Selenium (quick googling shows a bunch of them, as well as StackOverflow answers regarding this problem), or you could stick to your Selenium scraping code and implement a queuing mechanism yourself, without Scrapy.

Related

Figuring out page url in Selenium (javascript:void(0);)

I have a problem with this particular website: link to website
I'm trying to create a script that can go through all entries but with the condition that it has "memory" so it can continue from the page it last was on. That means I need to know current page number AND a direct url to that page.
Here is what I have so far:
current_page_el = driver.find_element_by_xpath("//ul[contains(@class, 'pagination')]/li[@class='disabled']/a")
current_page = int(current_page_el.text)
current_page_url = current_page_el.get_attribute("href")
That code results in
current_page_url = 'javascript:void(0);'
Is there a way to get the current URL from sites like this? Also, when you click to get to the next page, the link just stays the same as the one I posted at the beginning.

staleelementreferenceexception in Selenium Python on Nested Loops

Recently I tried scraping, and this time I wanted to go from page to page until I reach the final destination I want. Here's my code:
sub_categories = browser.find_elements_by_class_name("ty-menu__submenu-link")
for sub_category in sub_categories:
    sub_category = str(sub_category.get_attribute("href"))
    # compare strings with != ("is not" tests object identity, not equality)
    if sub_category != 'http://www.lsbags.co.uk/all-bags/view-all-handbags-en/' and sub_category != "None":
        browser.get(sub_category)
        print("Entered: " + sub_category)
        product_titles = browser.find_elements_by_class_name("product-title")
        for product_title in product_titles:
            final_link = product_title.get_attribute("href")
            if str(final_link) != "None":
                browser.get(str(final_link))
                print("Entered: " + str(final_link))
                #DO STUFF
I already tried the wait and the wrapper (the try/except one) solutions from here, but I do not get why it's happening. I have an idea why: the browser gets lost when it finishes one item, right?
I don't know how should I express this idea. In my mind I imagine it would be like this:
TIMELINE:
*PAGE 1 is within a loop, ALL THE URLS WITHIN IT IS PROCESSED ONE BY ONE
*The first url of PAGE 1 is caught. Thus do browser.get page turn to PAGE 2
*PAGE 2 has the final list of links I want to evaluate, so another loop here
to get that url, and within that url #DO STUFF
*After #DO STUFF get to the second url of PAGE 2 and #DO STUFF again.
*Let's assume PAGE 2 has only two urls, so it finished looping, so it goes back to PAGE 1
*The second url of PAGE 1 is caught...
and so on... I think I have expressed my idea at some point in my code, but I don't know which part is not working and so raising the exception.
Any help is appreciated, please help. Thanks!
The problem is that after you navigate to the next page, but before that page has actually arrived, Selenium finds the elements you are waiting for. These are still the elements of the page you are coming from. Once the next page has loaded, those elements are no longer attached to the DOM; they have been replaced by the elements of the new page. Selenium then tries to interact with the elements of the former page, which are no longer attached to the DOM, and this raises a StaleElementReferenceException.
After you click the link for the next page, you have to wait until the next page is completely loaded before you start your loop again.
So you have to find something on the page, other than the elements you are going to interact with, that tells you the next page has loaded.
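In Selenium this waiting is normally done with WebDriverWait and a condition such as expected_conditions.staleness_of on an element of the old page. To show only the underlying idea without needing a browser, here is a hand-rolled sketch of the polling loop. wait_until and page_loaded are hypothetical names, and page_loaded merely simulates a page that becomes ready on the third check.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() repeatedly until it is truthy or timeout seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(poll)


# Stand-in for "the next page has loaded": becomes true on the third check.
checks = {"count": 0}


def page_loaded():
    checks["count"] += 1
    return checks["count"] >= 3


wait_until(page_loaded, timeout=2.0, poll=0.01)
print("loaded after", checks["count"], "checks")
```

Only after the wait returns should the loop look up the elements again, so that every reference belongs to the page currently in the browser.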

Scrapy Splash missing elements

I've written a spider to crawl the boardgamegeek.com/browse/boardgame site for information regarding boardgames in the list.
My problem is that when pulling two specific selectors in my code, a response is not always received for those selectors: sometimes it returns a selector object, other times it doesn't. After inspecting the response during debugging, I found that the dynamically loaded elements don't exist in the returned page source.
My two offending lines
bggspider.py
bg['txt_cnt'] = response.xpath(
    selector_paths.SEL_TXT_REVIEWS).extract_first()
bg['vid_cnt'] = response.xpath(
    selector_paths.SEL_VID_REVIEWS).extract_first()
Where the selectors are defined as
selector_paths.py
SEL_TXT_REVIEWS = '//div[@class="panel-inline-links"]/a[contains(text(), "All Text Reviews")]/text()'
SEL_VID_REVIEWS = '//div[@class="panel-inline-links"]/a[contains(text(), "All Video Reviews")]/text()'
After yielding the bg item, in the pipeline the attributes are processed where a check is performed since many boardgames have very little information for various parts of the page.
pipelines.py
if item['txt_cnt']:
    item['txt_cnt'] = int(re.findall(r'\d+', item['txt_cnt'])[0])
else:
    item['txt_cnt'] = 0
if item['vid_cnt']:
    item['vid_cnt'] = int(re.findall(r'\d+', item['vid_cnt'])[0])
else:
    item['vid_cnt'] = 0
The aim of the field processing is just to grab the numerical value in the string which is the number of text and video reviews for a boardgame.
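Put another way, the processing amounts to something like the small helper below. This is only a sketch: review_count is a hypothetical name, and the label text "All Text Reviews (12)" is an assumed example of what the selector returns, not copied from the site.

```python
import re


def review_count(label):
    """Return the first integer found in label, or 0 if label is empty or has no digits."""
    if not label:
        return 0
    match = re.search(r"\d+", label)
    return int(match.group()) if match else 0


print(review_count("All Text Reviews (12)"))
print(review_count(None))
```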
I'm assuming I'm missing something that has to do with Splash since I'm getting selector items for some/most queries but still missing many.
I am running the ScrapySplash docker container locally, localhost:8050.
Code for the spider can be found here: BGGSpider on GitHub
Any help or information about how to remedy this problem or how ScrapySplash works would be appreciated.

if statement not working for spider in scrapy

I am a Python/Scrapy newbie. I am trying to scrape a website for practice; basically, I want to pull all the companies that are active and save them to a CSV file. My code is pasted below. I added an if statement, but it doesn't seem to be working and I am not sure what I am doing wrong.
Also I think the spider is crawling the website multiple times based on its output. I only want it to crawl the site once every time I run it.
Just an FYI I did search stackoverflow for the answer and I found a few solutions but I couldn't get any of them to work. I guess this is part of being a rookie.
from scrapy.spider import Spider
from scrapy.selector import Selector
from bizzy.items import BizzyItem
class SunSpider(Spider):
    name = "Sun"
    allowed_domains = ['sunbiz.org']
    start_urls = [
        'http://search.sunbiz.org/Inquiry/CorporationSearch/SearchResults/EntityName/a/Page1'
    ]
    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//tbody/tr')
        items = []
        for site in sites:
            item = BizzyItem()
            item["company"] = sel.xpath('//td[1]/a/text()').extract()
            item["status"] = sel.xpath('//td[3]/text()').extract()
            if item["status"] != 'Active':
                pass
            else:
                items.append(item)
        return items
Crawling Multiple Times?
I've had time now to read over your code and glance at the source code for the site you are trying to scrape. First of all, I can tell you from my admittedly limited experience with Scrapy that your spider is not crawling the website multiple times. What you are experiencing is simply the nightmarish wall of debugging output the scrapy devs decided it was a good idea to spew by default. :)
It's actually very useful information if you read through it, and once you learn to spot patterns you can almost read it as it whizzes by. I believe they properly use stderr, so in a Unix-y environment you can always silence it with scrapy crawl myspider -o output.json -t json 2>/dev/null (IIRC).
Mysterious if Statement
Because of the nature of extract operating over selectors that might well return multiple elements, it returns a list. If you were to print your result, even though in the xpath you selected down to text(), you would find it looked like this:
[u'string'] # Note the brackets
#^ no little u if you are running this with Python 3.x
You want the first element (only member) of that list, [0]. Fortunately, you can add it right to the method chain you have already constructed for extract:
item["company"] = sel.xpath('//td[1]/a/text()').extract()[0]
item["status"] = sel.xpath('//td[3]/text()').extract()[0]
Then (assuming your xpath is correct - I didn't check it), your conditional should behave as expected. (A list of any size will never equal a string, so you always pass.)
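The comparison problem can be reproduced without Scrapy at all. In this sketch the list stands in for what extract() returns, and first_or_default is a hypothetical helper for the case where nothing matched and [0] would raise an IndexError.

```python
extracted = ["Active"]  # stand-in for the list that extract() returns

print(extracted != "Active")     # True: a list never equals a string
print(extracted[0] != "Active")  # False: compare the element instead


def first_or_default(values, default=None):
    """Return the first element of values, or default if the list is empty."""
    return values[0] if values else default


print(first_or_default([], default="Unknown"))
```

Scrapy selectors also offer extract_first(), which returns None rather than raising when there is no match.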

Use python to crawl a website

So I am looking for a dynamic way to crawl a website and grab links from each page. I decided to experiment with BeautifulSoup. Two questions. First: how do I do this more dynamically than using nested while statements to search for links? I want to get all the links from this site, but I don't want to keep adding nested while loops.
topLevelLinks = self.getAllUniqueLinks(baseUrl)
listOfLinks = list(topLevelLinks)
length = len(listOfLinks)
count = 0
while(count < length):
    twoLevelLinks = self.getAllUniqueLinks(listOfLinks[count])
    twoListOfLinks = list(twoLevelLinks)
    twoCount = 0
    twoLength = len(twoListOfLinks)
    for twoLinks in twoListOfLinks:
        listOfLinks.append(twoLinks)
    count = count + 1
    while(twoCount < twoLength):
        threeLevelLinks = self.getAllUniqueLinks(twoListOfLinks[twoCount])
        threeListOfLinks = list(threeLevelLinks)
        for threeLinks in threeListOfLinks:
            listOfLinks.append(threeLinks)
        twoCount = twoCount +1
        print('--------------------------------------------------------------------------------------')
#remove all duplicates
finalList = list(set(listOfLinks))
print(finalList)
My second question: is there any way to tell whether I got all the links from the site? Please forgive me, I am somewhat new to Python (a year or so) and I know some of my processes and logic might be childish. But I have to learn somehow. Mainly I just want to do this more dynamically than with nested while loops. Thanks in advance for any insight.
The problem of spidering over a web site and getting all the links is a common problem. If you Google search for "spider web site python" you can find libraries that will do this for you. Here's one I found:
http://pypi.python.org/pypi/spider.py/0.5
Even better, Google found this question already asked and answered here on StackOverflow:
Anyone know of a good Python based web crawler that I could use?
If using BeautifulSoup, why don't you use the findAll() method? Basically, in my crawler I do:
self.soup = BeautifulSoup(HTMLcode)
for frm in self.soup.findAll('frame'):
    try:
        # skip frames without a src attribute
        if frm.get('src') is None:
            continue
        src = frm['src']
        #rest of URL processing here
    except Exception as e:
        print('Parser <frame> tag error: ', str(e))
That handles the frame tag. The same goes for "img src" and "a href" tags.
I like the topic though - maybe it's me who has something wrong here...
edit: there is of course a top-level instance, which saves the URLs and later gets the HTML code from each link...
To answer your question from the comment, here's an example (it's in ruby, but I don't know python, and they are similar enough for you to be able to follow along easily):
#!/usr/bin/env ruby
require 'open-uri'
hyperlinks = []
visited = []
# add all the hyperlinks from a url to the array of urls
def get_hyperlinks url
  links = []
  begin
    # URI.open is required on Ruby 3+, where plain open no longer fetches URLs
    s = URI.open(url).read
    s.scan(/(href|src)\s*=\s*[\",\']\S+[\",\']/) do
      link = $&.gsub(/((href|src)\s*=\s*[\",\']|[\",\'])/, '')
      link = url + link if link[0] == '/'
      # add to array if not already there
      links << link unless links.include?(link)
    end
  rescue
    puts 'Looks like we can\'t be here...'
  end
  links
end
print 'Enter a start URL: '
hyperlinks << gets.chomp
puts 'Off we go!'
count = 0
while true
  break if hyperlinks.length == 0
  link = hyperlinks.shift
  next if visited.include? link
  visited << link
  puts "Connecting to #{link}..."
  links = get_hyperlinks(link)
  puts "Found #{links.length} links on #{link}..."
  hyperlinks = links + hyperlinks
  puts "Moving on with #{hyperlinks.length} links left...\n\n"
end
Sorry about the Ruby, but it's a better language :P and it shouldn't be hard to adapt or, like I said, understand.
1) In Python, we do not count elements of a container and use them to index in; we just iterate over its elements, because that's what we want to do.
2) To handle multiple levels of links, we can use recursion.
def followAllLinks(self, from_where):
    for link in list(self.getAllUniqueLinks(from_where)):
        self.followAllLinks(link)
This does not handle cycles of links, but neither did the original approach. You can handle that by building a set of already-visited links as you go.
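A minimal sketch of the visited-set idea follows. The LINKS dictionary is a made-up link graph (including a cycle between "a" and "b") that stands in for real pages, and get_all_unique_links is a hypothetical replacement for the getAllUniqueLinks method in the question.

```python
# Hypothetical link graph standing in for real pages; note the cycle a -> b -> a.
LINKS = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": [],
    "d": ["c"],
}


def get_all_unique_links(url):
    """Stand-in for fetching a page and returning the set of links on it."""
    return set(LINKS.get(url, []))


def follow_all_links(url, visited=None):
    """Recursively visit url and everything reachable from it, once each."""
    if visited is None:
        visited = set()
    if url in visited:      # already seen: this is what breaks cycles
        return visited
    visited.add(url)
    for link in sorted(get_all_unique_links(url)):
        follow_all_links(link, visited)
    return visited


print(sorted(follow_all_links("a")))
```

On a large site the recursion can exceed Python's default recursion limit, in which case an explicit queue or stack with the same visited set does the same job.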
Use scrapy:
Scrapy is a fast high-level screen scraping and web crawling
framework, used to crawl websites and extract structured data from
their pages. It can be used for a wide range of purposes, from data
mining to monitoring and automated testing.
