Use Python to crawl a website

So I am looking for a dynamic way to crawl a website and grab links from each page. I decided to experiment with BeautifulSoup. Two questions: First, how do I do this more dynamically than using nested while statements to search for links? I want to get all the links from this site, but I don't want to keep adding nested while loops.
topLevelLinks = self.getAllUniqueLinks(baseUrl)
listOfLinks = list(topLevelLinks)

length = len(listOfLinks)
count = 0

while count < length:
    twoLevelLinks = self.getAllUniqueLinks(listOfLinks[count])
    twoListOfLinks = list(twoLevelLinks)
    twoCount = 0
    twoLength = len(twoListOfLinks)

    for twoLinks in twoListOfLinks:
        listOfLinks.append(twoLinks)

    count = count + 1

    while twoCount < twoLength:
        threeLevelLinks = self.getAllUniqueLinks(twoListOfLinks[twoCount])
        threeListOfLinks = list(threeLevelLinks)

        for threeLinks in threeListOfLinks:
            listOfLinks.append(threeLinks)

        twoCount = twoCount + 1

print '--------------------------------------------------------------------------------------'

# remove all duplicates
finalList = list(set(listOfLinks))
print finalList
My second question: is there any way to tell if I got all the links from the site? Please forgive me, I am somewhat new to Python (a year or so) and I know some of my processes and logic might be childish, but I have to learn somehow. Mainly I just want to do this more dynamically than using nested while loops. Thanks in advance for any insight.

Spidering over a web site and getting all the links is a common problem. If you Google for "spider web site python" you can find libraries that will do this for you. Here's one I found:
http://pypi.python.org/pypi/spider.py/0.5
Even better, Google found this question already asked and answered here on StackOverflow:
Anyone know of a good Python based web crawler that I could use?

If you are using BeautifulSoup, why don't you use the findAll() method? Basically, in my crawler I do:
self.soup = BeautifulSoup(HTMLcode)

for frm in self.soup.findAll(str('frame')):
    try:
        if not frm.has_key('src'):
            continue
        src = frm[str('src')]
        # rest of URL processing here
    except Exception, e:
        print 'Parser <frame> tag error: ', str(e)
That handles the frame tag. The same goes for "img src" and "a href" tags.
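For "a href", for example, a similar snippet in the same style could be (a hedged sketch, not taken from the original crawler):

for a in self.soup.findAll('a', href=True):
    try:
        href = a['href']
        # rest of URL processing here
    except Exception, e:
        print 'Parser <a> tag error: ', str(e)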
I like the topic though - maybe it's me who has something wrong here...
Edit: there is, of course, a top-level instance which saves the URLs and gets the HTML code from each link later...

To answer your question from the comment, here's an example (it's in Ruby, but I don't know Python, and they are similar enough for you to be able to follow along easily):
#!/usr/bin/env ruby

require 'open-uri'

hyperlinks = []
visited = []

# add all the hyperlinks from a url to the array of urls
def get_hyperlinks url
  links = []
  begin
    s = open(url).read
    s.scan(/(href|src)\w*=\w*[\",\']\S+[\",\']/) do
      link = $&.gsub(/((href|src)\w*=\w*[\",\']|[\",\'])/, '')
      link = url + link if link[0] == '/'

      # add to array if not already there
      links << link if not links.include? link
    end
  rescue
    puts 'Looks like we can\'t be here...'
  end
  links
end

print 'Enter a start URL: '
hyperlinks << gets.chomp
puts 'Off we go!'

count = 0

while true
  break if hyperlinks.length == 0

  link = hyperlinks.shift
  next if visited.include? link
  visited << link

  puts "Connecting to #{link}..."
  links = get_hyperlinks(link)
  puts "Found #{links.length} links on #{link}..."
  hyperlinks = links + hyperlinks
  puts "Moving on with #{hyperlinks.length} links left...\n\n"
end
Sorry about the Ruby, but it's a better language :P and it shouldn't be hard to adapt or, like I said, understand.
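For what it's worth, here is a rough Python 2 sketch of the same idea (treat the regex as illustrative only; it is as loose as the Ruby one):

import re
import urllib2

def get_hyperlinks(url):
    # collect every href/src value from one page
    links = []
    try:
        html = urllib2.urlopen(url).read()
        for match in re.findall(r'(?:href|src)\s*=\s*["\'](\S+?)["\']', html):
            link = url + match if match.startswith('/') else match
            if link not in links:
                links.append(link)
    except Exception:
        print "Looks like we can't be here..."
    return links

hyperlinks = [raw_input('Enter a start URL: ')]
visited = []
while hyperlinks:
    link = hyperlinks.pop(0)
    if link in visited:
        continue
    visited.append(link)
    print 'Connecting to %s...' % link
    hyperlinks = get_hyperlinks(link) + hyperlinks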

1) In Python, we do not count elements of a container and use them to index in; we just iterate over its elements, because that's what we want to do.
2) To handle multiple levels of links, we can use recursion.
def followAllLinks(self, from_where):
    for link in list(self.getAllUniqueLinks(from_where)):
        self.followAllLinks(link)
This does not handle cycles of links, but neither did the original approach. You can handle that by building a set of already-visited links as you go.
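For example, a minimal sketch of that, still assuming the getAllUniqueLinks method from the question:

def followAllLinks(self, from_where, visited=None):
    if visited is None:
        visited = set()
    if from_where in visited:
        return visited                  # already crawled this page, stop here
    visited.add(from_where)
    for link in self.getAllUniqueLinks(from_where):
        self.followAllLinks(link, visited)
    return visited                      # every unique link reached from the start page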

Use scrapy:
Scrapy is a fast high-level screen scraping and web crawling
framework, used to crawl websites and extract structured data from
their pages. It can be used for a wide range of purposes, from data
mining to monitoring and automated testing.
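For the "grab every link from a site" use case in the question, a minimal spider might look like this (a hedged sketch, with placeholder domain and start URL):

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class LinkSpider(CrawlSpider):
    name = 'linkspider'
    allowed_domains = ['example.com']       # placeholder: keeps the crawl on one site
    start_urls = ['http://example.com']     # placeholder: the site you want to crawl

    # follow every link found, and emit each visited URL as an item
    rules = (Rule(LinkExtractor(), callback='parse_item', follow=True),)

    def parse_item(self, response):
        yield {'url': response.url}

Running it with something like scrapy runspider linkspider.py -o links.json lets the framework handle the queueing and de-duplication that the nested while loops were doing by hand.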

Related

Wikipedia Scraping ambiguous results

I am scraping Wikipedia and a disambiguation error comes up because there are multiple articles with the same title. How do I go through all of them and pull them? Additional question: how do I skip them?
text = ['child', 'pca', 'united states']
df = []
for x in text:
    wiki = wikipedia.page(x)
    df.append(wiki.content)
Multiple results come up for some of them and it errors out. Any ideas? I am thinking a try/except/else?
The disambiguation notes have a very specific format. It should be easy for you to find them and extract the links they contain. Indeed, the disambiguation links themselves have a unique class that you can search for.
As to whether you pull them or skip them, that's entirely up to you, depending on your need.
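If this is the wikipedia package, a hedged sketch along the try/except line the asker suggests could look like this; DisambiguationError carries the candidate titles in its options attribute:

import wikipedia

text = ['child', 'pca', 'united states']
df = []
for x in text:
    try:
        df.append(wikipedia.page(x).content)
    except wikipedia.exceptions.DisambiguationError as e:
        # e.options lists the candidate article titles for the ambiguous term;
        # pull them all here, or replace this inner loop with `continue` to skip them
        for option in e.options:
            try:
                df.append(wikipedia.page(option).content)
            except (wikipedia.exceptions.DisambiguationError,
                    wikipedia.exceptions.PageError):
                continue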

Python web scraping recursively (next page)

From this website:
https://search2.ucl.ac.uk/s/search.html?query=max&collection=website-meta&profile=_directory&tab=directory&f.Profile+Type%7Cg=Student&start_rank=1
I need to scrape the next pages (2, 3, ...) using Selenium or lxml.
I can only scrape the first page.
You can try this:
nextNumberIsThere = True
i = 1
while nextNumberIsThere:
    # scroll to the bottom so all results on the page are rendered
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    profileDetails = driver.find_elements_by_xpath("//ul[@class='profile-details']/li")
    for element in profileDetails:
        print(element.text)
    next = driver.find_elements_by_xpath("//a[text()='" + str(i) + "']")
    i += 1
    if len(next) > 0:
        next[0].click()
    else:
        nextNumberIsThere = False
The above code will iterate and fetch the data until there are no numbers left.
If you want to fetch the name, department, and email separately, then try the code below:
nextNumberIsThere = True
i = 1
while nextNumberIsThere:
    # scroll to the bottom so all results on the page are rendered
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    profileDetails = driver.find_elements_by_xpath("//ul[@class='profile-details']")
    for element in profileDetails:
        name = element.find_element_by_xpath("./li[@class='fn']")
        department = element.find_element_by_xpath("./li[@class='org']")
        email = element.find_element_by_xpath("./li[@class='email']")
        print(name.text)
        print(department.text)
        print(email.text)
        print("------------------------------")
    next = driver.find_elements_by_xpath("//a[text()='" + str(i) + "']")
    i += 1
    if len(next) > 0:
        next[0].click()
    else:
        nextNumberIsThere = False
I hope it helps...
Change start_rank in the url. For example:
https://search2.ucl.ac.uk/s/search.html?query=max&collection=website-meta&profile=_directory&tab=directory&f.Profile+Type%7Cg=Student&start_rank=11
The usual solution to this kind of problem is not to use a loop that iterates through "all the pages" (because you don't know how many there are up-front), but rather to use some kind of queue, where scraping one page optionally adds subsequent pages to the queue, to be scraped later.
In your specific example, during the scraping of each page you could look for the link to "next page" and, if it's there, add the next page's URL to the queue, so it will be scraped following the current page; once you hit a page with no "next page" link, the queue will empty and scraping will stop.
A more complex example might include scraping a category page and adding each of its sub-categories as a subsequent page to the scraping queue, each of which might in turn add multiple item pages to the queue, etc.
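A rough sketch of that queue idea in Python, where scrape_page and process are hypothetical placeholders for whatever parsing you already have:

from collections import deque

def scrape_page(url):
    """Hypothetical: parse one page and return (items, next_page_url_or_None)."""
    raise NotImplementedError

def process(items):
    """Hypothetical: store or print the scraped items."""
    raise NotImplementedError

queue = deque(['https://example.com/search?page=1'])   # placeholder start URL
seen = set()
while queue:
    url = queue.popleft()
    if url in seen:
        continue                      # never scrape the same page twice
    seen.add(url)
    items, next_page = scrape_page(url)
    process(items)
    if next_page:
        queue.append(next_page)       # scraping a page can enqueue more pages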
Take a look at scraping frameworks like Scrapy, which build this kind of functionality into their design. You might find some of its other features useful as well, e.g. its ability to find elements on the page using either XPath or CSS selectors.
The first example on the Scrapy homepage shows exactly the kind of functionality you're trying to implement:
class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['https://blog.scrapinghub.com']

    def parse(self, response):
        for title in response.css('.post-header>h2'):
            yield {'title': title.css('a ::text').get()}

        for next_page in response.css('a.next-posts-link'):
            yield response.follow(next_page, self.parse)
One important note about Scrapy: it doesn't use Selenium (at least not out of the box), but rather downloads the page source and parses it. This means that it doesn't run JavaScript, which might be an issue for you if the website you're scraping is client-generated. In that case, you could look into solutions that combine Scrapy and Selenium (a quick Google search turns up a bunch of them, as well as StackOverflow answers on this problem), or you could stick to your Selenium scraping code and implement a queuing mechanism yourself, without Scrapy.

How can I use urllib2, regex, and sys to read a specific URL and then return my search and the results?

I am in need of some help with an assignment. I need to build a "simple" (according to my teacher) web scraper that takes a URL as an argument, searches the source code of that URL, and then returns the links within that source code (anything after href). The example URL my teacher has been having us use is http://citstudent.lanecc.net/tools.shtml. When the program is executed, there should be ten links returned as well as the URL of the website.
Since I am still trying to wrap my head around these concepts, I wasn't sure where to start, so I turned to Stack Overflow and found a script that kind of works. It does what I want it to do, but does not fulfill every requirement:
import urllib2

url = "http://citstudent.lanecc.net/tools.shtml"
page = urllib2.urlopen(url)
data = page.read().split("</a>")
tag = "<a href=\""
endtag = "\">"

for item in data:
    if "<a href" in item:
        try:
            ind = item.index(tag)
            item = item[ind + len(tag):]
            end = item.index(endtag)
        except:
            pass
        else:
            print item[:end]
This works because I hard-coded the URL into my code, and because it prints after some href tags. Normally I would say to just guide me through this and to not just give me the code, but I'm having such a shit day and any explanation or example of this has to be better than what we went over in class. Thank you.
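A hedged sketch of what the assignment describes (URL taken from sys.argv, urllib2 to fetch the page, and a simple regex for the href values; note it only catches double-quoted hrefs):

import re
import sys
import urllib2

url = sys.argv[1]                     # e.g. http://citstudent.lanecc.net/tools.shtml
html = urllib2.urlopen(url).read()

print url                             # the URL of the website itself
for link in re.findall(r'href="([^"]+)"', html):
    print link                        # everything that came after an href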

Python update value within function and reuse it

I searched a lot about this, but I might be using the wrong terms; the answers I found are not very relevant or they are too advanced for me.
So, I have a very simple program. I have a function that reads a web page, scans for href links using BeautifulSoup, takes one of the links it finds and follows it. The function takes the first link through user input.
Now I want this function to re-run automatically using the link it found, but I only manage to create endless loops by using the first variable it got. This is all done in a controlled environment which has a maximum depth of 10 links.
This is my code:
import urllib
from BeautifulSoup import *

site = list()

def follinks(x):
    html = urllib.urlopen(x).read()
    bs = BeautifulSoup(html)
    tags = bs('a')
    for tag in tags:
        site.append(tag.get('href', None))
    x = site[2]
    print x
    return

url1 = raw_input('Enter url:')
How do I make it use the x variable and go back to the start and rerun the function until there are no more links to follow? I tried a few variations of while True, but again ended up in endless loops of the URL the user gave.
Thanks.
What you're looking for is called recursion. It's where you call a method from within its own body definition.
def follow_links(x):
    html = urllib.urlopen(x).read()
    bs = BeautifulSoup(html)

    # Put all the links on page x into the pagelinks list
    pagelinks = []
    tags = bs('a')
    for tag in tags:
        pagelinks.append(tag.get('href', None))

    # Track all links from this page in the master site list
    # (extend mutates the global list in place, so no `global` statement is needed)
    site.extend(pagelinks)

    # Follow the third link, if there is one
    if len(pagelinks) > 2:
        follow_links(pagelinks[2])
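To kick it off with the user's input and avoid revisiting the same page forever, here is one variant of the function above (a sketch; the visited set and the max depth of 10 follow from the question's controlled environment):

visited = set()

def follow_links(x, depth=0, max_depth=10):
    if x is None or x in visited or depth >= max_depth:
        return                         # stop on dead links, repeats, or at depth 10
    visited.add(x)
    html = urllib.urlopen(x).read()
    bs = BeautifulSoup(html)
    pagelinks = [tag.get('href', None) for tag in bs('a')]
    site.extend(pagelinks)
    if len(pagelinks) > 2:
        follow_links(pagelinks[2], depth + 1, max_depth)

url1 = raw_input('Enter url: ')
follow_links(url1)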

Finding urls containing a specific string

I haven't used RegEx before, and everyone seems to agree that it's bad for web scraping and HTML in particular, but I'm not really sure how to solve my little challenge without it.
I have a small Python scraper that opens 24 different webpages. In each webpage, there's links to other webpages. I want to make a simple solution that gets the links that I need and even though the webpages are somewhat similar, the links that I want are not.
The only common thing between the urls seems to be a specific string: 'uge' or 'Uge' (uge means week in Danish - and the week number changes every week, duh). It's not like the urls have a common ID or something like that I could use to target the correct ones each time.
I figure it would be possible using RegEx to go through the webpage and find all urls that have 'uge' or 'Uge' in them and then open them. But is there a way to do that using BS? And if I do it using RegEx, what would a possible solution look like?
For example, here are two of the urls I want to grab in different webpages:
http://www.domstol.dk/KobenhavnsByret/retslister/Pages/Uge45-Tvangsauktioner.aspx
http://www.domstol.dk/esbjerg/retslister/Pages/Straffesageruge32.aspx
This should work... The RegEx uge\d\d? tells it to find "uge" followed by a digit, and possibly another one.
import re

for item in listofurls:
    l = re.findall(r"uge\d\d?", item, re.IGNORECASE)
    if l:
        print item  # just do whatever you want to do when it finds it
Yes, you can do this with BeautifulSoup.
import re
from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(html_string)

# To find just 'Uge##' or 'uge##', as specified in the question:
urls = [el["href"] for el in soup.findAll("a", href=re.compile(r"[Uu]ge\d+"))]

# To find without regard to case at all:
urls = [el["href"] for el in soup.findAll("a", href=re.compile(r"(?i)uge\d+"))]
Or just use a simple for loop:
list_of_urls = ["""LIST GOES HERE"""]

for url in list_of_urls:
    if 'uge' in url.lower():
        # Code to execute
        pass
The regex expression would look something like: uge\d\d
