How to improve this web crawler logic? - python

Im working on a web crawler that will crawl only internal links using requests and bs4.
I have a rough working version below but Im not sure how to properly handle checking if a link has been crawled previously or not.
import re
import time
import requests
import argparse
from bs4 import BeautifulSoup
internal_links = set()
def crawler(new_link):
html = requests.get(new_link).text
soup = BeautifulSoup(html, "html.parser")
for link in soup.find_all('a', attrs={'href': re.compile("^http://")}):
if "href" in link.attrs:
print(link)
if link.attrs["href"] not in internal_links:
new_link = link.attrs["href"]
print(new_link)
internal_links.add(new_link)
print("All links found so far, ", internal_links)
time.sleep(6)
crawler(new_link)
def main():
parser = argparse.ArgumentParser()
parser.add_argument('url', help='Pass the website url you wish to crawl')
args = parser.parse_args()
url = args.url
#Check full url has been passed otherwise requests will throw error later
try:
crawler(url)
except:
if url[0:4] != 'http':
print('Please try again and pass the full url eg http://example.com')
if __name__ == '__main__':
main()
These are the last few lines of the output:
All links found so far, {'http://quotes.toscrape.com/tableful', 'http://quotes.toscrape.com', 'http://quotes.toscrape.com/js', 'http://quotes.toscrape.com/scroll', 'http://quotes.toscrape.com/login', 'http://books.toscrape.com', 'http://quotes.toscrape.com/'}
ViewState
http://quotes.toscrape.com/search.aspx
All links found so far, {'http://quotes.toscrape.com/tableful', 'http://quotes.toscrape.com', 'http://quotes.toscrape.com/js', 'http://quotes.toscrape.com/search.aspx', 'http://quotes.toscrape.com/scroll', 'http://quotes.toscrape.com/login', 'http://books.toscrape.com', 'http://quotes.toscrape.com/'}
Random
http://quotes.toscrape.com/random
All links found so far, {'http://quotes.toscrape.com/tableful', 'http://quotes.toscrape.com', 'http://quotes.toscrape.com/js', 'http://quotes.toscrape.com/search.aspx', 'http://quotes.toscrape.com/scroll', 'http://quotes.toscrape.com/random', 'http://quotes.toscrape.com/login', 'http://books.toscrape.com', 'http://quotes.toscrape.com/'}
so it is working, but only up until a certain point and then it doesn't seem to follow the links any further.
Im sure its because of this line
for link in soup.find_all('a', attrs={'href': re.compile("^http://")}):
as that will only find the links that start with http and on a lot of the internal pages the links dont have that but when I try it like this
for link in soup.find_all('a')
the program runs very briefly and then ends:
http://books.toscrape.com
{'href': 'http://books.toscrape.com'}
http://books.toscrape.com
All links found so far, {'http://books.toscrape.com'}
index.html
{'href': 'index.html'}
index.html
All links found so far, {'index.html', 'http://books.toscrape.com'}

You could reduce
for link in soup.find_all('a', attrs={'href': re.compile("^http://")}):
if "href" in link.attrs:
print(link)
if link.attrs["href"] not in internal_links:
new_link = link.attrs["href"]
print(new_link)
internal_links.add(new_link)
To
links = {link['href'] for link in soup.select("a[href^='http:']")}
internal_links.update(links)
This uses a grabs only qualifying a tag elements with http protocol and uses a set comprehension to ensure no dupes. It then updates the existing set with any new links. I don't know enough python to comment on efficiency of using .update but I believe it modifies the existing set rather than creating a new one. More methods for combining sets are listed here: How to join two sets in one line without using "|"

Related

Using BeautifulSoup to Download Links From A WebPage

I wrote a function to find all .pdf files from a web-page & download them. It works well when the link is publicly accessible but when I use it for a course website (which can only be accessed on my university's internet), the pdfs downloaded are corrupted and cannot be opened.
How can I fix it?
def get_pdfs(my_url):
html = urllib2.urlopen(my_url).read()
html_page = BeautifulSoup(html)
current_link = ''
links = []
for link in html_page.find_all('a'):
current_link = link.get('href')
if current_link.endswith('pdf'):
links.append(my_url + current_link)
print(links)
for link in links:
#urlretrieve(link)
wget.download(link)
get_pdfs('https://grader.eecs.jacobs-university.de/courses/320241/2019_2/')
When I use this grader link, the current_link is something like /courses/320241/2019_2/lectures/lecture_7_8.pdf but the /courses/320241/2019_2/ part is already included in the my_url and when I append it, it obviously doesn't work. However, the function works perfectly for [this link][1]:
Is there a way I can use the same function to work with both types of links?
OK, I think I understand the issue now. Try the code below on your data. I think it works, but obviously I couldn't try it directly on the page requiring login. Also, I changed your structure and variable definitions a bit, because I find it easier to think that way, but if it works, you can easily modify it to suit your own tastes.
Anyway, here goes:
import requests
from bs4 import BeautifulSoup as bs
from urllib.parse import urlparse
my_urls = ['https://cnds.jacobs-university.de/courses/os-2019/', 'https://grader.eecs.jacobs-university.de/courses/320241/2019_2']
links = []
for url in my_urls:
resp = requests.get(url)
soup = bs(resp.text,'lxml')
og = soup.find("meta", property="og:url")
base = urlparse(url)
for link in soup.find_all('a'):
current_link = link.get('href')
if current_link.endswith('pdf'):
if og:
links.append(og["content"] + current_link)
else:
links.append(base.scheme+"://"+base.netloc + current_link)
for link in links:
print(link)

List links in web page with python

I am trying to write a python script that lists all the links in a webpage that contain some substring. The problem that I am running into is that the webpage has multiple "pages" so that it doesn't clutter all the screen. Take a look at https://www.go-hero.net/jam/17/solutions/1/1/C++ for an example.
This is what I have so far:
import requests
from bs4 import BeautifulSoup
url = "https://www.go-hero.net/jam/17/solutions/1/1/C++"
response = requests.get(url)
soup = BeautifulSoup(response.content, "html5lib")
links = soup.find_all('a')
for tag in links:
link = tag.get('href', None)
if link is not None and 'GetSource' in link:
print(link)
Any suggestions on how I might get this to work? Thanks in advance.
Edit/Update: Using Selenium, you could click the page links before scraping the html to collect all the content into the html. Many/most websites with pagination don't collect all the text in the html when you click through the pages, but I noticed that the example you provided does. Take a look at this SO question for a quick example of making Selenium work with BeautifulSoup. Here is how you could use it in your code:
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Firefox()
original_url = "https://www.go-hero.net/jam/17/solutions/1/1/C++"
driver.get(original_url)
# click the links for pages 1-29
for i in range(1, 30):
path_string = '/jam/17/solutions/1/1/C++#page-' + str(i)
driver.find_element_by_xpath('//a[#href=' + path_string + ']').click()
# scrape from the accumulated html
html = driver.page_source
soup = BeautifulSoup(html)
links = soup.find_all('a')
# proceed as normal from here
for tag in links:
link = tag.get('href', None)
if link is not None and 'GetSource' in link:
print(link)
Original Answer: For the link you provided above, you could simply loop through possible urls and run your scraping code in the loop:
import requests
from bs4 import BeautifulSoup
original_url = "https://www.go-hero.net/jam/17/solutions/1/1/C++"
# scrape from the original page (has no page number)
response = requests.get(original_url)
soup = BeautifulSoup(response.content, "html5lib")
links = soup.find_all('a')
# prepare to scrape from the pages numbered 1-29
# (note that the original page is not numbered, and the next page is "#page-1")
url_suffix = '#page-'
for i in range(1, 30):
# add page number to the url
paginated_url = original_url + url_suffix + str(i)
response = requests.get(paginated_url)
soup = BeautifulSoup(response.content, "html5lib")
# append resulting list to 'links' list
links += soup.find_all('a')
# proceed as normal from here
for tag in links:
link = tag.get('href', None)
if link is not None and 'GetSource' in link:
print(link)
I don't know if you mind that you'll get duplicates in your results. You will get duplicate results in your link list as the code currently stands, but you could add the links to a Set or something instead to easily remedy that.

How to sift through specific items from a webpage using conditional statement

I've made a scraper in python. It is running smoothly. Now I would like to discard or accept specific links from that page as in, links only containing "mobiles" but even after making some conditional statement I can't do so. Hope I'm gonna get any help to rectify my mistakes.
import requests
from bs4 import BeautifulSoup
def SpecificItem():
url = 'https://www.flipkart.com/'
Process = requests.get(url)
soup = BeautifulSoup(Process.text, "lxml")
for link in soup.findAll('div',class_='')[0].findAll('a'):
if "mobiles" not in link:
print(link.get('href'))
SpecificItem()
On the other hand if I do the same thing using lxml library with xpath, It works.
import requests
from lxml import html
def SpecificItem():
url = 'https://www.flipkart.com/'
Process = requests.get(url)
tree = html.fromstring(Process.text)
links = tree.xpath('//div[#class=""]//a/#href')
for link in links:
if "mobiles" not in link:
print(link)
SpecificItem()
So, at this point i think with BeautifulSoup library the code should be somewhat different to get the purpose served.
The root of your problem is your if condition works a bit differently between BeautifulSoup and lxml. Basically, if "mobiles" not in link: with BeautifulSoup is not checking if "mobiles" is in the href field. I didn't look too hard but I'd guess it's comparing it to the link.text field instead. Explicitly using the href field does the trick:
import requests
from bs4 import BeautifulSoup
def SpecificItem():
url = 'https://www.flipkart.com/'
Process = requests.get(url)
soup = BeautifulSoup(Process.text, "lxml")
for link in soup.findAll('div',class_='')[0].findAll('a'):
href = link.get('href')
if "mobiles" not in href:
print(href)
SpecificItem()
That prints out a bunch of links and none of them include "mobiles".

Python not progressing a list of links

So, as i need more detailed data I have to dig a bit deeper in the HTML code of a website. I wrote a script that returns me a list of specific links to detail pages, but I can't bring Python to search each link of this list for me, it always stops at the first one. What am I doing wrong?
from BeautifulSoup import BeautifulSoup
import urllib2
from lxml import html
import requests
#Open site
html_page = urllib2.urlopen("http://www.sitetoscrape.ch/somesite.aspx")
#Inform BeautifulSoup
soup = BeautifulSoup(html_page)
#Search for the specific links
for link in soup.findAll('a', href=re.compile('/d/part/of/thelink/ineed.aspx')):
#print found links
print link.get('href')
#complete links
complete_links = 'http://www.sitetoscrape.ch' + link.get('href')
#print complete links
print complete_links
#
#EVERYTHING WORKS FINE TO THIS POINT
#
page = requests.get(complete_links)
tree = html.fromstring(page.text)
#Details
name = tree.xpath('//dl[#class="services"]')
for i in name:
print i.text_content()
Also: What tutorial can you recommend me to learn how to put my output in a file and clean it up, give variable names, etc?
I think that you want a list of links in complete_links instead of a single link. As #Pynchia and #lemonhead said you're overwritting complete_links every iteration of first for loop.
You need two changes:
Append links to a list and use it to loop and scrap each link
# [...] Same code here
links_list = []
for link in soup.findAll('a', href=re.compile('/d/part/of/thelink/ineed.aspx')):
print link.get('href')
complete_links = 'http://www.sitetoscrape.ch' + link.get('href')
print complete_links
link_list.append(complete_links) # append new link to the list
Scrap each accumulated link in another loop
for link in link_list:
page = requests.get(link)
tree = html.fromstring(page.text)
#Details
name = tree.xpath('//dl[#class="services"]')
for i in name:
print i.text_content()
PS: I recommend scrapy framework for tasks like that.

Python and BeautifulSoup Opening pages

I am wondering how would I open another page in my list with BeautifulSoup? I have followed this tutorial, but it does not tell us how to open another page on the list. Also how would I open a "a href" that is nested inside of a class?
Here is my code:
# coding: utf-8
import requests
from bs4 import BeautifulSoup
r = requests.get("")
soup = BeautifulSoup(r.content)
soup.find_all("a")
for link in soup.find_all("a"):
print link.get("href")
for link in soup.find_all("a"):
print link.text
for link in soup.find_all("a"):
print link.text, link.get("href")
g_data = soup.find_all("div", {"class":"listing__left-column"})
for item in g_data:
print item.contents
for item in g_data:
print item.contents[0].text
print link.get('href')
for item in g_data:
print item.contents[0]
I am trying to collect the href's from the titles of each business, and then open them and scrape that data.
I am still not sure where you are getting the HTML from, but if you are trying to extract all of the href tags, then the following approach should work based on the image you have posted:
import requests
from bs4 import BeautifulSoup
r = requests.get("<add your URL here>")
soup = BeautifulSoup(r.content)
for a_tag in soup.find_all('a', class_='listing-name', href=True):
print 'href: ', a_tag['href']
By adding href=True to the find_all(), it ensures that only a elements that contain an href attribute are returned therefore removing the need to test for it as an attribute.
Just to warn you, you might find some websites will lock you out after one or two attempts as they are able to detect that you are trying to access a site via a script, rather than as a human. If you feel you are not getting the correct responses, I would recommend printing the HTML you are getting back to ensure it it still as you expect.
If you then want to get the HTML for each of the links, the following could be used:
import requests
from bs4 import BeautifulSoup
# Configure this to be your first request URL
r = requests.get("http://www.mywebsite.com/search/")
soup = BeautifulSoup(r.content)
for a_tag in soup.find_all('a', class_='listing-name', href=True):
print 'href: ', a_tag['href']
# Configure this to the root of the above website, e.g. 'http://www.mywebsite.com'
base_url = "http://www.mywebsite.com"
for a_tag in soup.find_all('a', class_='listing-name', href=True):
print '-' * 60 # Add a line of dashes
print 'href: ', a_tag['href']
request_href = requests.get(base_url + a_tag['href'])
print request_href.content
Tested using Python 2.x, for Python 3.x please add parentheses to the print statements.
I had the same problem and I will like to share my findings because I did try the answer, for some reasons it did not work but after some research I found something interesting.
You might need to find the attributes of the "href" link itself:
You will need the exact class which contains the href link in your case, I am thinking="class":"listing__left-column" and equate it to a variable say "all" for example:
from bs4 import BeautifulSoup
all = soup.find_all("div", {"class":"listing__left-column"})
for item in all:
for link in item.find_all("a"):
if 'href' in link.attrs:
a = link.attrs['href']
print(a)
print("")
I did this and I was able to get into another link which was embedded in the home page

Categories

Resources