Searching unique web links - Python

I wrote a program to extract the web links from http://www.stevens.edu/.
Now I am facing the following problems with the program:
1. I want to get only links starting with http and https.
2. I am getting a parser warning from bs4 about not specifying a parser - solved.
How can I fix these problems? I am not finding a proper direction to solve them.
My code is:
import urllib2
from bs4 import BeautifulSoup as bs
url = raw_input('Please enter the url for which you want to see unique web links -')
print "\n"
req = urllib2.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
html = urllib2.urlopen(req).read()
soup = bs(html)
tags = soup('a')
count = 0
web_link = []
for tag in tags:
    count = count + 1
    store = tag.get('href', None)
    web_link.append(store)
print "Total no. of extracted web links are",count,"\n"
print web_link
print "\n"
Unique_list = set(web_link)
Unique_list = list(Unique_list)
print "No. of the Unique web links after using set method", len(Unique_list),"\n"

For the second problem, you need to specify the parser when creating the BeautifulSoup object for the page:
soup = bs(html,"html.parser")
This should remove your warning.
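For the first problem, a minimal sketch that keeps only the hrefs starting with http:// or https:// (building on the loop from the question) could look like this:
web_link = []
for tag in tags:
    href = tag.get('href', None)
    # keep only absolute links that start with http:// or https://
    if href and href.startswith(('http://', 'https://')):
        web_link.append(href)
Applying set() to this filtered list then gives the unique links, as in the original code.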

Related

Beautifulsoup "findAll()" does not return the tags

I am trying to build a scraper to get some abstracts of academic papers and their corresponding titles on this page.
The problem is that my for link in bsObj.findAll('a',{'class':'search-track'}) does not return the links I need to further build my scraper. In my code, the check is like this:
for link in bsObj.findAll('a',{'class':'search-track'}):
print(link)
The for loop above does not print out anything; however, the href links should be inside the <a class="search-track" ...> ... </a> tags.
I have referred to this post, but changing the BeautifulSoup parser does not solve the problem in my code. I am using "html.parser" in my BeautifulSoup constructor: bsObj = bs(html.content, features="html.parser").
And the print(len(bsObj)) prints out "3" while it prints out "2" for both "lxml" and "html5lib".
Also, I started off using urllib.request.urlopen to get the page and then tried requests.get() instead. Unfortunately the two approaches give me the same bsObj.
Here is the code I've written:
#from urllib.request import urlopen
import requests
from bs4 import BeautifulSoup as bs
import ssl
'''
The elsevier search is kind of a tree structure:
"keyword --> a list of journals (a journal contain many articles) --> lists of articles
'''
address = input("Please type in your keyword: ") #My keyword is catalyst for water splitting
#https://www.elsevier.com/en-xs/search-results?
#query=catalyst%20for%20water%20splitting&labels=journals&page=1
address = address.replace(" ", "%20")
address = "https://www.elsevier.com/en-xs/search-results?query=" + address + "&labels=journals&page=1"
journals = []
articles = []
def getJournals(url):
    global journals
    #html = urlopen(url)
    html = requests.get(url)
    bsObj = bs(html.content, features="html.parser")
    #print(len(bsObj))
    #testFile = open('testFile.txt', 'wb')
    #testFile.write(bsObj.text.encode(encoding='utf-8', errors='strict') + '\n'.encode(encoding='utf-8', errors='strict'))
    #testFile.close()
    for link in bsObj.findAll('a', {'class': 'search-track'}):
        print(link)
        ######## does not print anything ########
    '''
    if 'href' in link.attrs and link.attrs['href'] not in journals:
        newJournal = link.attrs['href']
        journals.append(newJournal)
    '''
    return None
# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
getJournals(address)
print(journals)
Can anyone tell me what the problem in my code is, such that the for loop does not print out any links? I need to store the links of the journals in a list and then visit each link to scrape the abstracts of the papers. As far as I know, the abstract part of a paper is free, and the website shouldn't have blocked my ID for accessing it.
This page is dynamically loaded with JavaScript, so BeautifulSoup can't handle it directly. You may be able to do it using Selenium, but in this case you can do it by tracking the API calls made by the page (for more, see, as one of many examples, here).
In your particular case it can be done this way:
from bs4 import BeautifulSoup as bs
import requests
import json
#this is where the data is hiding:
url = "https://site-search-api.prod.ecommerce.elsevier.com/search?query=catalyst%20for%20water%20splitting&labels=journals&start=0&limit=10&lang=en-xs"
html = requests.get(url)
soup = bs(html.content, features="html.parser")
data = json.loads(str(soup))  # the response is in JSON format, so we load it into a dictionary
Note: in this case, it's also possible to dispense with Beautifulsoup altogether and load the response directly, as in data = json.loads(html.content). From this point:
hits = data['hits']['hits']  # target urls are hidden deep inside nested dictionaries and lists
for hit in hits:
    print(hit['_source']['url'])
Output:
https://www.journals.elsevier.com/water-research
https://www.journals.elsevier.com/water-research-x
etc.
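As a side note, since the response is plain JSON, a minimal sketch that skips BeautifulSoup entirely and uses requests' built-in JSON decoding (assuming the same API URL as above) would be:
import requests
url = "https://site-search-api.prod.ecommerce.elsevier.com/search?query=catalyst%20for%20water%20splitting&labels=journals&start=0&limit=10&lang=en-xs"
data = requests.get(url).json()  # equivalent to json.loads(html.content)
for hit in data['hits']['hits']:
    print(hit['_source']['url'])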

Extract count of specific links from a web page.

I am writing a Python script using BeautifulSoup. I need to scrape a website and count unique links, ignoring the links starting with '#'.
For example, if the following links exist on a webpage:
https://www.stackoverflow.com/questions
https://www.stackoverflow.com/foo
https://www.cnn.com/
For this example, the only two unique links will be (the link information after the main domain name is removed):
https://stackoverflow.com/ Count 2
https://cnn.com/ Count 1
Note: this is my first time using python and web scraping tools.
I appreciate all the help in advance.
This is what I have tried so far:
from bs4 import BeautifulSoup
import requests
url = 'https://en.wikipedia.org/wiki/Beautiful_Soup_(HTML_parser)'
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")
count = 0
for link in soup.find_all('a'):
    print(link.get('href'))
    count += 1
There is a function named urlparse in urllib.parse with which you can get the netloc of URLs. And there is an awesome HTTP library named requests_html which can help you get all the links in the source file.
from requests_html import HTMLSession
from collections import Counter
from urllib.parse import urlparse
session = HTMLSession()
r = session.get("the link you want to crawl")
unique_netlocs = Counter(urlparse(link).netloc for link in r.html.absolute_links)
for link in unique_netlocs:
    print(link, unique_netlocs[link])
You could also do this:
from bs4 import BeautifulSoup
from collections import Counter
import requests
soup = BeautifulSoup(requests.get("https://en.wikipedia.org/wiki/Beautiful_Soup_(HTML_parser)").text, "html.parser")
foundUrls = Counter([link["href"] for link in soup.find_all("a", href=lambda href: href and not href.startswith("#"))])
foundUrls = foundUrls.most_common()
for item in foundUrls:
    print("%s: %d" % (item[0], item[1]))
The soup.find_all line checks that each a tag has an href set and that it doesn't start with the # character.
Counter counts the occurrences of each list entry, and most_common orders them by count.
The for loop just prints the results.
My way to do this is to find all links using Beautiful Soup and then determine which link redirects to which location:
import requests
import tldextract
from bs4 import BeautifulSoup

def get_count_url(url):  # get the number of links having the same domain and suffix
    r = requests.get(url)
    soup = BeautifulSoup(r.text, "html.parser")
    count = 0
    urls = {}  # dictionary for the domains
    # input_domain = url.split('//')[1].split('/')[0]
    # library to extract the exact domain (e.g. blog.bbc.com and bbc.com have the same domain)
    input_domain = tldextract.extract(url).domain + "." + tldextract.extract(url).suffix
    for link in soup.find_all('a'):
        word = link.get('href')
        # print(word)
        if word:
            # same website or domain calls
            if "#" in word or word[0] == "/":  # div call or same domain call
                if input_domain not in urls:
                    # print(input_domain)
                    urls[input_domain] = 1  # first encounter with the domain
                else:
                    urls[input_domain] += 1  # multiple encounters
            elif "javascript" in word:
                # javascript function calls (for domains that use modern JS frameworks to display information)
                if "JavascriptRenderingFunctionCall" not in urls:
                    urls["JavascriptRenderingFunctionCall"] = 1
                else:
                    urls["JavascriptRenderingFunctionCall"] += 1
            else:
                # main_domain = word.split('//')[1].split('/')[0]
                main_domain = tldextract.extract(word).domain + "." + tldextract.extract(word).suffix
                # print(main_domain)
                if main_domain.split('.')[0] == 'www':
                    main_domain = main_domain.replace("www.", "")  # removing the www
                if main_domain not in urls:  # maintaining the dictionary
                    urls[main_domain] = 1
                else:
                    urls[main_domain] += 1
            count += 1
    for key, value in urls.items():  # printing the dictionary for better readability
        print(key, value)
    return count
tldextract finds the correct domain name, and soup.find_all('a') finds the a tags. The if statements check for same-domain redirects, JavaScript redirects, or other-domain redirects.

Extracting specific elements from list python 2.7

I am working on a bot that extracts the URLs from a specific page. I have extracted all the links and put them in a list, but now I can't seem to get the real URLs (those leading to other sites, starting with http or https) out of the list and append them to another list, or delete the ones that don't start with http. Thanks in advance.
import urllib2
import requests
from bs4 import BeautifulSoup
def main():
    # get all the links from Bing about cancer
    site = "http://www.bing.com/search?q=cancer&qs=n&form=QBLH&pq=cancer&sc=8-4&sp=-1&sk=&cvid=E56491F36028416EB41694212B7C33F2"
    urls = []
    true_links = []
    r = requests.get(site)
    html_content = r.content
    soup = BeautifulSoup(html_content, 'html.parser')
    links = soup.find_all("a")
    for link in links:
        link = link.get("href")
        urls.append(str(link))
        # urls.append(link.get("href"))
    # print map(str, urls)
    # REMOVE GARBAGE LINKS
    print len(urls)
    print urls

main()
You can use urlparse.urljoin:
link = urlparse.urljoin(site, link.get("href"))
This will create absolute URLs out of relative ones.
You should also be using html_content = r.text instead of html_content = r.content. r.text takes care of using the proper encoding.
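Putting both suggestions together, a rough sketch of the loop that keeps only the absolute http/https links (reusing the variable names from the question, not a tested drop-in) might look like this:
import urlparse  # Python 2; in Python 3 use urllib.parse instead

for link in links:
    href = link.get("href")
    if href:
        absolute = urlparse.urljoin(site, href)  # turn relative links into absolute ones
        if absolute.startswith(("http://", "https://")):
            true_links.append(absolute)
print len(true_links)
print true_links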

finding unique web links using python

I am writing a program to extract unique web links from www.stevens.edu (it is an assignment), but there is one problem. My program is working and extracting links for all sites except www.stevens.edu, for which I am getting the output 'None'. I am very frustrated with this and need help. I am using this URL for testing - http://www.stevens.edu/
import urllib
from bs4 import BeautifulSoup as bs
url = raw_input('enter - ')
html = urllib.urlopen(url).read()
soup = bs(html)
tags = soup('a')
for tag in tags:
    print tag.get('href', None)
Please guide me here and let me know why it is not working with www.stevens.edu.
The site checks the User-Agent header and returns different HTML based on it.
You need to set the User-Agent header to get the proper HTML:
import urllib
import urllib2
from bs4 import BeautifulSoup as bs
url = raw_input('enter - ')
req = urllib2.Request(url, headers={'User-Agent': 'Mozilla/5.0'}) # <--
html = urllib2.urlopen(req).read()
soup = bs(html)
tags = soup('a')
for tag in tags:
    print tag.get('href', None)

Unable to scrape certain values of a website using regex

I've been trying to scrape the information inside a particular set of p tags on a website and have been running into a lot of trouble.
My code looks like:
import urllib
import re
def scrape():
    url = "https://www.theWebsite.com"
    statusText = re.compile('<div id="holdsThePtagsIwant">(.+?)</div>')
    htmlfile = urllib.urlopen(url)
    htmltext = htmlfile.read()
    status = re.findall(statusText, htmltext)
    print("Status: " + str(status))

scrape()
Which unfortunately returns only: "Status: []"
However, that being said, I have no idea what I am doing wrong, because when I was testing on the same website I could use the code
statusText = re.compile('(.+?)')
instead, and I would get what I was trying to get: "Status: ['About', 'About']"
Does anyone know what I can do to get the information within the div tags? Or, more specifically, the single set of p tags the div tags contain? I've tried plugging in just about every value I could think of and have gotten nowhere. After Google, YouTube, and SO searching, I'm running out of ideas now.
I use BeautifulSoup for extracting information between HTML tags. Suppose you want to extract a division like this: <div class='article_body' itemprop='articleBody'>...</div>
Then you can use BeautifulSoup and extract this division with:
soup = BeautifulSoup(<htmltext>) # creating bs object
ans = soup.find('div', {'class':'article_body', 'itemprop':'articleBody'})
Also see the official documentation of bs4.
As an example, I have edited your code to extract a division from a Bloomberg article; you can make your own changes:
import urllib
import re
from bs4 import BeautifulSoup
def scrape():
    url = 'http://www.bloomberg.com/news/2014-02-20/chinese-group-considers-south-africa-platinum-bids-amid-strikes.html'
    htmlfile = urllib.urlopen(url)
    htmltext = htmlfile.read()
    soup = BeautifulSoup(htmltext)
    ans = soup.find('div', {'class': 'article_body', 'itemprop': 'articleBody'})
    print ans

scrape()
You can get BeautifulSoup from here.
P.S.: I use Scrapy and BeautifulSoup for web scraping and I am satisfied with them.
