sanitize && build url

sanitize && build url - python

Is it possible, whith beautifulsoup (python), to extract urls absolute instead of relative urls of a web page ?
For example, when I scrap http://bing.com and ask a href links :
for link in soup.findAll('a'):
It return as well as relative than absolute urls :
http://bing.com/?scope=web&FORM=Z9LH
/maps/?FORM=Z9LH3
/news?FORM=Z9LH4
/explore?FORM=Z9LH5
/profile/history?FORM=Z9LH6
http://fr.msn.com/
http://www.office.com?WT.mc_id=O16_BingHP
Many thanks.

If you want to only match absolute URLs, the simplest way to do it would be to use a CSS selector:
soup.select("a[href^=http]")
Here ^= means "starts with".
If you want to locate all the links and make absolute URLs out of relative URLs, use urljoin():
from urlparse import urljoin
# Python 3: from urllib.parse import urljoin
base_url = "http://bing.com"
for link in soup.find_all("a", href=True):
absolute_url = urljoin(base_url, link["href"])
print(absolute_url)
Note that if the URL is already absolute, urljoin() would leave it as is.

Use filter() and lambdas.
urlList = filter(lambda aTag: aTag['href'].startswith('http'), soup('a'))
should do the trick.
In short, check the whether the 'href' attribute of your links start with the string 'http'.
If you want to recreate absolute URLs from the relatives ones, you can do this:
urlThatCurrentlyScraping = 'http://bing.com/something/...'
for link in soup('a'):
if not link['href'].startswith('http'):
fixedLinkHref = urlThatCurrentlyScraping + link['href']
else:
fixedLinkHref = link['href']
# do something

Related

make_links_absolute() results in broken absolute URLs

I need to convert relative URLs from a HTML page to absolute ones. I'm using pyquery for parsing.
For instance, this page http://govp.info/o-gorode/gorozhane has relative URLs in the source code, like
2
(this is the pagination link at the bottom of the page). I'm trying to use make_links_absolute():
import requests
from pyquery import PyQuery as pq
page_url = 'http://govp.info/o-gorode/gorozhane'
resp = requests.get(page_url)
page = pq(resp.text)
page.make_links_absolute(page_url)
but it seems that this breaks the relative links:
print(page.find('a[href*="?page=2"]').attr['href'])
# prints http://govp.info/o-gorode/o-gorode/gorozhane?page=2
# expected value http://govp.info/o-gorode/gorozhane?page=2
As you can see there is doubled o-gorode in the middle of the final URL that definitely will produce 404 error.
Internally pyquery uses urljoin from the standard urllib.parse module, somewhat like this:
from urllib.parse import urljoin
urljoin('http://example.com/one/', 'two')
# -> 'http://example.com/one/two'
It's ok, but there are a lot of sites that have, hmm, unusual relative links with a full path.
And in this case urljoin will give us an invalid absolute link:
urljoin('http://govp.info/o-gorode/gorozhane', 'o-gorode/gorozhane?page=2')
# -> 'http://govp.info/o-gorode/o-gorode/gorozhane?page=2'
I believe such relative links are not very valid, but Google Chrome has no problem to deal with them; so I guess this is kind of normal across the web.
Are there any advice how to solve this problem? I tried furl but it does the same join.

In this particular case, the page in question contains
<base href="http://govp.info/"/>
which instructs the browser to use this for resolving any relative links. The <base> element is optional, but if it's there, you must use it instead of the page's actual URL.
In order to do as the browser does, extract the base href and use it in make_links_absolute().
import requests
from pyquery import PyQuery as pq
page_url = 'http://govp.info/o-gorode/gorozhane'
resp = requests.get(page_url)
page = pq(resp.text)
base = page.find('base').attr['href']
if base is None:
base = page_url # the page's own URL is the fallback
page.make_links_absolute(base)
for a in page.find('a'):
if 'href' in a.attrib and 'govp.info' in a.attrib['href']:
print(a.attrib['href'])
prints
http://govp.info/assets/images/map.png
http://govp.info/podpiska.html
http://govp.info/
http://govp.info/#order
...
http://govp.info/o-gorode/gorozhane
http://govp.info/o-gorode/gorozhane?page=2
http://govp.info/o-gorode/gorozhane?page=3
http://govp.info/o-gorode/gorozhane?page=4
http://govp.info/o-gorode/gorozhane?page=5
http://govp.info/o-gorode/gorozhane?page=6
http://govp.info/o-gorode/gorozhane?page=2
http://govp.info/o-gorode/gorozhane?page=17
http://govp.info/bannerclick/264
...
http://doska.govp.info/cat-biznes-uslugi/
http://doska.govp.info/cat-transport/legkovye-avtomobili/
http://doska.govp.info/
http://govp.info/
which seems to be correct.

Using Beautiful Soup to Scrape content encoded in unicode? [duplicate]

Closed. This question needs details or clarity. It is not currently accepting answers.
Want to improve this question? Add details and clarify the problem by editing this post.
Closed 2 years ago.
Improve this question
How can I retrieve the links of a webpage and copy the url address of the links using Python?

Here's a short snippet using the SoupStrainer class in BeautifulSoup:
import httplib2
from bs4 import BeautifulSoup, SoupStrainer
http = httplib2.Http()
status, response = http.request('http://www.nytimes.com')
for link in BeautifulSoup(response, parse_only=SoupStrainer('a')):
if link.has_attr('href'):
print(link['href'])
The BeautifulSoup documentation is actually quite good, and covers a number of typical scenarios:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
Edit: Note that I used the SoupStrainer class because it's a bit more efficient (memory and speed wise), if you know what you're parsing in advance.

For completeness sake, the BeautifulSoup 4 version, making use of the encoding supplied by the server as well:
from bs4 import BeautifulSoup
import urllib.request
parser = 'html.parser' # or 'lxml' (preferred) or 'html5lib', if installed
resp = urllib.request.urlopen("http://www.gpsbasecamp.com/national-parks")
soup = BeautifulSoup(resp, parser, from_encoding=resp.info().get_param('charset'))
for link in soup.find_all('a', href=True):
print(link['href'])
or the Python 2 version:
from bs4 import BeautifulSoup
import urllib2
parser = 'html.parser' # or 'lxml' (preferred) or 'html5lib', if installed
resp = urllib2.urlopen("http://www.gpsbasecamp.com/national-parks")
soup = BeautifulSoup(resp, parser, from_encoding=resp.info().getparam('charset'))
for link in soup.find_all('a', href=True):
print link['href']
and a version using the requests library, which as written will work in both Python 2 and 3:
from bs4 import BeautifulSoup
from bs4.dammit import EncodingDetector
import requests
parser = 'html.parser' # or 'lxml' (preferred) or 'html5lib', if installed
resp = requests.get("http://www.gpsbasecamp.com/national-parks")
http_encoding = resp.encoding if 'charset' in resp.headers.get('content-type', '').lower() else None
html_encoding = EncodingDetector.find_declared_encoding(resp.content, is_html=True)
encoding = html_encoding or http_encoding
soup = BeautifulSoup(resp.content, parser, from_encoding=encoding)
for link in soup.find_all('a', href=True):
print(link['href'])
The soup.find_all('a', href=True) call finds all <a> elements that have an href attribute; elements without the attribute are skipped.
BeautifulSoup 3 stopped development in March 2012; new projects really should use BeautifulSoup 4, always.
Note that you should leave decoding the HTML from bytes to BeautifulSoup. You can inform BeautifulSoup of the characterset found in the HTTP response headers to assist in decoding, but this can be wrong and conflicting with a <meta> header info found in the HTML itself, which is why the above uses the BeautifulSoup internal class method EncodingDetector.find_declared_encoding() to make sure that such embedded encoding hints win over a misconfigured server.
With requests, the response.encoding attribute defaults to Latin-1 if the response has a text/* mimetype, even if no characterset was returned. This is consistent with the HTTP RFCs but painful when used with HTML parsing, so you should ignore that attribute when no charset is set in the Content-Type header.

Others have recommended BeautifulSoup, but it's much better to use lxml. Despite its name, it is also for parsing and scraping HTML. It's much, much faster than BeautifulSoup, and it even handles "broken" HTML better than BeautifulSoup (their claim to fame). It has a compatibility API for BeautifulSoup too if you don't want to learn the lxml API.
Ian Blicking agrees.
There's no reason to use BeautifulSoup anymore, unless you're on Google App Engine or something where anything not purely Python isn't allowed.
lxml.html also supports CSS3 selectors so this sort of thing is trivial.
An example with lxml and xpath would look like this:
import urllib
import lxml.html
connection = urllib.urlopen('http://www.nytimes.com')
dom = lxml.html.fromstring(connection.read())
for link in dom.xpath('//a/#href'): # select the url in href for all a tags(links)
print link

import urllib2
import BeautifulSoup
request = urllib2.Request("http://www.gpsbasecamp.com/national-parks")
response = urllib2.urlopen(request)
soup = BeautifulSoup.BeautifulSoup(response)
for a in soup.findAll('a'):
if 'national-park' in a['href']:
print 'found a url with national-park in the link'

The following code is to retrieve all the links available in a webpage using urllib2 and BeautifulSoup4:
import urllib2
from bs4 import BeautifulSoup
url = urllib2.urlopen("http://www.espncricinfo.com/").read()
soup = BeautifulSoup(url)
for line in soup.find_all('a'):
print(line.get('href'))

Links can be within a variety of attributes so you could pass a list of those attributes to select.
For example, with src and href attributes (here I am using the starts with ^ operator to specify that either of these attributes values starts with http):
from bs4 import BeautifulSoup as bs
import requests
r = requests.get('https://stackoverflow.com/')
soup = bs(r.content, 'lxml')
links = [item['href'] if item.get('href') is not None else item['src'] for item in soup.select('[href^="http"], [src^="http"]') ]
print(links)
Attribute = value selectors
[attr^=value]
Represents elements with an attribute name of attr whose value is prefixed (preceded) by value.
There are also the commonly used $ (ends with) and * (contains) operators. For a full syntax list see the link above.

Under the hood BeautifulSoup now uses lxml. Requests, lxml & list comprehensions makes a killer combo.
import requests
import lxml.html
dom = lxml.html.fromstring(requests.get('http://www.nytimes.com').content)
[x for x in dom.xpath('//a/#href') if '//' in x and 'nytimes.com' not in x]
In the list comp, the "if '//' and 'url.com' not in x" is a simple method to scrub the url list of the sites 'internal' navigation urls, etc.

just for getting the links, without B.soup and regex:
import urllib2
url="http://www.somewhere.com"
page=urllib2.urlopen(url)
data=page.read().split("</a>")
tag="<a href=\""
endtag="\">"
for item in data:
if "<a href" in item:
try:
ind = item.index(tag)
item=item[ind+len(tag):]
end=item.index(endtag)
except: pass
else:
print item[:end]
for more complex operations, of course BSoup is still preferred.

This script does what your looking for, But also resolves the relative links to absolute links.
import urllib
import lxml.html
import urlparse
def get_dom(url):
connection = urllib.urlopen(url)
return lxml.html.fromstring(connection.read())
def get_links(url):
return resolve_links((link for link in get_dom(url).xpath('//a/#href')))
def guess_root(links):
for link in links:
if link.startswith('http'):
parsed_link = urlparse.urlparse(link)
scheme = parsed_link.scheme + '://'
netloc = parsed_link.netloc
return scheme + netloc
def resolve_links(links):
root = guess_root(links)
for link in links:
if not link.startswith('http'):
link = urlparse.urljoin(root, link)
yield link
for link in get_links('http://www.google.com'):
print link

To find all the links, we will in this example use the urllib2 module together
with the re.module
*One of the most powerful function in the re module is "re.findall()".
While re.search() is used to find the first match for a pattern, re.findall() finds all
the matches and returns them as a list of strings, with each string representing one match*
import urllib2
import re
#connect to a URL
website = urllib2.urlopen(url)
#read html code
html = website.read()
#use re.findall to get all the links
links = re.findall('"((http|ftp)s?://.*?)"', html)
print links

Why not use regular expressions:
import urllib2
import re
url = "http://www.somewhere.com"
page = urllib2.urlopen(url)
page = page.read()
links = re.findall(r"<a.*?\s*href=\"(.*?)\".*?>(.*?)</a>", page)
for link in links:
print('href: %s, HTML text: %s' % (link[0], link[1]))

Here's an example using #ars accepted answer and the BeautifulSoup4, requests, and wget modules to handle the downloads.
import requests
import wget
import os
from bs4 import BeautifulSoup, SoupStrainer
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/eeg-mld/eeg_full/'
file_type = '.tar.gz'
response = requests.get(url)
for link in BeautifulSoup(response.content, 'html.parser', parse_only=SoupStrainer('a')):
if link.has_attr('href'):
if file_type in link['href']:
full_path = url + link['href']
wget.download(full_path)

I found the answer by #Blairg23 working , after the following correction (covering the scenario where it failed to work correctly):
for link in BeautifulSoup(response.content, 'html.parser', parse_only=SoupStrainer('a')):
if link.has_attr('href'):
if file_type in link['href']:
full_path =urlparse.urljoin(url , link['href']) #module urlparse need to be imported
wget.download(full_path)
For Python 3:
urllib.parse.urljoin has to be used in order to obtain the full URL instead.

BeatifulSoup's own parser can be slow. It might be more feasible to use lxml which is capable of parsing directly from a URL (with some limitations mentioned below).
import lxml.html
doc = lxml.html.parse(url)
links = doc.xpath('//a[#href]')
for link in links:
print link.attrib['href']
The code above will return the links as is, and in most cases they would be relative links or absolute from the site root. Since my use case was to only extract a certain type of links, below is a version that converts the links to full URLs and which optionally accepts a glob pattern like *.mp3. It won't handle single and double dots in the relative paths though, but so far I didn't have the need for it. If you need to parse URL fragments containing ../ or ./ then urlparse.urljoin might come in handy.
NOTE: Direct lxml url parsing doesn't handle loading from https and doesn't do redirects, so for this reason the version below is using urllib2 + lxml.
#!/usr/bin/env python
import sys
import urllib2
import urlparse
import lxml.html
import fnmatch
try:
import urltools as urltools
except ImportError:
sys.stderr.write('To normalize URLs run: `pip install urltools --user`')
urltools = None
def get_host(url):
p = urlparse.urlparse(url)
return "{}://{}".format(p.scheme, p.netloc)
if __name__ == '__main__':
url = sys.argv[1]
host = get_host(url)
glob_patt = len(sys.argv) > 2 and sys.argv[2] or '*'
doc = lxml.html.parse(urllib2.urlopen(url))
links = doc.xpath('//a[#href]')
for link in links:
href = link.attrib['href']
if fnmatch.fnmatch(href, glob_patt):
if not href.startswith(('http://', 'https://' 'ftp://')):
if href.startswith('/'):
href = host + href
else:
parent_url = url.rsplit('/', 1)[0]
href = urlparse.urljoin(parent_url, href)
if urltools:
href = urltools.normalize(href)
print href
The usage is as follows:
getlinks.py http://stackoverflow.com/a/37758066/191246
getlinks.py http://stackoverflow.com/a/37758066/191246 "*users*"
getlinks.py http://fakedomain.mu/somepage.html "*.mp3"

There can be many duplicate links together with both external and internal links. To differentiate between the two and just get unique links using sets:
# Python 3.
import urllib
from bs4 import BeautifulSoup
url = "http://www.espncricinfo.com/"
resp = urllib.request.urlopen(url)
# Get server encoding per recommendation of Martijn Pieters.
soup = BeautifulSoup(resp, from_encoding=resp.info().get_param('charset'))
external_links = set()
internal_links = set()
for line in soup.find_all('a'):
link = line.get('href')
if not link:
continue
if link.startswith('http'):
external_links.add(link)
else:
internal_links.add(link)
# Depending on usage, full internal links may be preferred.
full_internal_links = {
urllib.parse.urljoin(url, internal_link)
for internal_link in internal_links
}
# Print all unique external and full internal links.
for link in external_links.union(full_internal_links):
print(link)

import urllib2
from bs4 import BeautifulSoup
a=urllib2.urlopen('http://dir.yahoo.com')
code=a.read()
soup=BeautifulSoup(code)
links=soup.findAll("a")
#To get href part alone
print links[0].attrs['href']

Using BeautifulSoup to Download Links From A WebPage

I wrote a function to find all .pdf files from a web-page & download them. It works well when the link is publicly accessible but when I use it for a course website (which can only be accessed on my university's internet), the pdfs downloaded are corrupted and cannot be opened.
How can I fix it?
def get_pdfs(my_url):
html = urllib2.urlopen(my_url).read()
html_page = BeautifulSoup(html)
current_link = ''
links = []
for link in html_page.find_all('a'):
current_link = link.get('href')
if current_link.endswith('pdf'):
links.append(my_url + current_link)
print(links)
for link in links:
#urlretrieve(link)
wget.download(link)
get_pdfs('https://grader.eecs.jacobs-university.de/courses/320241/2019_2/')
When I use this grader link, the current_link is something like /courses/320241/2019_2/lectures/lecture_7_8.pdf but the /courses/320241/2019_2/ part is already included in the my_url and when I append it, it obviously doesn't work. However, the function works perfectly for [this link][1]:
Is there a way I can use the same function to work with both types of links?

OK, I think I understand the issue now. Try the code below on your data. I think it works, but obviously I couldn't try it directly on the page requiring login. Also, I changed your structure and variable definitions a bit, because I find it easier to think that way, but if it works, you can easily modify it to suit your own tastes.
Anyway, here goes:
import requests
from bs4 import BeautifulSoup as bs
from urllib.parse import urlparse
my_urls = ['https://cnds.jacobs-university.de/courses/os-2019/', 'https://grader.eecs.jacobs-university.de/courses/320241/2019_2']
links = []
for url in my_urls:
resp = requests.get(url)
soup = bs(resp.text,'lxml')
og = soup.find("meta", property="og:url")
base = urlparse(url)
for link in soup.find_all('a'):
current_link = link.get('href')
if current_link.endswith('pdf'):
if og:
links.append(og["content"] + current_link)
else:
links.append(base.scheme+"://"+base.netloc + current_link)
for link in links:
print(link)

Predict if sites returns the same content

I am writing a web crawler, but I have a problem with function which recursively calls links.
Let's suppose I have a page: http://en.wikipedia.org/wiki/Stirling_numbers_of_the_second_kind.
I am looking for all links, and then open each link recursively, downloading again all links etc.
The problem is, that some links, although have different urls, drive to the same page, for example:
http://en.wikipedia.org/wiki/Stirling_numbers_of_the_second_kind#mw-navigation
gives the same page as the previous link.
And I have an infinite loop.
Is any possibility to check if two links drive to the same page without comparing the all content of this pages?

You can store the hash of the content of pages previously seen and check if the page has already been seen before continuing.

No need to make extra requests to the same page.
You can use urlparse() and check if the .path part of the base url and the link you crawl is the same:
from urllib2 import urlopen
from urlparse import urljoin, urlparse
from bs4 import BeautifulSoup
url = "http://en.wikipedia.org/wiki/Stirling_numbers_of_the_second_kind"
base_url = urlparse(url)
soup = BeautifulSoup(urlopen(url))
for link in soup.find_all('a'):
if 'href' in link.attrs:
url = urljoin(url, link['href'])
print url, urlparse(url).path == base_url.path
Prints:
http://en.wikipedia.org/wiki/Stirling_numbers_of_the_second_kind#mw-navigation True
http://en.wikipedia.org/wiki/Stirling_numbers_of_the_second_kind#p-search True
http://en.wikipedia.org/wiki/File:Set_partitions_4;_Hasse;_circles.svg False
...
http://en.wikipedia.org/wiki/Equivalence_relation False
...
http://en.wikipedia.org/wiki/Stirling_numbers_of_the_second_kind True
...
https://www.mediawiki.org/ False
This particular example uses BeautifulSoup to parse the wikipedia page and get all links, but the actual html parser here is not really important. Important is that you parse the links and get the path to check.

Scrape specific urls from a page and convert them to absolute urls

I need some help from you Pythonists: I'm scraping all urls starting with "details.php?" from this page and ignoring all other urls.
Then I need to convert every url I just scraped to an absolute url, so I can scrape them one by one. The absolute urls start with: http://evenementen.uitslagen.nl/2013/marathonrotterdam/details.php?...
I tried using re.findall like this:
html = scraperwiki.scrape(url)
if html is not None:
endofurl = re.findall("details.php?(.*?)>", html)
This gets me a list, but then I get stuck. Can anybody help me out?

You can use urlparse.urljoin() to create the full urls:
>>> import urlparse
>>> base_url = 'http://evenementen.uitslagen.nl/2013/marathonrotterdam/'
>>> urlparse.urljoin(base_url, 'details.php?whatever')
'http://evenementen.uitslagen.nl/2013/marathonrotterdam/details.php?whatever'
You can use a list comprehension to do this for all of your urls:
full_urls = [urlparse.urljoin(base_url, url) for url in endofurl]

If you need the final urls one by one and be done with them, you should use generator instead of the iterators.
abs_url = "url data"
urls = (abs_url+url for url in endofurl)
If you are worried about encoding the url you can use urllib.urlencode(url)

Ah! My favorite...list comprehensions!
base_url = 'http://evenementen.uitslagen.nl/2013/marathonrotterdam/{0}'
urls = [base.format(x) for x in list_of_things_you_scraped]
I'm not a regex genius, so you may need to fiddle with base_url until you get it exactly right.

If you'd like to use lxml.html to parse html; there is .make_links_absolute():
import lxml.html
html = lxml.html.make_links_absolute(html,
base_href="http://evenementen.uitslagen.nl/2013/marathonrotterdam/")

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

sanitize && build url - python

Related

make_links_absolute() results in broken absolute URLs

Using Beautiful Soup to Scrape content encoded in unicode? [duplicate]

Using BeautifulSoup to Download Links From A WebPage

Predict if sites returns the same content

Scrape specific urls from a page and convert them to absolute urls

Categories

Resources