make_links_absolute() results in broken absolute URLs - python

I need to convert relative URLs from an HTML page to absolute ones. I'm using pyquery for parsing.
For instance, this page http://govp.info/o-gorode/gorozhane has relative URLs in the source code, like o-gorode/gorozhane?page=2 (this is the pagination link at the bottom of the page). I'm trying to use make_links_absolute():
import requests
from pyquery import PyQuery as pq
page_url = 'http://govp.info/o-gorode/gorozhane'
resp = requests.get(page_url)
page = pq(resp.text)
page.make_links_absolute(page_url)
but it seems that this breaks the relative links:
print(page.find('a[href*="?page=2"]').attr['href'])
# prints http://govp.info/o-gorode/o-gorode/gorozhane?page=2
# expected value http://govp.info/o-gorode/gorozhane?page=2
As you can see, there is a doubled o-gorode in the middle of the final URL, which will definitely produce a 404 error.
Internally pyquery uses urljoin from the standard urllib.parse module, something like this:
from urllib.parse import urljoin
urljoin('http://example.com/one/', 'two')
# -> 'http://example.com/one/two'
It's OK, but there are a lot of sites that have, hmm, unusual relative links with a full path.
In cases like this, urljoin gives us an invalid absolute link:
urljoin('http://govp.info/o-gorode/gorozhane', 'o-gorode/gorozhane?page=2')
# -> 'http://govp.info/o-gorode/o-gorode/gorozhane?page=2'
I believe such relative links are not strictly valid, but Google Chrome has no problem dealing with them, so I guess this is fairly common across the web.
Is there any advice on how to solve this problem? I tried furl, but it does the same join.

In this particular case, the page in question contains
<base href="http://govp.info/"/>
which instructs the browser to use this for resolving any relative links. The <base> element is optional, but if it's there, you must use it instead of the page's actual URL.
In order to do as the browser does, extract the base href and use it in make_links_absolute().
import requests
from pyquery import PyQuery as pq
page_url = 'http://govp.info/o-gorode/gorozhane'
resp = requests.get(page_url)
page = pq(resp.text)
base = page.find('base').attr['href']
if base is None:
    base = page_url  # the page's own URL is the fallback
page.make_links_absolute(base)
for a in page.find('a'):
    if 'href' in a.attrib and 'govp.info' in a.attrib['href']:
        print(a.attrib['href'])
prints
http://govp.info/assets/images/map.png
http://govp.info/podpiska.html
http://govp.info/
http://govp.info/#order
...
http://govp.info/o-gorode/gorozhane
http://govp.info/o-gorode/gorozhane?page=2
http://govp.info/o-gorode/gorozhane?page=3
http://govp.info/o-gorode/gorozhane?page=4
http://govp.info/o-gorode/gorozhane?page=5
http://govp.info/o-gorode/gorozhane?page=6
http://govp.info/o-gorode/gorozhane?page=2
http://govp.info/o-gorode/gorozhane?page=17
http://govp.info/bannerclick/264
...
http://doska.govp.info/cat-biznes-uslugi/
http://doska.govp.info/cat-transport/legkovye-avtomobili/
http://doska.govp.info/
http://govp.info/
which seems to be correct.
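For reference, this is the urljoin behaviour that make_links_absolute() relies on; a minimal sketch showing why the base href matters:
from urllib.parse import urljoin
# joining against the page URL reproduces the broken link
urljoin('http://govp.info/o-gorode/gorozhane', 'o-gorode/gorozhane?page=2')
# -> 'http://govp.info/o-gorode/o-gorode/gorozhane?page=2'
# joining against the <base href> gives the URL the browser builds
urljoin('http://govp.info/', 'o-gorode/gorozhane?page=2')
# -> 'http://govp.info/o-gorode/gorozhane?page=2'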

Related

How to use requests library to webscrape a list of links already scraped

I have scraped a set of links off a website (https://www.gmcameetings.co.uk) - all the links containing the word 'meetings', i.e. the meeting papers, which are now contained in 'meeting_links'. I now need to follow each of those links to scrape some more links within them.
I've gone back to using the requests library and tried
r2 = requests.get("meeting_links")
But it returns the following error:
MissingSchema: Invalid URL 'list_meeting_links': No schema supplied.
Perhaps you meant http://list_meeting_links?
Which I've changed it to but still no difference.
This is my code so far and how I got the links from the first url that I wanted.
# importing libaries and defining
import requests
import urllib.request
import time
from bs4 import BeautifulSoup as bs
# set url
url = "https://www.gmcameetings.co.uk/"
# grab html
r = requests.get(url)
page = r.text
soup = bs(page,'lxml')
# creating folder to store pdfs - if it does not exist, create a separate folder
folder_location = r'E:\Internship\WORK'
# getting all meeting href off url
meeting_links = soup.find_all('a',href='TRUE')
for link in meeting_links:
    print(link['href'])
    if link['href'].find('/meetings/') > 1:
        print("Meeting!")
# second set of links
r2 = requests.get("meeting_links")
Do I need to do something with the 'meeting_links' before I can start using the requests library again? I'm completely lost.
As I understand it, your new requests should go here:
for link in meeting_links:
    if link['href'].find('/meetings/') > 1:
        r2 = requests.get(link['href'])
        # do something with the response
It looks like you are trying to pass a literal string to requests.get(). The call should look like this:
requests.get('https://example.com')
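Putting that together with the setup from the question, a rough sketch (the variable names are mine, and I've added urljoin in case any of the hrefs turn out to be relative; note that href=True, the boolean, is how bs4 filters for tags that actually have an href):
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup as bs

url = "https://www.gmcameetings.co.uk/"
soup = bs(requests.get(url).text, 'lxml')

# collect the meeting links, resolving any relative hrefs against the page URL
meeting_links = [urljoin(url, a['href'])
                 for a in soup.find_all('a', href=True)
                 if '/meetings/' in a['href']]

for link in meeting_links:
    r2 = requests.get(link)  # each item is a real URL string, not the name of the list
    inner_soup = bs(r2.text, 'lxml')
    # ... scrape whatever further links you need from inner_soup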

How to get URL out of href that is itself a hyperlink?

I'm using Python and lxml to try to scrape this html page. The problem I'm running into is trying to get the URL out of this hyperlink text "Chapter02a". (Note that I can't seem to get the link formatting to work here).
<li><a href="Chapter02a">Examples of Operations</a></li>
I have tried
//ol[@id="ProbList"]/li/a/@href
but that only gives me the text "Chapter02a".
Also:
//ol[@id="ProbList"]/li/a
This returns an lxml.html.HtmlElement object, and none of the properties that I found in the documentation accomplish what I'm trying to do.
from lxml import html
import requests
chapter_req = requests.get('https://www.math.wisc.edu/~mstemper2/Math/Pinter/Chapter02')
chapter_html = html.fromstring(chapter_req.content)
sections = chapter_html.xpath('//ol[@id="ProbList"]/li/a/@href')
print(sections[0])
I want sections to be a list of URLs to the subsections.
The return you are seeing is correct because Chapter02a is a "relative" link to the next section. The full url is not listed because that is not how it is stored in the html.
To get the full urls you can use:
url_base = 'https://www.math.wisc.edu/~mstemper2/Math/Pinter/'
sections = chapter_html.xpath('//ol[@id="ProbList"]/li/a/@href')
section_urls = [url_base + s for s in sections]
You can also do the concatenation directly at the XPATH level to regenerate the URL from the relative link:
from lxml import html
import requests
chapter_req = requests.get('https://www.math.wisc.edu/~mstemper2/Math/Pinter/Chapter02')
chapter_html = html.fromstring(chapter_req.content)
sections = chapter_html.xpath('concat("https://www.math.wisc.edu/~mstemper2/Math/Pinter/",//ol[@id="ProbList"]/li/a/@href)')
print(sections)
output:
https://www.math.wisc.edu/~mstemper2/Math/Pinter/Chapter02A
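If you would rather not hard-code the directory prefix, urljoin from the standard library resolves the relative href against the page URL the same way a browser would; a small sketch of that variant:
from urllib.parse import urljoin
from lxml import html
import requests

page_url = 'https://www.math.wisc.edu/~mstemper2/Math/Pinter/Chapter02'
chapter_html = html.fromstring(requests.get(page_url).content)
hrefs = chapter_html.xpath('//ol[@id="ProbList"]/li/a/@href')
# resolve each relative href (e.g. "Chapter02a") against the chapter page itself
section_urls = [urljoin(page_url, h) for h in hrefs]
print(section_urls)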

Go through to original URL on social media management websites

I'm doing web scraping as part of an academic project, where it's important that all links are followed through to the actual content. Annoyingly, there are some important error cases with "social media management" sites, where users post their links to detect who clicks on them.
For instance, consider this link on linkis.com, which links to http:// + bit.ly + /1P1xh9J (separated link due to SO posting restrictions), which in turn links to http://conservatives4palin.com. The issue arises as the original link at linkis.com does not automatically redirect forward. Instead, the user has to click the cross in the top right corner to go to the original URL.
Furthermore, there seem to be different variations (see e.g. linkis.com link 2, where the cross is at the bottom left of the website). These are the only two variations I've found, but there might be more. Note that I'm using a web scraper very similar to this one. The functionality to go through to the actual link does not need to be stable/functioning over time, as this is a one-time academic project.
How do I automatically go on to the original URL? Would the best approach be to design a regex that finds the relevant link?
In many cases you will have to use browser automation to scrape web pages that generate their content with JavaScript; scraping the HTML returned by a plain GET request will not yield the result you want. You have two options here:
Try to work your way around all the additional JavaScript requests to get the content you want, which can be very time consuming.
Use browser automation, which lets you open a real browser and automate its tasks; you can use Selenium for that.
I have been developing bots and scrapers for years now, and if the webpage you are requesting relies heavily on JavaScript, you should use something like Selenium.
Here is some code to get you started with selenium:
from selenium import webdriver
#Create a chrome browser instance, other drivers are also available
driver = webdriver.Chrome()
#Request a page
driver.get('http://linkis.com/conservatives4palin.com/uGXam')
#Select elements on the page and trigger events
#Selenium supports also xpath and css selectors
#Clicks the tag with the given id
driver.find_element_by_id('some_id').click()
The common pattern these websites follow is to show the original site inside an iframe. The sample code below handles both of the cases you mention.
In order to get the final URL you can do something like this:
import requests
from bs4 import BeautifulSoup
urls = ["http://linkis.com/conservatives4palin.com/uGXam", "http://linkis.com/paper.li/gsoberon/jozY2"]
response_data = []
for url in urls:
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    short_url = soup.find("iframe", {"id": "source_site"})['src']
    response_data.append(requests.get(short_url).url)
print(response_data)
Based on the two websites you gave, I think you can try the following code to get the original URL, since it is hidden in a piece of JavaScript (the main scraper code I am using is from the question you posted):
try:
    from HTMLParser import HTMLParser
except ImportError:
    from html.parser import HTMLParser
import requests, re
from contextlib import closing
CHUNKSIZE = 1024
reurl = re.compile("\"longUrl\":\"(.*?)\"")
buffer = ""
htmlp = HTMLParser()
with closing(requests.get("http://linkis.com/conservatives4palin.com/uGXam", stream=True)) as res:
    for chunk in res.iter_content(chunk_size=CHUNKSIZE, decode_unicode=True):
        buffer = "".join([buffer, chunk])
        match = reurl.search(buffer)
        if match:
            print(htmlp.unescape(match.group(1)).replace('\\', ''))
            break
Say you're able to grab the href attribute/value:
s = 'href="/url/go/?url=http%3A%2F%2Fbit.ly%2F1P1xh9J"'
then you need to do the following:
import urllib.parse
s=s.partition('http')
s=s[1]+urllib.parse.unquote(s[2][0:-1])
s=urllib.parse.unquote(s)
and s will now be a string containing the original bit.ly link!
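An arguably less fragile variant of the same idea is to let urllib.parse pull the url query parameter out of the href instead of slicing the string by hand; a sketch, assuming the href has the /url/go/?url=... shape shown above:
from urllib.parse import urlparse, parse_qs

href = '/url/go/?url=http%3A%2F%2Fbit.ly%2F1P1xh9J'
query = urlparse(href).query          # 'url=http%3A%2F%2Fbit.ly%2F1P1xh9J'
original = parse_qs(query)['url'][0]  # parse_qs percent-decodes the value
print(original)                       # http://bit.ly/1P1xh9J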
try the following code:
import requests
url = 'http://'+'bit.ly'+'/1P1xh9J'
realsite = requests.get(url)
print(realsite.url)
it prints the desired output:
http://conservatives4palin.com/2015/11/robert-tracinski-the-climate-change-inquisition-begins.html?utm_source=twitterfeed&utm_medium=twitter
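This works because requests follows HTTP redirects by default and exposes the final address as .url; if you also want to inspect the intermediate hops, .history holds the redirect chain:
import requests

realsite = requests.get('http://bit.ly/1P1xh9J')
print(realsite.url)                        # final URL after all redirects
print([r.url for r in realsite.history])   # the intermediate redirect responses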

Python HTML parsing: getting site top level hosts

I have a program that takes in a site's source code/html and outputs the a href tags - it is extremely helpful and makes use of BeautifulSoup4.
I want a variation of this code that looks only at <a href="..."> tags but returns just the top-level host names from a site's source code, for example
stackoverflow.com
google.com
etc., but NOT lower-level ones like stackoverflow.com/questions/. Right now it's outputting everything, including /, #t8, etc., and I need to filter them out.
Here is my current code I use to extract all a href tags.
import sys
import urllib
from bs4 import BeautifulSoup

url = sys.argv[1]  # when the program is invoked, it takes in a URL like www.google.com
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)
# get hosts
for a in soup.find_all('a', href=True):
    print a['href']
Thank you!
It sounds like you're looking for the .netloc attribute of urlparse. It's part of the Python standard library: https://docs.python.org/2/library/urlparse.html
For example:
>>> from urlparse import urlparse
>>> url = "http://stackoverflow.com/questions/26351727/python-html-parsing-getting-site-top-level-hosts"
>>> urlparse(url).netloc
'stackoverflow.com'
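Applied to the loop from the question, a sketch (written with Python 3's urllib.parse; the example URL is just illustrative) that keeps only the unique host names could look like this:
from urllib.parse import urlparse
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = 'http://www.google.com'
soup = BeautifulSoup(urlopen(url).read(), 'html.parser')

hosts = set()
for a in soup.find_all('a', href=True):
    netloc = urlparse(a['href']).netloc
    if netloc:  # empty for relative links like '/' or '#t8', so they get skipped
        hosts.add(netloc)

for host in sorted(hosts):
    print(host)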

Predict if sites return the same content

I am writing a web crawler, but I have a problem with the function which recursively follows links.
Let's suppose I have a page: http://en.wikipedia.org/wiki/Stirling_numbers_of_the_second_kind.
I am looking for all links, and then open each link recursively, downloading again all links etc.
The problem is that some links, although they have different URLs, lead to the same page, for example:
http://en.wikipedia.org/wiki/Stirling_numbers_of_the_second_kind#mw-navigation
gives the same page as the previous link.
So I end up with an infinite loop.
Is there any way to check whether two links lead to the same page without comparing the entire content of those pages?
You can store the hash of the content of pages previously seen and check if the page has already been seen before continuing.
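A minimal sketch of that idea, hashing each fetched body and skipping pages whose hash has been seen before (the helper name is just illustrative):
import hashlib
import requests

seen_hashes = set()

def is_new_page(url):
    """Fetch the page and report whether its content has been seen before."""
    digest = hashlib.sha256(requests.get(url).content).hexdigest()
    if digest in seen_hashes:
        return False
    seen_hashes.add(digest)
    return True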
No need to make extra requests to the same page.
You can use urlparse() and check whether the .path part of the base URL and of the link you crawl are the same:
from urllib2 import urlopen
from urlparse import urljoin, urlparse
from bs4 import BeautifulSoup
url = "http://en.wikipedia.org/wiki/Stirling_numbers_of_the_second_kind"
base_url = urlparse(url)
soup = BeautifulSoup(urlopen(url))
for link in soup.find_all('a'):
    if 'href' in link.attrs:
        link_url = urljoin(url, link['href'])
        print link_url, urlparse(link_url).path == base_url.path
Prints:
http://en.wikipedia.org/wiki/Stirling_numbers_of_the_second_kind#mw-navigation True
http://en.wikipedia.org/wiki/Stirling_numbers_of_the_second_kind#p-search True
http://en.wikipedia.org/wiki/File:Set_partitions_4;_Hasse;_circles.svg False
...
http://en.wikipedia.org/wiki/Equivalence_relation False
...
http://en.wikipedia.org/wiki/Stirling_numbers_of_the_second_kind True
...
https://www.mediawiki.org/ False
This particular example uses BeautifulSoup to parse the Wikipedia page and get all the links, but the actual HTML parser here is not really important. What is important is that you parse the links and compare their paths.
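Another way to phrase the same check is to canonicalize each URL before putting it in your visited set: resolving the link and dropping the fragment makes the two Wikipedia links above compare equal. A sketch using Python 3's urllib.parse:
from urllib.parse import urljoin, urldefrag

def canonical(base, href):
    # resolve the link and strip the #fragment so anchors map to one page
    return urldefrag(urljoin(base, href)).url

base = 'http://en.wikipedia.org/wiki/Stirling_numbers_of_the_second_kind'
visited = set()
visited.add(canonical(base, '#mw-navigation'))
print(canonical(base, '#p-search') in visited)  # True: both anchors point at the same page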
