Get the final URL of a page in Python

My question might be a bit weird.
I have some pages with different URLs that all end up on the same page. Can I get that main URL from the old URL in Python? For example:
1) https://www.verisk.com/insurance/products/iso-forms/
2) https://www.verisk.com/insurance/products/forms-library-on-isonet/
Both end up on the same page, which is:
https://www.verisk.com/insurance/products/iso-forms/
So for each URL, can I find out the final URL where it lands using Python? (I have a list of 1k URLs.) And I want another list of where those URLs land, in corresponding order!

Here's one way of doing it, using the requests library:
import requests

def get_redirected_url(url):
    # stream=True defers downloading the response body; we only need the final URL
    response = requests.get(url, stream=True)
    response.close()
    return response.url
This is a very simplified example; in a real implementation you'd want error handling, probably delayed retries, and possibly a check of what kind of redirect you're getting (permanent redirects only?).
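To sketch what that hardened version might look like (the retry count, backoff, and timeout below are arbitrary choices, not part of the original answer):

```python
import time
import requests

def get_final_url(url, retries=3, timeout=10):
    """Return the URL a request finally lands on, or None on failure."""
    for attempt in range(retries):
        try:
            # stream=True defers downloading the body; only the URL is needed
            response = requests.get(url, stream=True, timeout=timeout)
            response.close()
            return response.url
        except requests.RequestException:
            time.sleep(2 ** attempt)  # simple exponential backoff before retrying
    return None
```

To restrict the check to permanent redirects, you could also inspect response.history, whose entries carry the status code (301 vs. 302) of each hop.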

Simple approach with urllib.request:
from urllib.request import urlopen

resp = urlopen("http://sitey.com/redirect")
print(resp.url)  # the URL after any redirects have been followed
Might want to use threads if you're doing 1,000 URLs...
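A thread-pool sketch for that case, using only the standard library on top of requests (get_redirected_url is the function from the first answer; the worker count is a guess):

```python
from concurrent.futures import ThreadPoolExecutor

import requests

def get_redirected_url(url):
    # stream=True defers downloading the response body
    response = requests.get(url, stream=True, timeout=10)
    response.close()
    return response.url

def resolve_all(urls, resolver=get_redirected_url, workers=20):
    # executor.map preserves input order, so results line up with urls
    with ThreadPoolExecutor(max_workers=workers) as executor:
        return list(executor.map(resolver, urls))
```

resolve_all(list_of_1000_urls) returns the landing URLs in the same order as the input, which gives you the corresponding second list the question asks for.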

Related

Python Requests Library - Scraping separate JSON and HTML responses from POST request

I'm new to web scraping, programming, and StackOverflow, so I'll try to phrase things as clearly as I can.
I'm using the Python requests library to try to scrape some info from a local movie theatre chain. When I look at the Chrome developer tools response/preview tabs in the network section, I can see what appears to be very clean and useful JSON:
However, when I try to use requests to obtain this same info, instead I get the entire page content (pages upon pages of html). Upon further inspection of the cascade in the Chrome developer tools, I can see there are two events called GetNowPlayingByCity: One contains the JSON info while the other seems to be the HTML.
(screenshots omitted: one GetNowPlayingByCity response contains JSON, the other HTML)
How can I separate the two and only obtain the JSON response using the Python requests library?
I have already tried modifying the headers within requests.post (the Chrome developer tools indicate this is a post method) to include "accept: application/json, text/plain, */*" but didn't see a difference in the response I was getting with requests.post. As it stands I can't parse any JSON from the response I get with requests.post and get the following error:
"json.decoder.JSONDecodeError: Expecting value: line 4 column 1 (char 3)"
I can always try to parse the full HTML, but it's so long and complex I would much rather work with friendly JSON info. Any help would be much appreciated!
This is probably because the JavaScript the page sends to your browser makes a separate request to an API to get the JSON info about the movies.
You could try sending the request directly to their API (see edit 2), parse the HTML with a library like Beautiful Soup, or use a dedicated scraping library in Python. I've had great experiences with scrapy; it is much faster than requests.
Edit:
If the page uses dynamically loaded content, which I think is the case, you'd have to use selenium with the PhantomJS browser instead of requests. Here is an example:
from bs4 import BeautifulSoup
from selenium import webdriver

url = "your url"
browser = webdriver.PhantomJS()
browser.get(url)
html = browser.page_source  # the DOM after the JavaScript has run
browser.quit()
soup = BeautifulSoup(html, 'lxml')
# then parse the html here
Or you could load the dynamic content with scrapy
I recommend the latter if you want to get into scraping. It would take a bit more time to learn but it is a better solution.
Edit 2:
To make a request directly to their API you can just reproduce the request you see. In Google Chrome, you can see the request details if you click on it and go to 'Headers':
After that, you simply reproduce the request using the requests library:
import requests
import json

url = 'http://paste.the.url/?here='
response = requests.get(url)
content = response.content
# in my case content was a byte string
# (it looks like b'data' instead of 'data' when you print it)
# if that's your case too, decode it to a string, like so
content_string = content.decode()
content_json = json.loads(content_string)
# do whatever you like with the data
You can modify the URL as you see fit; for example, if it is something like http://api.movies.com/?page=1&movietype=3 you could change movietype=3 to movietype=2 to see a different type of movie, etc.
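Incidentally, requests can build that query string for you from a dictionary via the params argument; preparing the request shows the resulting URL without sending anything (the endpoint here is the made-up one from the example above):

```python
import requests

# hypothetical endpoint, mirroring the ?page=1&movietype=3 example
req = requests.Request("GET", "http://api.movies.com/",
                       params={"page": 1, "movietype": 2})
prepared = req.prepare()
print(prepared.url)  # http://api.movies.com/?page=1&movietype=2
```

In normal use you would just call requests.get(url, params={...}) and let it send the request directly.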

Invalid Argument in Open method for web scraping

I am trying to scrape some data from Ancestry. I have a .NET background but thought I'd try a bit of Python for a project.
I'm falling at the first step. First, I am trying to open this page and then just print out the rows.
from requests import get
from requests.exceptions import RequestException
from contextlib import closing
from bs4 import BeautifulSoup

raw_html = open('https://www.ancestry.co.uk/search/collections/britisharmyservice/?birth=_merthyr+tydfil-wales-united+kingdom_1651442').read()
html = BeautifulSoup(raw_html, 'html.parser')
for p in html.select('tblrow record'):
    print(p)
I am getting an illegal argument on open.
According to documentation, open is used to:
Open [a] file and return a corresponding file object.
As such, you cannot use it for downloading the HTML contents of a webpage. You probably meant to use requests.get as follows:
raw_html = get('https://www.ancestry.co.uk/search/collections/britisharmyservice/?birth=_merthyr+tydfil-wales-united+kingdom_1651442').text
# .text gets the raw text of the response
# (http://docs.python-requests.org/en/master/api/#requests.Response.text)
Here are a few recommendations to improve your code as well:
requests.get provides many useful parameters, one of them being params, which allows you to provide the URL parameters in the form of a Python dictionary.
If you need to verify whether the request was successful before accessing its text, then just check if the returned response.status_code == requests.codes.ok. This only covers status code 200, but if you need more codes, then response.raise_for_status should be helpful.
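Putting those two recommendations together, a minimal sketch (the parameter value comes from the question's URL; the timeout is my own addition):

```python
import requests

def fetch_search_page(birth):
    url = "https://www.ancestry.co.uk/search/collections/britisharmyservice/"
    # params builds the ?birth=... query string for you
    response = requests.get(url, params={"birth": birth}, timeout=10)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx
    return response.text
```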

Scraping Metacritic with urllib to follow redirect

I'm working on a Python script to scrape information from Metacritic. It works fine for most movies but it has issues with movies that Metacritic redirects.
For example on the list of movies, Metacritic provides the url "/movie/red-riding-in-the-year-of-our-lord-1983" but when you click that URL it brings you to "/movie/red-riding-trilogy". I need urllib to fetch the HTML of the final URL it ends up at.
Try using urllib.request; urlopen follows redirects by default, and the response object records where you ended up:
import urllib.request

resp = urllib.request.urlopen("your url")
print(resp.url)  # the final URL after redirects
I ended up using the requests module (http://docs.python-requests.org/en/latest/). Here is the code for the request and the line that saves the final URL:
import requests

response = requests.get(url)
newUrl = response.url

Check if a page is an HTML page in Python?

I am trying to write a web crawler in Python. I want to check whether the page I am about to crawl is an HTML page, and not a file like .pdf/.doc/.docx etc. I do not want to check the .html extension, since asp, aspx, or pages like http://bing.com/travel/ do not have .html extensions explicitly but are still HTML pages. Is there a good way to do this in Python?
This gets the header only from the server (Python 2 code, using urllib2):
import urllib2

url = 'http://www.kernel.org/pub/linux/kernel/v3.0/testing/linux-3.7-rc6.tar.bz2'
req = urllib2.Request(url)
req.get_method = lambda: 'HEAD'  # send a HEAD request instead of GET
response = urllib2.urlopen(req)
content_type = response.headers.getheader('Content-Type')
print(content_type)
prints
application/x-bzip2
From which you could conclude this is not HTML. You could use
'html' in content_type
to programmatically test if the content is HTML (or possibly XHTML).
If you wanted to be even more sure the content is HTML you could download the contents and try to parse it with an HTML parser like lxml or BeautifulSoup.
Beware of using requests.get like this:
import requests
r = requests.get(url)
print(r.headers['content-type'])
This takes a long time and my network monitor shows a sustained load leading me to believe this is downloading the entire file, not just the header.
On the other hand,
import requests
r = requests.head(url)
print(r.headers['content-type'])
gets the header only.
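Combining the two ideas above — a HEAD request plus the 'html' in content_type test — a small helper might look like this (allow_redirects=True is my own addition, since the final document may sit behind a redirect; treating any network error as "not HTML" is also a simplification):

```python
import requests

def is_html(url):
    """Best-effort check of the Content-Type header via a HEAD request."""
    try:
        response = requests.head(url, allow_redirects=True, timeout=10)
    except requests.RequestException:
        return False
    # e.g. 'text/html; charset=UTF-8' -> True, 'application/x-bzip2' -> False
    return "html" in response.headers.get("content-type", "")
```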
Don't bother with what the standard library throws at you; rather, try requests:
>>> import requests
>>> r = requests.get("http://www.google.com")
>>> r.headers['content-type']
'text/html; charset=ISO-8859-1'

Data from a custom Google Map

I would like to collect data from this page
xxx
My experience level with Python and BeautifulSoup is beginner. However, I don't think it has to be very advanced for what I need to do, except for the issue that I am describing below.
The page that I need to collect data from lists the active properties for sale listed on MLS for the Greater Toronto Area.
At the right side of the map there are some checkboxes that you must select in order to get your data, and this is where my problem is. If I use a browser, a local cookie is used to remember the previous selections and the data is displayed.
I would like to know either of these:
1) how I can pass all the params (selections) in my initial request from Python
2) how to use the Chrome cookie with Python so I can get a page return that actually contains data
A code example would be great but sending me to links that I should read would also work.
Thanks a lot
PF
If you insist on using urllib2 over Requests, I suggest looking into cookielib.
Here is an example:
import urllib2
import cookielib
from BeautifulSoup import BeautifulSoup

cookiejar = cookielib.CookieJar()
opener = urllib2.build_opener(
    urllib2.HTTPRedirectHandler(),
    urllib2.HTTPHandler(debuglevel=0),
    urllib2.HTTPSHandler(debuglevel=0),
    urllib2.HTTPCookieProcessor(cookiejar),
)
This way you're creating a cookiejar to hold cookies, building an opener, and establishing your cookie processor by passing it the cookiejar. This should take care of your cookie issue. At this point, instead of using urllib2.urlopen(url), just use your custom opener: opener.open(url)
url = 'http://www.somesite.com/'
fp = opener.open(url)
html_object = BeautifulSoup(fp)
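For comparison, the same cookie handling is shorter with the requests library: a Session keeps a cookie jar across calls automatically, and you can also inject cookie values copied from Chrome's developer tools by hand (the cookie name, value, and domain below are placeholders):

```python
import requests

session = requests.Session()  # cookies set by the server persist on this session

# injecting a cookie manually, e.g. one copied from Chrome's developer tools
session.cookies.set("selection", "checkbox1,checkbox2", domain="www.somesite.com")

# any later session.get("http://www.somesite.com/...") call now sends
# that cookie along automatically, like the urllib2 opener above
```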
