Check if a page is an HTML page in Python?

I am trying to write code in Python for a web crawler. I want to check whether the page I am about to crawl is an HTML page and not a page like .pdf/.doc/.docx etc. I do not want to check by the .html extension, since asp, aspx, or pages like http://bing.com/travel/ do not have .html extensions explicitly but are still HTML pages. Is there a good way to do this in Python?

This gets the header only from the server:
import urllib2

url = 'http://www.kernel.org/pub/linux/kernel/v3.0/testing/linux-3.7-rc6.tar.bz2'
req = urllib2.Request(url)
req.get_method = lambda: 'HEAD'  # ask the server for headers only, no body
response = urllib2.urlopen(req)
content_type = response.headers.getheader('Content-Type')
print(content_type)
prints
application/x-bzip2
From which you could conclude this is not HTML. You could use
'html' in content_type
to programmatically test if the content is HTML (or possibly XHTML).
If you wanted to be even more sure the content is HTML you could download the contents and try to parse it with an HTML parser like lxml or BeautifulSoup.
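For instance, a minimal sketch of that idea (my addition; it assumes Python 2 to match the code above and that bs4 is installed):

import urllib2
from bs4 import BeautifulSoup

def looks_like_html(url):
    # Read only the first couple of KB; the document root appears early.
    body = urllib2.urlopen(url).read(2048)
    soup = BeautifulSoup(body, 'html.parser')
    # BeautifulSoup tolerates malformed input, so treat a found <html>
    # element as a strong hint rather than proof.
    return soup.find('html') is not None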
Beware of using requests.get like this:
import requests
r = requests.get(url)
print(r.headers['content-type'])
This takes a long time, and my network monitor shows a sustained load, leading me to believe it is downloading the entire file and not just the header.
On the other hand,
import requests
r = requests.head(url)
print(r.headers['content-type'])
gets the header only.
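As an aside (my note, not part of the original answer): some servers reject HEAD requests, and in that case requests.get with stream=True also fetches only the headers, deferring the body until you actually read it:

import requests

r = requests.get(url, stream=True)   # url as above; headers arrive, body is deferred
print(r.headers['content-type'])
r.close()                            # close without ever reading the body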

Don't bother with what the standard library throws at you; instead, try requests:
>>> import requests
>>> r = requests.get("http://www.google.com")
>>> r.headers['content-type']
'text/html; charset=ISO-8859-1'
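Putting the two answers together, a small helper might look like this (the name is_html and the fallback behavior are my own choices, not from either answer):

import requests

def is_html(url):
    # HEAD keeps the transfer small; follow redirects so the final
    # resource is tested, not an intermediate 3xx response.
    try:
        r = requests.head(url, allow_redirects=True)
    except requests.RequestException:
        return False
    return 'html' in r.headers.get('content-type', '')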

Related

Python Requests Library - Scraping separate JSON and HTML responses from POST request

I'm new to web scraping, programming, and StackOverflow, so I'll try to phrase things as clearly as I can.
I'm using the Python requests library to try to scrape some info from a local movie theatre chain. When I look at the Chrome developer tools response/preview tabs in the network section, I can see what appears to be very clean and useful JSON:
However, when I try to use requests to obtain this same info, instead I get the entire page content (pages upon pages of html). Upon further inspection of the cascade in the Chrome developer tools, I can see there are two events called GetNowPlayingByCity: One contains the JSON info while the other seems to be the HTML.
[screenshot: JSON response]
[screenshot: HTML response]
How can I separate the two and only obtain the JSON response using the Python requests library?
I have already tried modifying the headers within requests.post (the Chrome developer tools indicate this is a POST method) to include "Accept: application/json, text/plain, */*", but I didn't see any difference in the response. As it stands, I can't parse any JSON from the response I get with requests.post; I get the following error:
"json.decoder.JSONDecodeError: Expecting value: line 4 column 1 (char 3)"
I can always try to parse the full HTML, but it's so long and complex I would much rather work with friendly JSON info. Any help would be much appreciated!
This is probably because the JavaScript the page sends to your browser is making a request to an API to get the JSON info about the movies.
You could either send the request directly to their API (see Edit 2), parse the HTML with a library like Beautiful Soup, or use a dedicated scraping library in Python. I've had great experiences with Scrapy; it is much faster than requests.
Edit:
If the page uses dynamically loaded content, which I think is the case, you'd have to use Selenium with the PhantomJS browser instead of requests. Here is an example:
from bs4 import BeautifulSoup
from selenium import webdriver
url = "your url"
browser = webdriver.PhantomJS()
browser.get(url)
html = browser.page_source
soup = BeautifulSoup(html, 'lxml')
# Then parse the html code here
Or you could load the dynamic content with Scrapy.
I recommend the latter if you want to get into scraping. It takes a bit more time to learn, but it is a better solution.
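(A note of mine: PhantomJS has since been deprecated and newer Selenium releases drop support for it; a headless Chrome does the same job, assuming chromedriver is installed.)

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

opts = Options()
opts.add_argument('--headless')           # no browser window
browser = webdriver.Chrome(options=opts)  # assumes chromedriver is on PATH
browser.get(url)
soup = BeautifulSoup(browser.page_source, 'lxml')
browser.quit()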
Edit 2:
To make a request directly to their API, you can just reproduce the request you see. Using Google Chrome, you can see the request if you click on it and go to 'Headers':
After that, you simply reproduce the request using the requests library:
import requests
import json
url = 'http://paste.the.url/?here='
response = requests.get(url)
content = response.content
# in my case content was a byte string
# (it looks like b'data' instead of 'data' when you print it)
# if that is your case too, convert it to a string, like so
content_string = content.decode()
content_json = json.loads(content_string)
# do whatever you like with the data
You can modify the URL as you see fit. For example, if it is something like http://api.movies.com/?page=1&movietype=3, you could change movietype=3 to movietype=2 to see a different type of movie, and so on.
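As a shortcut (my note): requests can decode a JSON body directly, which skips the manual byte-string handling above:

import requests

response = requests.get(url)
content_json = response.json()   # raises an error if the body is not valid JSON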

How can I get a whole web page, including the fragment?

I've tried with the urllib and requests libraries, but the data in the fragment was not written to the .html file. Help me please :(
Here it is with requests:
import requests
url = 'https://xxxxxxxxxxx.co.jp/InService/delivery/#/V=2/partsList/Element.PartsList%3A%3AVj0xfnsicklkIjoiQzEtQlVMTERPWkVSLUxfSVNfQzNfLl9CVUxMRE9aRVItTF8uXzgwXy5fRDg1RVNTLTJfLl9LSSIsIm9wIjpbIkMxLUJVTExET1pFUi1MX0lTX0MzXy5fQlVMTERPWkVSLUxfLl84MF8uX0Q4NUVTUy0yXy5fS0kiLCJJU19QQl8uX0Q4NUVTUy0yXy5fS0ktMDAwMDMiLCJJU19QQl8uX0Q4NUVTUy0yXy5fS0ktMDAwMDNfLl9BMCIsIlBMX0MxLUJVTExET1pFUi1MX0FDXy5fRDg1RVNTLTJfLl9LSS0wMDAwM18uX0EwMDEwMDEwIl0sIm5uIjoyMTQsInRzIjoxNTc5ODM0OTIwMDE5fQ?filterId=Product%3A%3AVj0xfnsicklkIjoiUk9PVCBQUk9EVUNUIiwib3AiOlsiUk9PVCBQUk9EVUNUIiwiQzEtQlVMTERPWkVSLUwiLCJDMl8uX0JVTExET1pFUi1MXy5fODAiLCJDM18uX0JVTExET1pFUi1MXy5fODBfLl9EODVFU1MtMl8uX0tJIl0sIm5uIjo2OTcsInRzIjoxNTc2NTY0MjMwMDg1fQ&bomFilterState=false'
response = requests.get(url)
print(response)
And here it is with urllib:
url = 'https://xxxxxxx.co.jp/InService/delivery/?view=print#/V=2/partsList/Element.PartsList::Vj0xfnsicklkIjoiQzEtQlVMTERPWkVSLUxfSVNfQzNfLl9CVUxMRE9aRVItTF8uXzgwXy5fRDg1RVNTLTJfLl9LSSIsIm9wIjpbIkMxLUJVTExET1pFUi1MX0lTX0MzXy5fQlVMTERPWkVSLUxfLl84MF8uX0Q4NUVTUy0yXy5fS0kiLCJJU19QQl8uX0Q4NUVTUy0yXy5fS0ktMDAwMDMiLCJJU19QQl8uX0Q4NUVTUy0yXy5fS0ktMDAwMDNfLl9BMCIsIlBMX0MxLUJVTExET1pFUi1MX0FDXy5fRDg1RVNTLTJfLl9LSS0wMDAwM18uX0EwMDEwMDIwIl0sIm5uIjoyMjUsInRzIjoxNTgwMDk1MDYzNjIyfQ?filterId=Product::Vj0xfnsicklkIjoiUk9PVCBQUk9EVUNUIiwib3AiOlsiUk9PVCBQUk9EVUNUIiwiQzEtQlVMTERPWkVSLUwiLCJDMl8uX0JVTExET1pFUi1MXy5fODAiLCJDM18uX0JVTExET1pFUi1MXy5fODBfLl9EODVFU1MtMl8uX0tJIl0sIm5uIjo2OTcsInRzIjoxNTc2NTY0MjMwMDg1fQ&bomFilterState=false'
import base64
import urllib.request

request = urllib.request.Request(url)
string = '%s:%s' % ('xx','xx')
base64string = base64.standard_b64encode(string.encode('utf-8'))
request.add_header("Authorization", "Basic %s" % base64string.decode('utf-8'))
u = urllib.request.urlopen(request)
webContent = u.read()
Here is the home page of the site (URL: https://xxxxxx.co.jp/InService/delivery/#/V=2/home), and here is the page I want the data from (URL: https://xxxxxxx.co.jp/InService/delivery/?view=print#/V=2/partsList/Element.PartsList::Vj0xfnsicklkIjoiQzE...).
So every time I request the page like in the second picture, the HTML content I get is the HTML in the first picture, because the second picture is the fragment.
If all you would like is the HTML of the webpage, just use requests as you did in the first example, except instead of print(response) use print(response.content).
To save it into a file use:
import requests
url = 'https://xxxxxxx.co.jp/InService/delivery/?view=print#/V=2/partsList/Element.PartsList::Vj0xfnsicklkIjoiQzEtQlVMTERPWkVSLUxfSVNfQzNfLl9CVUxMRE9aRVItTF8uXzgwXy5fRDg1RVNTLTJfLl9LSSIsIm9wIjpbIkMxLUJVTExET1pFUi1MX0lTX0MzXy5fQlVMTERPWkVSLUxfLl84MF8uX0Q4NUVTUy0yXy5fS0kiLCJJU19QQl8uX0Q4NUVTUy0yXy5fS0ktMDAwMDMiLCJJU19QQl8uX0Q4NUVTUy0yXy5fS0ktMDAwMDNfLl9BMCIsIlBMX0MxLUJVTExET1pFUi1MX0FDXy5fRDg1RVNTLTJfLl9LSS0wMDAwM18uX0EwMDEwMDIwIl0sIm5uIjoyMjUsInRzIjoxNTgwMDk1MDYzNjIyfQ?filterId=Product::Vj0xfnsicklkIjoiUk9PVCBQUk9EVUNUIiwib3AiOlsiUk9PVCBQUk9EVUNUIiwiQzEtQlVMTERPWkVSLUwiLCJDMl8uX0JVTExET1pFUi1MXy5fODAiLCJDM18uX0JVTExET1pFUi1MXy5fODBfLl9EODVFU1MtMl8uX0tJIl0sIm5uIjo2OTcsInRzIjoxNTc2NTY0MjMwMDg1fQ&bomFilterState=false'
with open("output.html", 'w+') as f:
response = requests.get(url)
f.write(response.content)
If you need a certain part of the webpage, use BeautifulSoup.
import requests
from bs4 import BeautifulSoup
url = 'https://xxxxxxx.co.jp/InService/delivery/?view=print#/V=2/partsList/Element.PartsList::Vj0xfnsicklkIjoiQzEtQlVMTERPWkVSLUxfSVNfQzNfLl9CVUxMRE9aRVItTF8uXzgwXy5fRDg1RVNTLTJfLl9LSSIsIm9wIjpbIkMxLUJVTExET1pFUi1MX0lTX0MzXy5fQlVMTERPWkVSLUxfLl84MF8uX0Q4NUVTUy0yXy5fS0kiLCJJU19QQl8uX0Q4NUVTUy0yXy5fS0ktMDAwMDMiLCJJU19QQl8uX0Q4NUVTUy0yXy5fS0ktMDAwMDNfLl9BMCIsIlBMX0MxLUJVTExET1pFUi1MX0FDXy5fRDg1RVNTLTJfLl9LSS0wMDAwM18uX0EwMDEwMDIwIl0sIm5uIjoyMjUsInRzIjoxNTgwMDk1MDYzNjIyfQ?filterId=Product::Vj0xfnsicklkIjoiUk9PVCBQUk9EVUNUIiwib3AiOlsiUk9PVCBQUk9EVUNUIiwiQzEtQlVMTERPWkVSLUwiLCJDMl8uX0JVTExET1pFUi1MXy5fODAiLCJDM18uX0JVTExET1pFUi1MXy5fODBfLl9EODVFU1MtMl8uX0tJIl0sIm5uIjo2OTcsInRzIjoxNTc2NTY0MjMwMDg1fQ&bomFilterState=false'
response = BeautifulSoup(requests.get(url).content, 'html.parser')
Use inspect element and find the tag of the table that you want, as in the second image, e.g. https://imgur.com/a/pGbCCFy.
Then use:
found = response.find('div', attrs={"class":"x-carousel__body no-scroll"}).find_all('ul')
That selector is for the eBay example I linked above.
This should return that table which you can then do whatever you like with.
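A possible follow-up on that result (my sketch; the class name above comes from the eBay example):

for ul in found:
    for li in ul.find_all('li'):
        print(li.get_text(strip=True))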

Scraping website in which html is injected with javascript

I am trying to get the url and sneaker titles at https://stockx.com/sneakers.
This is my code so far:
In main.py:
from bs4 import BeautifulSoup
from utils import generate_request_header
import requests
url = "https://stockx.com/sneakers"
html = requests.get(url, headers=generate_request_header()).content
soup = BeautifulSoup(html, "lxml")
print soup
In utils.py:
import random

# BASE_REQUEST_HEADER and USER_AGENT_HEADER_LIST are assumed to be
# defined elsewhere in utils.py
def generate_request_header():
    header = BASE_REQUEST_HEADER
    header["User-Agent"] = random.choice(USER_AGENT_HEADER_LIST)
    return header
But whenever I print soup, I get the following output: https://pastebin.com/Ua6B6241. There doesn't seem to be any HTML extracted. How would I get it? Should I be using something like Selenium?
requests doesn't seem to be able to verify the SSL certificates. To temporarily bypass this error, you can use verify=False, i.e.:
requests.get(url, headers=generate_request_header(), verify=False)
To fix it permanently, you may want to read:
http://docs.python-requests.org/en/master/user/advanced/#ssl-cert-verification
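For reference (my addition): verify can also point at a CA bundle instead of being disabled, which keeps certificate checking on:

import requests

# the path is illustrative; point it at your actual CA bundle
r = requests.get(url, headers=generate_request_header(), verify='/path/to/ca-bundle.crt')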
I'm guessing the data you're looking for is at line 126 in the pastebin. I've never tried to extract the text of a script, but I'm sure it could be done.
In lxml, something like:
source_code.xpath('//script[@type="text/javascript"]') should return a list of all the scripts as objects.
Or, to try to get straight to the "tickers":
[i for i in source_code.xpath('//script[@type="text/javascript"]') if 'tickers' in i.xpath('string()')]
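A fuller sketch of that approach (my code; 'tickers' is this answer's guess at a marker string in the page's inline JavaScript):

import requests
from lxml import html

source_code = html.fromstring(requests.get('https://stockx.com/sneakers').content)
# text() returns the raw script bodies as strings
scripts = source_code.xpath('//script[@type="text/javascript"]/text()')
candidates = [s for s in scripts if 'tickers' in s]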

HTML source code of https pages different when fetched manually vs. with HTTPConnection

I'm new to Python and I've been trying to get the HTML source code of https pages. Thanks to a previous question, I am now able to extract part of the source code, but not as much as when I manually open the page and look at the source.
Is there a simple way to fetch, with Python, the entire code that I see when I open the source of an HTTPS page manually?
Here's the code I'm currently using:
import http.client
from urllib.parse import urlparse
url = "https://www.google.ca/?gfe_rd=cr&ei=u6d_VbzoMaei8wfE1oHgBw&gws_rd=ssl#q=test"
p = urlparse(url)
conn = http.client.HTTPConnection(p.netloc)
conn.request('GET', p.path)
resp = conn.getresponse()
text_file = open("google_test_python.txt", "wb")
for i in resp:
    text_file.write(i)
text_file.close()
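For comparison, a sketch of the same fetch over TLS (the URL above is https, but the code opens a plain HTTPConnection and drops the query string; the fragment after # is never sent to the server either way):

import http.client
from urllib.parse import urlparse

url = "https://www.google.ca/?gfe_rd=cr&ei=u6d_VbzoMaei8wfE1oHgBw&gws_rd=ssl#q=test"
p = urlparse(url)
conn = http.client.HTTPSConnection(p.netloc)   # TLS, matching the https scheme
conn.request('GET', p.path + '?' + p.query)    # keep the query string
resp = conn.getresponse()
with open("google_test_python.txt", "wb") as f:
    f.write(resp.read())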

How to create a Python script that grabs text from one site and reposts it to another?

I would like to create a Python script that grabs digits of Pi from this site:
http://www.piday.org/million.php
and reposts them to this site:
http://www.freelove-forum.com/index.php
I am NOT spamming or playing a prank; it is an inside joke with the creator and webmaster, a belated Pi Day celebration, if you will.
Import urllib2 and BeautifulSoup:
import urllib2
from BeautifulSoup import BeautifulSoup
Specify the URL and fetch it using urllib2:
url = 'http://www.piday.org/million.php'
response = urllib2.urlopen(url)
Then use BeautifulSoup, which parses the page's tags into a tree that you can query for the tags containing the data you want to extract:
soup = BeautifulSoup(response)
pi = soup.findAll('TAG')
where 'TAG' is the relevant tag that identifies where the pi is.
Specify what you want to write out (note that findAll returns a list of tags, so join their text first):
out = '<html><body>' + ''.join(str(t) for t in pi) + '</body></html>'
You can then write this to an HTML file that you serve, using Python's built-in file operations.
f = open('file.html', 'w')
f.write(out)
f.close()
You then serve the file 'file.html' using your webserver.
If you don't want to use BeautifulSoup, you could use re and urllib, but it is not as 'pretty' as BeautifulSoup.
When you submit a post, it is sent to the server as a POST request. Look at the form markup on the site:
<form action="enter.php" method="post">
<textarea name="post">Enter text here</textarea>
</form>
You are going to send a POST request with a parameter named post (bad naming, IMO), which contains your text.
As for the site you are grabbing from, if you look at the source code, the Pi is actually inside of an <iframe> with this URL:
http://www.piday.org/includes/pi_to_1million_digits_v2.html
Looking at that source code, you can see that the page is just a single <p> tag directly descending from a <body> tag (the site doesn't have the <!DOCTYPE>, but I'll include one):
<!DOCTYPE html>
<html>
<head>
...
</head>
<body>
<p>3.1415926535897932384...</p>
</body>
</html>
HTML is structured markup, so you will need a proper parser for it. I use BeautifulSoup, as it works very well with malformed or invalid markup, and even better with perfectly valid HTML.
To download the actual page, which you would feed into the parser, you can use Python's built-in urllib2. For the POST request, I'd use Python's standard httplib.
So a complete example would be this:
import urllib, httplib
from BeautifulSoup import BeautifulSoup
# Downloads and parses the webpage with Pi
page = urllib.urlopen('http://www.piday.org/includes/pi_to_1million_digits_v2.html')
soup = BeautifulSoup(page)
# Extracts the Pi. There's only one <p> tag, so just select the first one
pi_list = soup.findAll('p')[0].contents
pi = ''.join(str(s).replace('\n', '') for s in pi_list).replace('<br />', '')
# Creates the POST request's body. Still bad object naming on the creator's part...
parameters = urllib.urlencode({'post': pi,
                               'name': 'spammer',
                               'post_type': 'confession',
                               'school': 'all'})
# Crafts the POST request's header.
headers = {'Content-type': 'application/x-www-form-urlencoded',
           'Accept': 'text/plain'}
# Creates the connection to the website
connection = httplib.HTTPConnection('freelove-forum.com:80')
connection.request('POST', '/enter.php', parameters, headers)
# Sends it out and gets the response
response = connection.getresponse()
print response.status, response.reason
# Finishes the connections
data = response.read()
connection.close()
But if you are using this for a malicious purpose, do know that the server logs all IP addresses.
You could use the urllib2 module, which comes with every Python distribution.
It allows you to open a URL as if you were opening a file on the filesystem. So you can fetch the Pi data with
pi_million_file = urllib2.urlopen("http://www.piday.org/million.php")
and parse the resulting file, which will be the HTML code of the webpage you see in your browser.
Then you should POST the Pi to the right URL on your website.
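A sketch of that POST with urllib2 (the post field name and the enter.php endpoint come from the form shown in the earlier answer):

import urllib
import urllib2

# assume `pi` already holds the digits parsed from the page above
body = urllib.urlencode({'post': pi})
# passing a data argument makes urlopen issue a POST instead of a GET
urllib2.urlopen('http://www.freelove-forum.com/enter.php', body)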
