How to pass arguments for get method with urllib? - python

The response web page is as below when I select "title" and input "wordpress".
Here is my Python 3 code to pass arguments for the GET method.
import urllib.request
import urllib.parse
url = 'http://www.it-ebooks.info/'
values = {'q': 'wordpress','type': 'title'}
data = urllib.parse.urlencode(values).encode(encoding='utf-8',errors='ignore')
headers = { 'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0' }
request = urllib.request.Request(url=url, data=data,headers=headers,method='GET')
response = urllib.request.urlopen(request)
buff = response.read()
html = buff.decode("utf8")
print(html)
I can't get the desired output web page.
How do I pass arguments for the GET method with urllib in my example?

The data kwarg of urllib.request.Request is only used for POST requests, as it modifies the request's body.
GET requests simply use URL parameters, so you should append these to the URL:
params = '?q=wordpress&type=title'
url = 'http://www.it-ebooks.info/search/{}'.format(params)
You can of course take the time and generalize this into a generic function.
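A minimal sketch of such a helper, assuming you want to stay with urllib (the function name get is just for illustration):
import urllib.parse
import urllib.request

def get(url, params=None, headers=None):
    # encode the parameters and append them to the URL, since a GET request has no body
    if params:
        url = url + '?' + urllib.parse.urlencode(params)
    request = urllib.request.Request(url=url, headers=headers or {}, method='GET')
    with urllib.request.urlopen(request) as response:
        return response.read().decode('utf-8')

html = get('http://www.it-ebooks.info/search/',
           params={'q': 'wordpress', 'type': 'title'},
           headers={'User-Agent': 'Mozilla/5.0'})
print(html)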

It is better to use the requests library:
import requests
headers = {
'DNT': '1',
'Accept-Encoding': 'gzip, deflate, sdch',
'Accept-Language': 'es-ES,es;q=0.8,en;q=0.6',
'Upgrade-Insecure-Requests': '1',
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
'Referer': 'http://www.it-ebooks.info/',
'Connection': 'keep-alive',
}
r = requests.get('http://www.it-ebooks.info/search/?q=wordpress&type=title', headers=headers)
print(r.content)
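Note that requests can also build the query string for you; instead of hard-coding it into the URL you can pass the parameters as a dict (shown here with the same q/type values from the question, reusing the headers dict above):
params = {'q': 'wordpress', 'type': 'title'}
r = requests.get('http://www.it-ebooks.info/search/', params=params, headers=headers)
print(r.content)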

Related

Python program times-out when hitting this website

Why does this function fail to read XML from "https://www.seattletimes.com/feed/"?
I can visit the URL from my browser just fine. It also reads XML from other websites without a problem ("https://news.ycombinator.com/rss").
import urllib.request

def get_url(u):
    header = {'User-Agent': 'Mozilla/5.0'}
    request = urllib.request.Request(url=u, headers=header)
    response = urllib.request.urlopen(request)
    return response.read().decode('utf-8')

url = 'https://www.seattletimes.com/feed/'
feed = get_url(url)
print(feed)
The program times out every time.
Ideas?
Maybe the header needs more info (Accept, etc.)?
EDIT1:
I replaced the request header in the script with my browser's header. Still no-go.
header = {
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'Accept-Encoding': 'gzip, deflate',
'Accept-Language': 'en-US,en;q=0.9',
'Connection': 'keep-alive',
'Upgrade-Insecure-Requests': '1',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36' }
I am not quite sure why, but the header (specifically the user-agent) was confusing the website. If you remove it, your code works just fine. I've tried different header arguments without issues; the user-agent seems to be what causes that behaviour.
import urllib.request

def get_url(u):
    request = urllib.request.Request(url=u)
    response = urllib.request.urlopen(request)
    return response.read().decode('utf-8')

url = 'https://www.seattletimes.com/feed/'
feed = get_url(url)
print(feed)
After some debugging I have found a legal header combination (keep in mind I consider this a bug on their end):
header = {
'User-Agent': 'Mozilla/5.0',
'Cookie': 'PHPSESSID=kfdkdofsdj99g36l443862qeq2',
'Accept-Language': "de-DE,de;q=0.9,en-US;q=0.8,en;q=0.7",}

scrapy with payload request

I'm trying to make a POST request, but I don't know what's wrong with my code; the data doesn't come back.
The following message is displayed:
HTTP status code is not handled or not allowed
This is the website
A screenshot of the header:
This is my code:
import json
import scrapy
class MySpider(scrapy.Spider):
name = 'pb'
payload = {"version":"1.0.0","queries":[{"Query":{"Commands":[{"SemanticQueryDataShapeCommand":{"Query":{"Version":2,"From":[{"Name":"e","Entity":"Events"},{"Name":"d","Entity":"DAX"}],"Select":[{"Column":{"Expression":{"SourceRef":{"Source":"e"}},"Property":"Date Start"},"Name":"Events.Date Start"},{"Column":{"Expression":{"SourceRef":{"Source":"e"}},"Property":"Event Type"},"Name":"Events.Event Type"},{"Column":{"Expression":{"SourceRef":{"Source":"e"}},"Property":"Name"},"Name":"Events.Name"},{"Measure":{"Expression":{"SourceRef":{"Source":"d"}},"Property":"Length"},"Name":"Events.Total Days"},{"Column":{"Expression":{"SourceRef":{"Source":"e"}},"Property":"Location"},"Name":"Events.Location"},{"Column":{"Expression":{"SourceRef":{"Source":"e"}},"Property":"Link to Event"},"Name":"Events.Link to Event"},{"Measure":{"Expression":{"SourceRef":{"Source":"d"}},"Property":"Days Until Event"},"Name":"DAX.Days Until"},{"Column":{"Expression":{"SourceRef":{"Source":"e"}},"Property":"Link to Submit"},"Name":"Events.Link to Submit"},{"Measure":{"Expression":{"SourceRef":{"Source":"d"}},"Property":"Event Type Number"},"Name":"DAX.Event Type Number"}],"OrderBy":[{"Direction":1,"Expression":{"Column":{"Expression":{"SourceRef":{"Source":"e"}},"Property":"Date Start"}}}]},"Binding":{"Primary":{"Groupings":[{"Projections":[0,1,2,3,4,5,6,7,8]}]},"DataReduction":{"DataVolume":3,"Primary":{"Window":{"Count":500}}},"Aggregates":[{"Select":3,"Aggregations":[{"Min":{}},{"Max":{}}]}],"SuppressedJoinPredicates":[8],"Version":1}}}]},"CacheKey":"{\"Commands\":[{\"SemanticQueryDataShapeCommand\":{\"Query\":{\"Version\":2,\"From\":[{\"Name\":\"e\",\"Entity\":\"Events\"},{\"Name\":\"d\",\"Entity\":\"DAX\"}],\"Select\":[{\"Column\":{\"Expression\":{\"SourceRef\":{\"Source\":\"e\"}},\"Property\":\"Date Start\"},\"Name\":\"Events.Date Start\"},{\"Column\":{\"Expression\":{\"SourceRef\":{\"Source\":\"e\"}},\"Property\":\"Event Type\"},\"Name\":\"Events.Event Type\"},{\"Column\":{\"Expression\":{\"SourceRef\":{\"Source\":\"e\"}},\"Property\":\"Name\"},\"Name\":\"Events.Name\"},{\"Measure\":{\"Expression\":{\"SourceRef\":{\"Source\":\"d\"}},\"Property\":\"Length\"},\"Name\":\"Events.Total Days\"},{\"Column\":{\"Expression\":{\"SourceRef\":{\"Source\":\"e\"}},\"Property\":\"Location\"},\"Name\":\"Events.Location\"},{\"Column\":{\"Expression\":{\"SourceRef\":{\"Source\":\"e\"}},\"Property\":\"Link to Event\"},\"Name\":\"Events.Link to Event\"},{\"Measure\":{\"Expression\":{\"SourceRef\":{\"Source\":\"d\"}},\"Property\":\"Days Until Event\"},\"Name\":\"DAX.Days Until\"},{\"Column\":{\"Expression\":{\"SourceRef\":{\"Source\":\"e\"}},\"Property\":\"Link to Submit\"},\"Name\":\"Events.Link to Submit\"},{\"Measure\":{\"Expression\":{\"SourceRef\":{\"Source\":\"d\"}},\"Property\":\"Event Type Number\"},\"Name\":\"DAX.Event Type Number\"}],\"OrderBy\":[{\"Direction\":1,\"Expression\":{\"Column\":{\"Expression\":{\"SourceRef\":{\"Source\":\"e\"}},\"Property\":\"Date Start\"}}}]},\"Binding\":{\"Primary\":{\"Groupings\":[{\"Projections\":[0,1,2,3,4,5,6,7,8]}]},\"DataReduction\":{\"DataVolume\":3,\"Primary\":{\"Window\":{\"Count\":500}}},\"Aggregates\":[{\"Select\":3,\"Aggregations\":[{\"Min\":{}},{\"Max\":{}}]}],\"SuppressedJoinPredicates\":[8],\"Version\":1}}}]}","QueryId":"","ApplicationContext":{"DatasetId":"6427f3c6-42f6-4287-b061-c31c1d2e7ae0","Sources":[{"ReportId":"6e442642-8594-4894-bc32-0ab7f4620772"}]}}],"cancelQueries":[],"modelId":1226835}
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36'
def start_requests(self):
yield scrapy.Request(
url='https://wabi-australia-southeast-api.analysis.windows.net/public/reports/querydata?synchronous=true',
method='POST',
body=json.dumps(self.payload),
headers={
'Accept-Language': 'pt-BR,pt;q=0.9,en;q=0.8',
'ActivityId': '1d3ecdc2-5dc0-801e-4140-82a258f127a6',
'Connection': 'keep-alive',
'Content-Length': '3462',
'Content-Type': 'application/json;charset=UTF-8',
'Host': 'wabi-australia-southeast-api.analysis.windows.net',
'Origin': 'https://app.powerbi.com',
'Referer': 'https://app.powerbi.com/view?r=eyJrIjoiMGIwNTY2MjgtMzJhYy00MzEwLTk5MDAtYTI2MGVlMzk1NjM2IiwidCI6IjZmMGU5YzQyLTk2Y2UtNDU1MS05NzAxLWJhMzFkMGQ2ZDE5ZSJ9',
'RequestId': '11c18fe6-00da-7df4-952c-98ba7bdf188e',
'Sec-Fetch-Mode': 'cors',
'Sec-Fetch-Site': 'cross-site',
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36',
'X-PowerBI-ResourceKey': '0b056628-32ac-4310-9900-a260ee395636'
}
)
def parse(self, response):
items = json.loads(response.text)
yield {"data":items}
The request in your screenshot is a GET request.
The behaviour of this website is very interesting!
Let's examine it.
By looking at the network panel we can see that a GET request is being made to some complex URL with many different headers. However, it seems that the header X-PowerBI-ResourceKey is the only one that's needed, and it controls what content the request will return.
So all we need to do to replicate this is to find the X-PowerBI-ResourceKey value.
If you take a look at the source code of the html page:
https://app.powerbi.com/view?r=eyJrIjoiMGIwNTY2MjgtMzJhYy00MzEwLTk5MDAtYTI2MGVlMzk1NjM2IiwidCI6IjZmMGU5YzQyLTk2Y2UtNDU1MS05NzAxLWJhMzFkMGQ2ZDE5ZSJ9
Here we can see that JavaScript's atob method is used on the url parameter. This is JavaScript's base64-decode function. We can run the equivalent in Python:
$ ptpython
>>> from base64 import b64decode
>>> b64decode("eyJrIjoiMGIwNTY2MjgtMzJhYy00MzEwLTk5MDAtYTI2MGVlMzk1NjM2IiwidCI6IjZmMGU5YzQyLTk2Y2UtNDU1MS05NzAxLWJhMzF
1 kMGQ2ZDE5ZSJ9")
b'{"k":"0b056628-32ac-4310-9900-a260ee395636","t":"6f0e9c42-96ce-4551-9701-ba31d0d6d19e"}'
We got it figured out! Now let's put everything together in our crawler:
import json
from base64 import b64decode
from scrapy import Request
from w3lib.url import url_query_parameter

def parse(self, response):
    url = "https://app.powerbi.com/view?r=eyJrIjoiMGIwNTY2MjgtMzJhYy00MzEwLTk5MDAtYTI2MGVlMzk1NjM2IiwidCI6IjZmMGU5YzQyLTk2Y2UtNDU1MS05NzAxLWJhMzFkMGQ2ZDE5ZSJ9"
    # get the "r" parameter from the url
    resource_key = url_query_parameter(url, 'r')
    # base64 decode it
    resource_key = b64decode(resource_key)
    # {'k': '0b056628-32ac-4310-9900-a260ee395636', 't': '6f0e9c42-96ce-4551-9701-ba31d0d6d19e'}
    # it's a json string - load it and get key "k"
    resource_key = json.loads(resource_key)['k']
    headers = {
        'Accept': "application/json, text/plain, */*",
        # 'X-PowerBI-ResourceKey': "0b056628-32ac-4310-9900-a260ee395636",
        'X-PowerBI-ResourceKey': resource_key,
        'User-Agent': "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36",
        'Accept-Encoding': "gzip, deflate, br",
        'Accept-Language': "en-US,en;q=0.9,lt;q=0.8,et;q=0.7,de;q=0.6",
    }
    yield Request(url, headers=headers)
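Note that the yielded Request still needs a callback to consume the response; a minimal sketch, assuming a hypothetical parse_data method (added as callback=self.parse_data on the Request above) and that the response is the JSON shown in the question:
def parse_data(self, response):
    # treat the response as JSON, as in the question's original parse, and pass it through as an item
    yield {"data": json.loads(response.text)}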

Pandas read_csv from URL and include request header

As of Pandas 0.19.2, the function read_csv() can be passed a URL. See, for example, this answer:
import pandas as pd
url="https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv"
c=pd.read_csv(url)
The URL I'd like to use is: https://moz.com/top500/domains/csv
With the above code, this URL returns an error:
urllib2.HTTPError: HTTP Error 403: Forbidden
Based on this post, I can get a valid response by passing a request header:
import urllib2,cookielib
site= "https://moz.com/top500/domains/csv"
hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
'Accept-Encoding': 'none',
'Accept-Language': 'en-US,en;q=0.8',
'Connection': 'keep-alive'}
req = urllib2.Request(site, headers=hdr)
try:
    page = urllib2.urlopen(req)
except urllib2.HTTPError, e:
    print (e.fp.read())
content = page.read()
print (content)
Is there any way to use the web URL functionality of Pandas read_csv(), but also pass a request header to make the request go through?
I would recommend using the requests and io libraries for this task. The following code should do the job:
import pandas as pd
import requests
from io import StringIO
url = "https://moz.com:443/top500/domains/csv"
headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:66.0) Gecko/20100101 Firefox/66.0"}
req = requests.get(url, headers=headers)
data = StringIO(req.text)
df = pd.read_csv(data)
print(df)
(If you want to add a custom header, just modify the headers variable.)
Hope this helps
As of pandas 1.3.0, you can pass custom HTTP(S) headers using the storage_options argument:
url = "https://moz.com:443/top500/domains/csv"
hdr = {
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
'Accept-Encoding': 'none',
'Accept-Language': 'en-US,en;q=0.8',
'Connection': 'keep-alive'
}
domains_df = pd.read_csv(url, storage_options=hdr)

POST requests using cookie from session

I am trying to scrape a website using a POST request to fill in the form:
http://www.planning2.cityoflondon.gov.uk/online-applications/search.do?action=advanced
In Python, this goes as follows:
import requests
import webbrowser
headers = {
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
'Accept-Encoding': 'gzip, deflate',
'Accept-Language': 'en-US,en;q=0.9',
'Cache-Control': 'max-age=0',
'Connection': 'keep-alive',
'Cookie': 'JSESSIONID=OwXG0Hkxj+X9ELygHZa-aLQ5.undefined; _ga=GA1.3.1911942552.',
'Content-Type': 'application/x-www-form-urlencoded',
'Host': 'www.planning2.cityoflondon.gov.uk',
'Origin': 'http://www.planning2.cityoflondon.gov.uk',
'Referer': 'http://www.planning2.cityoflondon.gov.uk/online-applications/search.do?action=advanced',
'Upgrade-Insecure-Requests': '1',
'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
}
data = {
'searchCriteria.developmentType': '002',
'date(applicationReceivedStart)': '01/08/2000',
'date(applicationReceivedEnd)': '01/08/2018'
}
url = 'http://www.planning2.cityoflondon.gov.uk/online-applications/advancedSearchResults.do?action=firstPage'
test_file = 'planning_app.html'
with requests.Session() as session:
    r = session.post(url, headers=headers, data=data)
with open(test_file, 'w') as file:
    file.write(r.text)
webbrowser.open(test_file)
As you can see from the page reopened with webbrowser, this gives an outdated-cookie error.
For this to work I would need to manually go to the webpage, perform a query with the Chrome inspect panel open on the network tab, look at the cookie in the request headers, and copy-paste the cookie into my code. This would work until, of course, the cookie expires again.
I tried to automate that retrieval of the cookie by doing the following:
headers_get = {
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
'Accept-Encoding': 'gzip, deflate',
'Accept-Language': 'en-US,en;q=0.9',
'Cache-Control': 'max-age=0',
'Connection': 'keep-alive',
'Host': 'www.planning2.cityoflondon.gov.uk',
'Upgrade-Insecure-Requests': '1',
'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'
}
with requests.Session() as session:
    c = session.get('http://www.planning2.cityoflondon.gov.uk/online-applications/', headers=headers_get)
    headers['Cookie'] = 'JSESSIONID=' + list(c.cookies.get_dict().values())[0]
    r = session.post(url, headers=headers, data=data)
with open(test_file, 'w') as file:
    file.write(r.text)
webbrowser.open(test_file)
I would expect this to work, as it is simply automating what I do manually:
go to the page of the GET request, get the cookie from it, and add said cookie to the headers dict of the POST request.
However, I still receive the 'server error' page from the POST request.
Would anyone be able to help me understand why this happens?
requests.post accepts a cookies parameter. Using it instead of sending the cookie directly in the header may fix the problem:
with requests.Session() as session:
    c = session.get('http://www.planning2.cityoflondon.gov.uk/online-applications/', headers=headers_get)
    # Also, you can set cookies=session.cookies
    r = session.post(url, headers=headers, data=data, cookies=c.cookies)
Basically, I suppose there may be some JavaScript logic on the site which isn't executed when using requests.post. If that's the case, you would have to use Selenium to fill in and submit the form.
Please see Dynamic Data Web Scraping with Python, BeautifulSoup, which has a similar problem - JavaScript not executed.
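If JavaScript does turn out to be the problem, here is a rough Selenium sketch; the element names are assumptions taken from the POST data above and would need to be checked against the actual search form:
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('http://www.planning2.cityoflondon.gov.uk/online-applications/search.do?action=advanced')
# fill in the date range; field names are guesses based on the POST data in the question
driver.find_element(By.NAME, 'date(applicationReceivedStart)').send_keys('01/08/2000')
driver.find_element(By.NAME, 'date(applicationReceivedEnd)').send_keys('01/08/2018')
driver.find_element(By.CSS_SELECTOR, 'input[type="submit"]').click()
html = driver.page_source
driver.quit()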

Authentication Trouble with Python Requests [duplicate]

I want to scrape the PIN codes from "http://www.indiapost.gov.in/pin/". I am doing it with the following code.
import urllib
import urllib2
headers = {
'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Origin': 'http://www.indiapost.gov.in',
'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.57 Safari/537.17',
'Content-Type': 'application/x-www-form-urlencoded',
'Referer': 'http://www.indiapost.gov.in/pin/',
'Accept-Encoding': 'gzip,deflate,sdch',
'Accept-Language': 'en-US,en;q=0.8',
'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3'
}
viewstate = 'JulXDv576ZUXoVOwThQQj4bDuseXWDCZMP0tt+HYkdHOVPbx++G8yMISvTybsnQlNN76EX/...'
eventvalidation = '8xJw9GG8LMh6A/b6/jOWr970cQCHEj95/6ezvXAqkQ/C1At06MdFIy7+iyzh7813e1/3Elx...'
url = 'http://www.indiapost.gov.in/pin/'
formData = (
('__EVENTVALIDATION', eventvalidation),
('__EVENTTARGET',''),
('__EVENTARGUMENT',''),
('__VIEWSTATE', viewstate),
('__VIEWSTATEENCRYPTED',''),
('__EVENTVALIDATION', eventvalidation),
('txt_offname',''),
('ddl_dist','0'),
('txt_dist_on',''),
('ddl_state','2'),
('btn_state','Search'),
('txt_stateon',''),
('hdn_tabchoice','3')
)
from urllib import FancyURLopener
class MyOpener(FancyURLopener):
    version = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.57 Safari/537.17'
myopener = MyOpener()
encodedFields = urllib.urlencode(formData)
f = myopener.open(url, encodedFields)
print f.info()
try:
    fout = open('tmp.txt', 'w')
except:
    print('Could not open output file\n')
fout.writelines(f.readlines())
fout.close()
I am getting a response from the server saying "Sorry this site has encountered a serious problem, please try reloading the page or contact webmaster."
Please suggest where I am going wrong.
Where did you get the values viewstate and eventvalidation? On the one hand, they shouldn't end with "...", so you must have omitted something. On the other hand, they shouldn't be hard-coded.
One solution is like this:
1. Retrieve the page via the URL "http://www.indiapost.gov.in/pin/" without any form data.
2. Parse and retrieve the form values like __VIEWSTATE and __EVENTVALIDATION (you may make use of BeautifulSoup).
3. Get the search result (second HTTP request) by adding the vital form data from step 2.
UPDATE:
According to the above idea, I modify your code slightly to make it work:
import urllib
from bs4 import BeautifulSoup
headers = {
'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Origin': 'http://www.indiapost.gov.in',
'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.57 Safari/537.17',
'Content-Type': 'application/x-www-form-urlencoded',
'Referer': 'http://www.indiapost.gov.in/pin/',
'Accept-Encoding': 'gzip,deflate,sdch',
'Accept-Language': 'en-US,en;q=0.8',
'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3'
}
class MyOpener(urllib.FancyURLopener):
    version = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.57 Safari/537.17'
myopener = MyOpener()
url = 'http://www.indiapost.gov.in/pin/'
# first HTTP request without form data
f = myopener.open(url)
soup = BeautifulSoup(f)
# parse and retrieve two vital form values
viewstate = soup.select("#__VIEWSTATE")[0]['value']
eventvalidation = soup.select("#__EVENTVALIDATION")[0]['value']
formData = (
('__EVENTVALIDATION', eventvalidation),
('__VIEWSTATE', viewstate),
('__VIEWSTATEENCRYPTED',''),
('txt_offname', ''),
('ddl_dist', '0'),
('txt_dist_on', ''),
('ddl_state','1'),
('btn_state', 'Search'),
('txt_stateon', ''),
('hdn_tabchoice', '1'),
('search_on', 'Search'),
)
encodedFields = urllib.urlencode(formData)
# second HTTP request with form data
f = myopener.open(url, encodedFields)
try:
    # actually we'd better use BeautifulSoup once again to
    # retrieve the results (instead of writing out the whole HTML file).
    # Besides, since the result is split into multiple pages,
    # we need to send more HTTP requests
    fout = open('tmp.html', 'w')
except:
    print('Could not open output file\n')
fout.writelines(f.readlines())
fout.close()
