I would like to access and scrape the data from this link, where:
new_url='https://www.scopus.com/results/results.uri?sort=plf-f&src=s&imp=t&sid=2c816e0ea43cf176a59117097216e6d4&sot=b&sdt=b&sl=160&s=%28TITLE-ABS-KEY%28EEG%29AND+TITLE-ABS-KEY%28%22deep+learning%22%29+AND+DOCTYPE%28ar%29%29+AND+ORIG-LOAD-DATE+AFT+1591735287+AND+ORIG-LOAD-DATE+BEF+1592340145++AND+PUBYEAR+AFT+2018&origin=CompleteResultsEmailAlert&dgcid=raven_sc_search_en_us_email&txGid=cc4809850a0eff92f629c95380f9f883'
Accessing new_url via the following line
req = Request(new_url, headers={'User-Agent': 'Mozilla/5.9'})
produced the error
HTTP Error 302: The HTTP server returned a redirect error that would lead to an infinite loop
A new set of lines was drafted:
import urllib.request
from http.cookiejar import CookieJar
from bs4 import BeautifulSoup as soup

req = urllib.request.Request(new_url, None, {
    'User-Agent': 'Mozilla/5.0 (X11; Linux i686; G518Rco3Yp0uLV40Lcc9hAzC1BOROTJADjicLjOmlr4=) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
    'Accept-Encoding': 'gzip, deflate, sdch',
    'Accept-Language': 'en-US,en;q=0.8',
    'Connection': 'keep-alive'})
cj = CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
raw = opener.open(req).read()
page_soup = soup(raw, 'html.parser')
print(page_soup.prettify())
While no error is thrown, print(page_soup.prettify()) outputs some unrecognized text:
6�>�.�t1k�e�LH�.��]WO�?m�^#�
څ��#�h[>��!�H8����|����n(XbU<~�k�"���#g+�4�Ǻ�Xv�7�UȢB2�
�7�F8�XA��W\�ɚ��^8w��38�#'
SH�<_0�B���oy�5Bނ)E���GPq:�ќU�c���ab�h�$<ra�
;o�Q�a#ð�d\�&J3Τa�����:�I�etf�a���h�$(M�~���ua�$�
n�&9u%ҵ*b���w�j�V��P�D�'z[��������)
with a warning
Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
I suspect this can be resolved by decoding it as UTF-8, which I attempted as below:
req = urllib.request.Request(new_url, None, {
    'User-Agent': 'Mozilla/5.0 (X11; Linux i686; G518Rco3Yp0uLV40Lcc9hAzC1BOROTJADjicLjOmlr4=) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
    'Accept-Encoding': 'gzip, deflate, sdch',
    'Accept-Language': 'en-US,en;q=0.8',
    'Connection': 'keep-alive'})
cj = CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
raw = opener.open(req).read()
with open(raw, 'r', encoding='utf-8') as f:
    page_soup = soup(f, 'html.parser')
    print(page_soup.prettify())
However, the interpreter raises an error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
May I know what the problem is? I would appreciate any insight.
Try using the requests library
import requests
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36"}

with requests.Session() as s:
    r = s.get(new_url, headers=headers)
    soup = BeautifulSoup(r.text, 'lxml')
    print(soup.get_text())
You can still use cookies here; a requests.Session keeps them for you across requests.
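For instance, a minimal sketch of how a Session carries cookies across requests (the second request automatically resends whatever Set-Cookie values the first response produced):

import requests

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36"}

with requests.Session() as s:
    # the session stores cookies from each response
    first = s.get(new_url, headers=headers)
    print(s.cookies.get_dict())  # cookies collected so far
    # later requests in the same session send those cookies back automatically
    second = s.get(new_url, headers=headers)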
Edit: Updated the code to show the use of headers; this tells the website you are a browser rather than a program. For anything involving further login operations, I would suggest Selenium instead of requests.
If you want to use the urllib library, remove Accept-Encoding from the headers (and, for simplicity, specify just utf-8 in Accept-Charset). urllib does not decompress responses for you, so advertising gzip support makes the server send compressed bytes; the gzip magic number is 0x1f 0x8b, which is exactly the 0x8b your UnicodeDecodeError complains about at position 1:
req = urllib.request.Request(new_url, None, {
    'User-Agent': 'Mozilla/5.0 (X11; Linux i686; G518Rco3Yp0uLV40Lcc9hAzC1BOROTJADjicLjOmlr4=) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Charset': 'utf-8;q=0.7,*;q=0.3',
    'Accept-Language': 'en-US,en;q=0.8',
    'Connection': 'keep-alive'})
The result is:
<!DOCTYPE html>
<!-- Form Name: START -->
<html lang="en">
<!-- Template_Component_Name: id.start.vm -->
<head>
<meta charset="utf-8"/>
...etc.
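If you would rather keep the Accept-Encoding header, another option is to decompress the body yourself. A minimal sketch, reusing the CookieJar setup from the question (gzip is in the standard library):

import gzip
import urllib.request
from http.cookiejar import CookieJar
from bs4 import BeautifulSoup

cj = CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
req = urllib.request.Request(new_url, None, {
    'User-Agent': 'Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36',
    'Accept-Encoding': 'gzip'})
response = opener.open(req)
raw = response.read()
# only decompress when the server actually gzipped the body
if response.headers.get('Content-Encoding') == 'gzip':
    raw = gzip.decompress(raw)
page_soup = BeautifulSoup(raw, 'html.parser')
print(page_soup.prettify())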
Related
I can load this webpage in Google Chrome, but I can't access it via requests. Any idea what the compression problem is?
Code:
import requests
url = r'https://www.huffpost.com/entry/sean-hannity-gutless-tucker-carlson_n_60d5806ae4b0b6b5a164633a'
headers = {'Accept-Encoding':'gzip, deflate, compress, br, identity'}
r = requests.get(url, headers=headers)
Result:
ContentDecodingError: ('Received response with content-encoding: gzip, but failed to decode it.', error('Error -3 while decompressing data: incorrect header check'))
Use a user agent that emulates a browser:
import requests
url = r'https://www.huffpost.com/entry/sean-hannity-gutless-tucker-carlson_n_60d5806ae4b0b6b5a164633a'
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36"}
r = requests.get(url, headers=headers)
You're getting a 403 Forbidden error, which you can see using requests.head. Use RJ's suggestion to defeat huffpost's robot blocking.
>>> requests.head(url)
<Response [403]>
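To confirm that the header is what makes the difference, you can compare the two status codes directly (a quick check; the exact codes depend on huffpost's current bot rules):

import requests

url = r'https://www.huffpost.com/entry/sean-hannity-gutless-tucker-carlson_n_60d5806ae4b0b6b5a164633a'
browser_headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36"}

print(requests.head(url).status_code)                           # 403 without a browser User-Agent
print(requests.head(url, headers=browser_headers).status_code)  # 200 once it is set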
I have written a Python script to check whether a website exists. Everything works fine, except when checking http://www.dhl.com, where the request times out. I have tried both GET and HEAD methods. I used https://httpstatus.io/ and https://app.urlcheckr.com/ to check the DHL website and the result is an error. The DHL website DOES exist! Here is my code:
import requests

a = 'http://www.dhl.com'

def check(url):
    try:
        header = {'User-Agent': 'Mozilla/5.0 (X11; CrOS x86_64 8172.45.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.64 Safari/537.36'}
        request = requests.head(url, headers=header, timeout=60)
        code = request.status_code
        if code < 400:
            return "Exist", str(code)
        else:
            return "Not exist", str(code)
    except Exception as e:
        return "Not Exist", str(type(e).__name__)

print(check(a))
How can I resolve this error?
Testing with curl shows you need a couple of other headers for that DHL site
import requests

url = 'http://www.dhl.com'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.135 Safari/537.36',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'en-US,en;q=0.9,fil;q=0.8',
}
request = requests.head(url, headers=headers, timeout=60, allow_redirects=True)
print(request.status_code, request.reason)
print(request.history)
Without these headers, curl never gets a response.
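Folding those headers back into your original check function would look something like this (a sketch; allow_redirects=True is kept from the snippet above in case the site answers HEAD with redirects):

import requests

HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.135 Safari/537.36',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'en-US,en;q=0.9,fil;q=0.8',
}

def check(url):
    try:
        response = requests.head(url, headers=HEADERS, timeout=60, allow_redirects=True)
        if response.status_code < 400:
            return "Exist", str(response.status_code)
        return "Not exist", str(response.status_code)
    except Exception as e:
        return "Not exist", str(type(e).__name__)

print(check('http://www.dhl.com'))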
I have a static .aspx URL that I am trying to scrape. All of my attempts yield the raw HTML of the regular website instead of the data I am querying. My understanding is that the headers I am using (which I found in another post) are correct and generalizable:
import urllib.request
from bs4 import BeautifulSoup

headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.57 Safari/537.17',
    'Content-Type': 'application/x-www-form-urlencoded',
    'Accept-Encoding': 'gzip,deflate,sdch',
    'Accept-Language': 'en-US,en;q=0.8',
    'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3'
}

class MyOpener(urllib.request.FancyURLopener):
    version = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.57 Safari/537.17'

myopener = MyOpener()
url = 'https://www.mytaxcollector.com/trSearch.aspx'
# first HTTP request without form data
f = myopener.open(url)
soup_dummy = BeautifulSoup(f, "html5lib")
# parse and retrieve two vital form values
viewstate = soup_dummy.select("#__VIEWSTATE")[0]['value']
viewstategen = soup_dummy.select("#__VIEWSTATEGENERATOR")[0]['value']
Trying to enter the form data causes nothing to happen:
formData = (
    ('__VIEWSTATE', viewstate),
    ('__VIEWSTATEGENERATOR', viewstategen),
    ('ctl00_contentHolder_trSearchCharactersAPN', '631091430000'),
    ('__EVENTTARGET', 'ct100$MainContent$calculate')
)
encodedFields = urllib.parse.urlencode(formData)
# second HTTP request with form data
f = myopener.open(url, encodedFields)
soup = BeautifulSoup(f, "html5lib")
trans_emissions = soup.find("span", id="ctl00_MainContent_transEmissions")
print(trans_emissions.text)
This gives raw HTML almost exactly the same as the soup_dummy variable. What I want to see is the result of the field ('ctl00_contentHolder_trSearchCharactersAPN', '631091430000') being submitted (this is the "parcel number" box). I would really appreciate the help. If anything, linking me to a good post about HTTP requests (one that not only explains but actually walks through scraping aspx) would be great.
To get the result using the parcel number, your parameters have to be somewhat different from the ones you have already tried. Moreover, you have to send the POST request to this url: https://www.mytaxcollector.com/trSearchProcess.aspx.
Working code:
from urllib.request import Request, urlopen
from urllib.parse import urlencode
from bs4 import BeautifulSoup

url = 'https://www.mytaxcollector.com/trSearchProcess.aspx'
payload = {
    'hidRedirect': '',
    'hidGotoEstimate': '',
    'txtStreetNumber': '',
    'txtStreetName': '',
    'cboStreetTag': '(Any Street Tag)',
    'cboCommunity': '(Any City)',
    'txtParcelNumber': '0108301010000',  # your search term
    'txtPropertyID': '',
    'ctl00$contentHolder$cmdSearch': 'Search'
}
data = urlencode(payload)
data = data.encode('ascii')
req = Request(url, data)
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; ) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36')
res = urlopen(req)
soup = BeautifulSoup(res.read(), 'html.parser')
for items in soup.select("table.propInfoTable tr"):
    data = [item.get_text(strip=True) for item in items.select("td")]
    print(data)
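For comparison, a sketch of the same POST using the requests library (assuming the same form fields and endpoint):

import requests
from bs4 import BeautifulSoup

url = 'https://www.mytaxcollector.com/trSearchProcess.aspx'
payload = {
    'hidRedirect': '',
    'hidGotoEstimate': '',
    'txtStreetNumber': '',
    'txtStreetName': '',
    'cboStreetTag': '(Any Street Tag)',
    'cboCommunity': '(Any City)',
    'txtParcelNumber': '0108301010000',  # your search term
    'txtPropertyID': '',
    'ctl00$contentHolder$cmdSearch': 'Search'
}
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; ) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36'}

res = requests.post(url, data=payload, headers=headers)
soup = BeautifulSoup(res.text, 'html.parser')
for items in soup.select("table.propInfoTable tr"):
    print([item.get_text(strip=True) for item in items.select("td")])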
I get the following error with the code below.
HTTP Error 406: Not Acceptable Python urllib2
This is my first step before I use BeautifulSoup to parse the page.
import urllib2
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
url = "http://www.choicemoney.us/retail.php"
response = opener.open(url)
All help greatly appreciated.
The resource identified by the request is only capable of generating
response entities which have content characteristics not acceptable
according to the accept headers sent in the request. [RFC2616]
Based on the code and what the RFC describes I assume that you need to set both the key and the value of the User-Agent header correctly.
These are correct examples:
Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11
Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.75.14 (KHTML, like Gecko) Version/7.0.3 Safari/7046A194A
Just replace the following.
opener.addheaders = [('User-agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.75.14 (KHTML, like Gecko) Version/7.0.3 Safari/7046A194A')]
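Putting it together, the fixed snippet would be (a sketch; any of the example values above should do):

import urllib2

opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.75.14 (KHTML, like Gecko) Version/7.0.3 Safari/7046A194A')]
url = "http://www.choicemoney.us/retail.php"
response = opener.open(url)
print response.getcode()  # 200 once the User-Agent is acceptable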
I believe @ipinak's answer is correct.
urllib2 actually provides a default User-Agent that works here, so if you delete opener.addheaders = [('User-agent', 'Mozilla/5.0')] the response should have status code 200.
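That is, the minimal sketch below should already come back with a 200 (assuming the server only rejects the bare 'Mozilla/5.0' value):

import urllib2

opener = urllib2.build_opener()  # no addheaders: urllib2 sends its default User-Agent
response = opener.open("http://www.choicemoney.us/retail.php")
print response.getcode()  # expect 200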
I recommend the popular requests library for such jobs as its API is much easier to use.
url = "http://www.choicemoney.us/retail.php"
resp = requests.get(url)
print resp.status_code # 200
print resp.content # can be used in your beautifulsoup.
I am trying to scrape this page:
http://www.nitt.edu/prm/nitreg/ShowRes.aspx
Here is the code:
import urllib
from bs4 import BeautifulSoup

headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Origin': 'http://www.indiapost.gov.in',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.57 Safari/537.17',
    'Content-Type': 'application/x-www-form-urlencoded',
    'Referer': 'http://www.nitt.edu/prm/nitreg/ShowRes.aspx',
    'Accept-Encoding': 'gzip,deflate,sdch',
    'Accept-Language': 'en-US,en;q=0.8',
    'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3'
}

class MyOpener(urllib.FancyURLopener):
    version = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.57 Safari/537.17'

myopener = MyOpener()
url = 'http://www.nitt.edu/prm/nitreg/ShowRes.aspx'
# first HTTP request without form data
f = myopener.open(url)
soup = BeautifulSoup(f)
# parse and retrieve two vital form values
viewstate = soup.findAll("input", {"type": "hidden", "name": "__VIEWSTATE"})
eventvalidation = soup.findAll("input", {"type": "hidden", "name": "__EVENTVALIDATION"})
print viewstate[0]['value']

formData = (
    ('__EVENTVALIDATION', eventvalidation),
    ('__VIEWSTATE', viewstate),
    ('__VIEWSTATEENCRYPTED', ''),
    ('TextBox1', '106110006'),
    ('Button1', 'Show'),
)
encodedFields = urllib.urlencode(formData)
# second HTTP request with form data
f = myopener.open(url, encodedFields)
try:
    # actually we'd better use BeautifulSoup once again to
    # retrieve results (instead of writing out the whole HTML file)
    # Besides, since the result is split into multiple pages,
    # we would need to send more HTTP requests
    fout = open('tmp.html', 'w')
except:
    print('Could not open output file\n')
fout.writelines(f.readlines())
fout.close()
I keep getting a server error:
Source Error:
An unhandled exception was generated during the execution of the current web request. Information regarding the origin and location of the exception can be identified using the exception stack trace below.
Stack Trace:
[FormatException: Invalid character in a Base-64 string.]
System.Convert.FromBase64String(String s) +0
System.Web.UI.LosFormatter.Deserialize(String input) +25
System.Web.UI.Page.LoadPageStateFromPersistenceMedium() +101
[HttpException (0x80004005): Invalid_Viewstate
Client IP: 10.0.0.166
Port: 51915
User-Agent: Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.57 Safari/537.17
ViewState: [<input name="__VIEWSTATE" type="hidden" value="dDwtMTM3NzI1MDM3O3Q8O2w8aTwxPjs+O2w8dDw7bDxpPDE+O2k8Mj47PjtsPHQ8cDxwPGw8VmlzaWJsZTs+O2w8bzxmPjs+Pjs+O2w8aTwxPjtpPDM+Oz47bDx0PDtsPGk8Mz47PjtsPHQ8O2w8aTwwPjs+O2w8dDw7bDxpPDE+Oz47bDx0PEAwPDs7Ozs7Ozs7Ozs+Ozs+Oz4+Oz4+Oz4+O3Q8cDxwPGw8VmlzaWJsZTs+O2w8bzxmPjs+Pjs+Ozs+Oz4+O3Q8O2w8aTw5PjtpPDExPjs+O2w8dDxwPHA8bDxWaXNpYmxlOz47bDxvPGY+Oz4+Oz47Oz47dDx0PHA8cDxsPFZpc2libGU7PjtsPG88Zj47Pj47Pjs7Pjs7Pjs+Pjs+Pjs+Pjs+zHrNhAd1tTLXbBUyAJRtS6omUc0="/>]
Http-Referer:
Path: /prm/nitreg/ShowRes.aspx.]
System.Web.UI.Page.LoadPageStateFromPersistenceMedium() +447
System.Web.UI.Page.LoadPageViewState() +18
System.Web.UI.Page.ProcessRequestMain() +447
Invalid character in a Base-64 string.
What is the problem?
You are using the ViewState input object, not the value.
ViewState: [<input name="__VIEWSTATE" type="hidden" value="dDwtMTM3NzI1MDM3O3Q8O2w8aTwxPjs+O2w8dDw7bDxpPDE+O2k8Mj47PjtsPHQ8cDxwPGw8VmlzaWJsZTs+O2w8bzxmPjs+Pjs+O2w8aTwxPjtpPDM+Oz47bDx0PDtsPGk8Mz47PjtsPHQ8O2w8aTwwPjs+O2w8dDw7bDxpPDE+Oz47bDx0PEAwPDs7Ozs7Ozs7Ozs+Ozs+Oz4+Oz4+Oz4+O3Q8cDxwPGw8VmlzaWJsZTs+O2w8bzxmPjs+Pjs+Ozs+Oz4+O3Q8O2w8aTw5PjtpPDExPjs+O2w8dDxwPHA8bDxWaXNpYmxlOz47bDxvPGY+Oz4+Oz47Oz47dDx0PHA8cDxsPFZpc2libGU7PjtsPG88Zj47Pj47Pjs7Pjs7Pjs+Pjs+Pjs+Pjs+zHrNhAd1tTLXbBUyAJRtS6omUc0="/>]
Your formData should be:
formData = (
    ('__EVENTVALIDATION', eventvalidation[0]['value']),
    ('__VIEWSTATE', viewstate[0]['value']),
    ('__VIEWSTATEENCRYPTED', ''),
    ('TextBox1', '106110006'),
    ('Button1', 'Show'),
)
Note that your eventvalidation value has the same issue; I fixed it too.
EDIT:
The __EVENTVALIDATION field does not exist on that page, so you can just remove __EVENTVALIDATION from formData:
formData = (
    ('__VIEWSTATE', viewstate[0]['value']),
    ('__VIEWSTATEENCRYPTED', ''),
    ('TextBox1', '106110006'),
    ('Button1', 'Show'),
)
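More generally, since ASP.NET pages differ in which hidden fields they emit, it is safer to read each hidden input only if it is present. A sketch with a hypothetical hidden_value helper:

def hidden_value(soup, name):
    # return the value of a hidden input, or None when the page does not emit it
    tag = soup.find("input", {"type": "hidden", "name": name})
    return tag["value"] if tag else None

viewstate = hidden_value(soup, "__VIEWSTATE")
eventvalidation = hidden_value(soup, "__EVENTVALIDATION")  # None on this page
formData = [('__VIEWSTATE', viewstate),
            ('__VIEWSTATEENCRYPTED', ''),
            ('TextBox1', '106110006'),
            ('Button1', 'Show')]
if eventvalidation is not None:
    formData.insert(1, ('__EVENTVALIDATION', eventvalidation))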