How can I get the HTML code of a page in Python? - python

I need to get the HTML code of a website, but I only get a 403 error or a 403 status_code.
import urllib.request
req = urllib.request.Request(url)
req.add_header('User-Agent', 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7')
response = urllib.request.urlopen(req)
data = response.read() # a `bytes` object
html = data.decode('utf-8') # a `str`; this step can't be used if data is binary
print(html)
It gives this error:
urllib.error.HTTPError: HTTP Error 403: Forbidden
And I tried this too (verify=False doesn't work either):
import requests
res = requests.get('https://santehnika-online.ru/cart-link/b15caf63da6313698633f41747b9d9eb/', headers={'user-agent':'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'})
print(res.text, res.status_code)
What I get back is a "checking your browser" page, which never appears when I open the site in a real browser:
<!DOCTYPE html>
<html lang="ru">
<head>
<title>Проверка браузера</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta http-equiv="X-UA-Compatible" content="IE=edge" />
<meta name="viewport" content="width=device-width,initial-scale=1.0" />
<link rel="preconnect" href="https://fonts.googleapis.com">
<link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
</header>
<div class="message">
<div class="wrapper wrapper_message">
<div class="message-content">
<div class="message-title">Проверяем браузер</div>
<div>Сайт скоро откроется</div>
</div>
<div class="loader"></div>
</div>
</div>
<div class="captcha">
<form id="challenge-form" class="challenge-form" action="/cart-link/b15caf63da6313698633f41747b9d9eb/?__cf_chl_f_tk=iRm3UPbqu59isE.16Z5X1Rx8TYQzALW_hvQcP9ji_Rc-1676103711-0-gaNycGzNCXs" method="POST" enctype="application/x-www-form-urlencoded">
<div id="cf-please-wait">
<div id="spinner">
Please help me get a 200 status code and the actual HTML.

If you get a status code of 403, according to Google:
An HTTP 403 response code means that a client is forbidden from accessing a valid URL. The server understands the request, but it can't fulfill it because of client-side issues.
So the server is deliberately refusing to serve your request, usually because it suspects an automated client.
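In code you can detect this case before trying to parse anything; a minimal sketch with plain requests, reusing the URL from the question:
import requests

res = requests.get('https://santehnika-online.ru/cart-link/b15caf63da6313698633f41747b9d9eb/')
if res.status_code == 403:
    # the server understood the request but refuses to serve this client
    print('403 Forbidden - the site is rejecting this client')
else:
    res.raise_for_status()  # raise on any other 4xx/5xx error
    html = res.text
    print(html[:200])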

The best way to get HTML from websites (in my opinion) is the BeautifulSoup library.
A simple example:
from urllib.request import urlopen  # Python 3 (the old urllib2 is Python 2 only)
from bs4 import BeautifulSoup
# Fetch the html page
response = urlopen('http://tutorialspoint.com/python/python_overview.htm')
html_doc = response.read()
# Parse the html
soup = BeautifulSoup(html_doc, 'html.parser')
# Pretty-print the parsed html
strhtm = soup.prettify()
# Print the first few characters
print(strhtm[:225])
The best part of this library is usage like the following, which extracts the values of tags:
from urllib.request import urlopen
from bs4 import BeautifulSoup

response = urlopen('http://tutorialspoint.com/python/python_overview.htm')
html_doc = response.read()
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.title)         # the whole <title> tag
print(soup.title.string)  # just the title text
print(soup.a.string)      # text of the first <a> tag
print(soup.b.string)      # text of the first <b> tag
About the error: it may also be IP blocking, in which case you can use a proxy or rotate user agents.
But I think the website may be using cookies. You can track the website using the Network tab in your browser's inspector, check whether you get any suspicious cookies after visiting the site, add those cookies to your request, and see the results; a minimal sketch is shown below.
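For example (the cookie name and value here are placeholders; copy the real ones your browser sends from the Network tab):
import requests

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
# hypothetical cookie - replace with the name/value your browser actually sends
cookies = {'suspicious_cookie_name': 'value-copied-from-devtools'}

res = requests.get('https://santehnika-online.ru/cart-link/b15caf63da6313698633f41747b9d9eb/',
                   headers=headers, cookies=cookies)
print(res.status_code)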

The site is protected by Cloudflare, which is currently one of the best anti-bot technologies. Cloudflare checks whether the client can run JavaScript (this is the biggest difference between Python requests and a browser).
Use these projects, which are designed to pass through Cloudflare's protection (a sketch using the first one follows the list):
cloudscraper
undetected-chromedriver
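A minimal sketch using cloudscraper (pip install cloudscraper); it solves the JavaScript challenge before issuing the real request, though whether it succeeds depends on the site's protection level:
import cloudscraper

# create_scraper() returns a requests.Session-like object that first
# solves Cloudflare's JavaScript challenge, then sends the real request
scraper = cloudscraper.create_scraper()
res = scraper.get('https://santehnika-online.ru/cart-link/b15caf63da6313698633f41747b9d9eb/')
print(res.status_code)
print(res.text[:500])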

Related

Downloading excel file from url in pandas (post authentication)

I am facing a strange problem that I don't know much about, owing to my lack of knowledge of HTML. I want to download an Excel file, post login, from a website.
The file_url is:
file_url="https://xyz.xyz.com/portal/workspace/IN AWP ABRL/Reports & Analysis Library/CDI Reports/CDI_SM_Mar'20.xlsx"
There is a share button for the file which gives file_url2 (for the same file):
file_url2='http://xyz.xyz.com/portal/traffic/4a8367bfd0fae3046d45cd83085072a0'
When I use requests.get to read link 2 (after logging in to a session), I am able to read the Excel file into pandas. However, link 2 does not serve my purpose, as I can't schedule my report on it on a periodic basis (by changing Mar'20 to Apr'20, etc.). Link 1 suits my purpose, but calling r = requests.get on it returns the following in r.content:
b'\n\n\n\n\n\n\n\n\n\n<html>\n\t<head>\n\t\t<title></title>\n\t</head>\n\t\n\t<body bgcolor="#FFFFFF">\n\t\n\n\t<script language="javascript">\n\t\t<!-- \n\t\t\ttop.location.href="https://xyz.xyz.com/portal/workspace/IN%20AWP%20ABRL/Reports%20&%20Analysis%20Library/CDI%20Reports/CDI_SM_Mar\'20.xlsx";\t\n\t\t-->\n\t</script>\n\t</body>\n</html>'
I have tried all kinds of encoding/decoding of the URL but can't make sense of the alphanumeric URL (link 2).
My python code (working) is:
import requests
url = 'http://xyz.xyz.com/portal/site'
username=''
password=''
s = requests.Session()
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'}
r = s.get(url,auth=(username, password),verify=False,headers=headers)
r2 = s.get(file_url,verify=False,allow_redirects=True)
r2.content
# df=pd.read_excel(BytesIO(r2.content))
You get HTML with JavaScript that redirects the browser to a new URL, but requests can't run JavaScript; it is a simple method for blocking simple scripts/bots.
However, HTML is only a string, so you can use string functions to extract the URL and then fetch it with requests to get the file:
content = b'\n\n\n\n\n\n\n\n\n\n<html>\n\t<head>\n\t\t<title></title>\n\t</head>\n\t\n\t<body bgcolor="#FFFFFF">\n\t\n\n\t<script language="javascript">\n\t\t<!-- \n\t\t\ttop.location.href="https://xyz.xyz.com/portal/workspace/IN%20AWP%20ABRL/Reports%20&%20Analysis%20Library/CDI%20Reports/CDI_SM_Mar\'20.xlsx";\t\n\t\t-->\n\t</script>\n\t</body>\n</html>'
text = content.decode()
print(text)
print('\n---\n')
# extract the url between href=" and ";
start = text.find('href="') + len('href="')
end = text.find('";', start)
url = text[start:end]
print('url:', url)

response = s.get(url)  # `s` is the logged-in session from the question
Results:
<html>
<head>
<title></title>
</head>
<body bgcolor="#FFFFFF">
<script language="javascript">
<!--
top.location.href="https://xyz.xyz.com/portal/workspace/IN%20AWP%20ABRL/Reports%20&%20Analysis%20Library/CDI%20Reports/CDI_SM_Mar'20.xlsx";
-->
</script>
</body>
</html>
---
url: https://xyz.xyz.com/portal/workspace/IN%20AWP%20ABRL/Reports%20&%20Analysis%20Library/CDI%20Reports/CDI_SM_Mar'20.xlsx

Webscraping with http shows "Web page blocked"

I am trying to scrape an http website using proxies, and when I try to extract text, it shows "Web Page Blocked". How can I avoid this error?
My code is as follows:
url = "http://campanulaceae.myspecies.info/"
proxy_dict = {
'http' : "174.138.54.49:8080",
'https' : "174.138.54.49:8080"
}
page = requests.get(url, proxies=proxy_dict)
soup = BeautifulSoup(page.text,'html.parser')
print(soup)
I get the output below when I try to print text from the website:
<html>
<head>
<title>Web Page Blocked</title>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="NO-CACHE" http-equiv="PRAGMA"/>
<meta content="initial-scale=1.0" name="viewport"/>
........
<body bgcolor="#e7e8e9">
<div id="content">
<h1>Web Page Blocked</h1>
<p>Access to the web page you were trying to visit has been blocked in accordance with company policy. Please contact your system administrator if you believe this is in error.</p>
That's because you did not specify a user-agent in the request headers.
Quite often, sites block requests that come from robot-like sources.
Try it like this:
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.106 Safari/537.36'}
page = requests.get(url, headers=headers, proxies=proxy_dict)

Using session.get() to retrieve content from web page

I'm attempting to scrape listings from Autotrader.com using the following code:
import requests
session = requests.Session()
url = 'https://www.autotrader.com/cars-for-sale/Burlingame+CA-94010?searchRadius=10&zip=94010&marketExtension=include&isNewSearch=true&sortBy=relevance&numRecords=25&firstRecord=0'
homepage = session.get(url)
It looks like the connection was successfully established:
In[115]: homepage
Out[115]: <Response [200]>
However, accessing the homepage content shows an error message and nothing resembling the content accessible via browser:
In[121]: homepage.content
Out[121]:
<!DOCTYPE html>
<html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Autotrader - page unavailable</title>
(...)
<h1>We're sorry for any inconvenience, but the site is currently unavailable.</h1>
(...)
I've tried adding a different user agent to the headers using the user_agent module:
headers = {'User-Agent': generate_user_agent()}
homepage = session.get(url, headers=headers)
But I get the same result: page unavailable.
I also tried pointing to a security certificate (the root one?) that I downloaded from Chrome:
certificate = './certificate/root.cer'
homepage = session.get(url, headers=headers, verify=certificate)
but I see the error:
File "/Users/michaelboles/Applications/anaconda3/lib/python3.7/site-packages/OpenSSL/_util.py", line 54, in exception_from_error_queue
raise exception_type(errors)
Error: [('x509 certificate routines', 'X509_load_cert_crl_file', 'no certificate or crl found')]
So I may not be doing that last part correctly.
Can anyone offer any help on retrieving Autotrader webpage content as it is displayed in the browser?
I don't know what User-Agent this user_agent module generates, but when I run:
import requests
url = 'https://www.autotrader.com/cars-for-sale/Burlingame+CA-94010?searchRadius=10&zip=94010&marketExtension=include&isNewSearch=true&sortBy=relevance&numRecords=25&firstRecord=0'
headers = {'User-Agent':'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:72.0) Gecko/20100101 Firefox/72.0'} # <-- try this header
print( requests.get(url, headers=headers).text )
I get the normal page:
<!DOCTYPE html>
<html>
<head><script>
window.NREUM||(NREUM={}),__nr_require=function(t,n,e){function r(e){if(!n[e]){var o=n[e]={exports:{}};t[e][0].call(o.exports,function(n){var o=t[e][1][n];return r(o||n)},o,o.exports)}return n[e].exports}if("function"==typeof __nr_require)return __nr_require;for(var o=0;o<e.length;o++)r(e[o]);return r}({1:[function(t,n,e){function r(t){try{s.console&&console.log(t)}catch(n){}}var o,i=t("ee"),a=t(15),s={};try{o=localStorage.getItem("__nr_flags").split(","),console&&"function"==typeof console.log&&(s.console=!0,o.indexOf("dev")!==-1&&(s.dev=!0),o.indexOf("nr_dev")!==-1&&(s.nrDev=!0))}catch(c){}s.nrDev&&i.on("internal-error",func
...

Error 400 GET Bad Request with Python Requests

I'm trying to do some scraping on a web application using Python to extract information from it, and it is protected by HTTPBasicAuth.
This is my code so far:
from lxml import html
import requests
from requests.auth import HTTPBasicAuth
headers = { 'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:67.0) Gecko/20100101 Firefox/67.0' }
page = requests.get('https://10.0.0.1:999/app/info' , verify = False , auth = ('user' , 'pass'), headers = headers)
print(page.content.decode())
But this is the answer I get from print(page.content.decode()):
<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title>400 - Bad Request</title>
</head>
<body>
<h1>400 - Bad Request</h1>
</body>
</html>
What could be missing?
Apparently I was using HTTPBasicAuth, and I had to use HTTPDigestAuth. Even though the website seemed to be using Basic Authentication, after an inspection of the traffic using Burp Proxy, I noticed it was using Digest Authentication.
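For reference, the fix is a one-line change in the requests call: use the HTTPDigestAuth class from requests.auth instead of the plain (username, password) tuple, which defaults to Basic auth:
import requests
from requests.auth import HTTPDigestAuth

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:67.0) Gecko/20100101 Firefox/67.0'}
page = requests.get('https://10.0.0.1:999/app/info',
                    verify=False,
                    auth=HTTPDigestAuth('user', 'pass'),  # Digest instead of Basic
                    headers=headers)
print(page.content.decode())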

Web scraping, Url jump prevents authorization?

I've been trying to scrape this website (www.dearedu.com); specifically, I've been having tremendous difficulty logging in, having tried everything I could find in previously answered authorization questions on Stack Exchange.
Currently, I am using a requests session to log in, with the following code:
import http.cookiejar  # `cookielib` in Python 2
import requests

cj = http.cookiejar.CookieJar()
mySession = requests.session()
mySession.headers.update({'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36'})
resp = mySession.get('http://www.dearedu.com/', cookies=cj)
data = {'userid': myusername, 'pwd': mypassword,
        'fmdo': 'login', 'dopost': 'login',
        'keeptime': '604800', 'teshu': 't'}
resp = mySession.post('http://club.dearedu.com/member/index_do.php', data=data)
When the code above is run with the correct username and password, you get the following HTML:
<head>
<title>第二教育网提示信息</title>
<meta http-equiv="Content-Type" content="text/html; charset=gb2312" />
<base target='_self'/>
<style>div{line-height:160%;}</style></head>
<body leftmargin='0' topmargin='0' bgcolor='#FFFFFF'>
<center>
<script>
var pgo=0;
function JumpUrl(){
if(pgo==0){ location='http://www.dearedu.com'; pgo=1; }
}
document.write("<br /><div style='width:450px;padding:0px;border:1px
solid #DADADA;'><div style='padding:6px;font-size:12px;border-
bottom:1px solid #DADADA;background:#DBEEBD
url(/plus/img/wbg.gif)';'><b>第二教育网提示信息!</b></div>");
document.write("<div style='height:130px;font-
size:10pt;background:#ffffff'><br />");
document.write("成功登录,现在转向系统主页...");
document.write("<br /><a href='http://www.dearedu.com'>如果你的浏览器没
反应,请点击这里...</a><br/></div>");
setTimeout('JumpUrl()',1000);</script>
</center>
</body>
</html>
The thing I do not understand is that the cookies and status code received indicate that I have successfully logged in, but when I then try to access the main page, it indicates that I am not.
If I had to take a guess, it has something to do with the URL jumping; specifically, the website waits for about a second before redirecting you to the main page.
Can someone explain what is wrong and how to fix it? Thank you!
EDIT:
"成功登录,现在转向系统主页" = successfully logged in, redirecting to homepage
"如果你的浏览器没反应,请点击这里" = If your browser does not respond, please click here
The rest I do not think is relevant. Thanks!
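Given the JS redirect, requests never runs the JumpUrl() script, so after the POST you have to fetch the main page yourself, reusing the same session so the login cookies set by the POST are sent along. A minimal sketch, assuming the login POST above succeeded:
# reuse the logged-in session; cookies set by the login POST are stored on it
homepage = mySession.get('http://www.dearedu.com/')
print(homepage.status_code)
print(homepage.text[:500])  # check whether the page now shows you as logged in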
