I'm trying to scrape a web application with Python to extract some information from it, and the application is protected by HTTP Basic Auth.
This is my code so far:
from lxml import html
import requests
from requests.auth import HTTPBasicAuth

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:67.0) Gecko/20100101 Firefox/67.0'}

page = requests.get('https://10.0.0.1:999/app/info', verify=False, auth=('user', 'pass'), headers=headers)
print(page.content.decode())
But this is what print(page.content.decode()) returns:
<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title>400 - Bad Request</title>
</head>
<body>
<h1>400 - Bad Request</h1>
</body>
</html>
What could be missing?
Apparently the problem was that I was using HTTPBasicAuth when I had to use HTTPDigestAuth. Even though the website seemed to be using Basic Authentication, after inspecting the traffic with Burp Proxy I noticed it was actually using Digest Authentication.
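For reference, a minimal sketch of the working request, keeping the same URL, credentials, and header as above and only swapping the auth class:
import requests
from requests.auth import HTTPDigestAuth

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:67.0) Gecko/20100101 Firefox/67.0'}

# Same request as before, but authenticating with Digest instead of Basic
page = requests.get('https://10.0.0.1:999/app/info',
                    verify=False,
                    auth=HTTPDigestAuth('user', 'pass'),
                    headers=headers)
print(page.content.decode())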
Related
I need to get the HTML code of a website, but I only get a 403 error / 403 status_code.
import urllib.request
req = urllib.request.Request(url)
req.add_header('User-Agent', 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7')
response = urllib.request.urlopen(req)
data = response.read() # a `bytes` object
html = data.decode('utf-8') # a `str`; this step can't be used if data is binary
print(html)
It gives this error:
urllib.error.HTTPError: HTTP Error 403: Forbidden
I also tried this (adding verify=False doesn't work either):
import requests
res = requests.get('https://santehnika-online.ru/cart-link/b15caf63da6313698633f41747b9d9eb/', headers={'user-agent':'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'})
print(res.text, res.status_code)
What I get back is a "browser check" page, which does not appear when I open the website in a browser:
<!DOCTYPE html>
<html lang="ru">
<head>
<title>Проверка браузера</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta http-equiv="X-UA-Compatible" content="IE=edge" />
<meta name="viewport" content="width=device-width,initial-scale=1.0" />
<link rel="preconnect" href="https://fonts.googleapis.com">
<link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
</header>
<div class="message">
<div class="wrapper wrapper_message">
<div class="message-content">
<div class="message-title">Проверяем браузер</div>
<div>Сайт скоро откроется</div>
</div>
<div class="loader"></div>
</div>
</div>
<div class="captcha">
<form id="challenge-form" class="challenge-form" action="/cart-link/b15caf63da6313698633f41747b9d9eb/?__cf_chl_f_tk=iRm3UPbqu59isE.16Z5X1Rx8TYQzALW_hvQcP9ji_Rc-1676103711-0-gaNycGzNCXs" method="POST" enctype="application/x-www-form-urlencoded">
<div id="cf-please-wait">
<div id="spinner">
Please help me get a 200 status_code and the real HTML.
If you get a status code of 403, then, as Google puts it:
An HTTP 403 response code means that a client is forbidden from accessing a valid URL. The server understands the request, but it can't fulfill the request because of client-side issues.
So the server is deliberately refusing to serve your request.
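A minimal sketch of detecting that case with requests (the URL and header here are just placeholders):
import requests

headers = {'user-agent': 'Mozilla/5.0'}  # placeholder user agent
res = requests.get('https://example.com/some-page', headers=headers)

if res.status_code == 403:
    print('403 Forbidden: the server refused to serve this client')
else:
    res.raise_for_status()  # raise for any other HTTP error
    print(res.text[:200])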
The best way to get a website's HTML (in my opinion) is to use the BeautifulSoup library.
A simple example:
import urllib.request
from bs4 import BeautifulSoup

# Fetch the html file
response = urllib.request.urlopen('http://tutorialspoint.com/python/python_overview.htm')
html_doc = response.read()

# Parse the html file
soup = BeautifulSoup(html_doc, 'html.parser')

# Format the parsed html file
strhtm = soup.prettify()

# Print the first few characters
print(strhtm[:225])
The best part of this library is usage like the following; this part extracts tag values:
import urllib.request
from bs4 import BeautifulSoup

response = urllib.request.urlopen('http://tutorialspoint.com/python/python_overview.htm')
html_doc = response.read()
soup = BeautifulSoup(html_doc, 'html.parser')
print (soup.title)
print(soup.title.string)
print(soup.a.string)
print(soup.b.string)
As for the error: maybe your IP is being blocked, in which case you can use a proxy or different user agents.
But I think the website may also be using cookies. Track the site with the Network tab of your browser's developer tools to see whether you get any suspicious cookies after visiting it, add that cookie to your request, and check the result, as in the sketch below.
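A rough sketch of that idea, assuming the site sets a cookie you can copy from the browser (the cookie name and value here are hypothetical):
import requests

headers = {'user-agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'}
# Hypothetical cookie copied from the browser's Network tab; the real name
# and value depend on what the site actually sets.
cookies = {'session_id': 'value-copied-from-browser'}

res = requests.get('https://santehnika-online.ru/cart-link/b15caf63da6313698633f41747b9d9eb/',
                   headers=headers, cookies=cookies)
print(res.status_code)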
The site is protected by Cloudflare, which is currently one of the best anti-bot technologies. Cloudflare checks whether the client can run JavaScript (this is the biggest difference between Python requests and a browser).
Use one of these projects, which are designed to pass through Cloudflare's protection (a cloudscraper sketch follows below):
cloudscraper
undetected-chromedriver (undetected Selenium)
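For example, a minimal cloudscraper sketch (assuming pip install cloudscraper); it exposes a requests-like interface and tries to solve the Cloudflare JavaScript challenge for you:
import cloudscraper

# create_scraper() returns a requests.Session-like object that attempts to
# pass Cloudflare's JavaScript check before handing the page back
scraper = cloudscraper.create_scraper()
res = scraper.get('https://santehnika-online.ru/cart-link/b15caf63da6313698633f41747b9d9eb/')
print(res.status_code)
print(res.text[:500])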
I am facing a strange problem that I don't fully understand because of my lack of HTML knowledge. I want to download an Excel file, post login, from a website.
The file_url is:
file_url="https://xyz.xyz.com/portal/workspace/IN AWP ABRL/Reports & Analysis Library/CDI Reports/CDI_SM_Mar'20.xlsx"
There is a share button for the file which gives link 2 (for the same file):
file_url2='http://xyz.xyz.com/portal/traffic/4a8367bfd0fae3046d45cd83085072a0'
When I use requests.get to read link 2 (after logging in to a session), I can read the Excel file into pandas. However, link 2 does not serve my purpose, because I can't schedule my report on a periodic basis with it (by changing Mar'20 to Apr'20, etc.). Link 1 suits my purpose, but when I fetch it with requests.get, r.content gives the following:
b'\n\n\n\n\n\n\n\n\n\n<html>\n\t<head>\n\t\t<title></title>\n\t</head>\n\t\n\t<body bgcolor="#FFFFFF">\n\t\n\n\t<script language="javascript">\n\t\t<!-- \n\t\t\ttop.location.href="https://xyz.xyz.com/portal/workspace/IN%20AWP%20ABRL/Reports%20&%20Analysis%20Library/CDI%20Reports/CDI_SM_Mar\'20.xlsx";\t\n\t\t-->\n\t</script>\n\t</body>\n</html>'
I have tried all kinds of URL encoding/decoding but can't make sense of this alphanumeric URL (link 2).
My python code (working) is:
import requests
url = 'http://xyz.xyz.com/portal/site'
username=''
password=''
s = requests.Session()
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'}
r = s.get(url,auth=(username, password),verify=False,headers=headers)
r2 = s.get(file_url,verify=False,allow_redirects=True)
r2.content
# df=pd.read_excel(BytesIO(r2.content))
You get HTML with JavaScript that redirects the browser to a new URL, but requests can't run JavaScript; it is a simple method for blocking simple scripts/bots.
However, HTML is only a string, so you can use string functions to get the URL out of it and then use that URL with requests to get the file.
content = b'\n\n\n\n\n\n\n\n\n\n<html>\n\t<head>\n\t\t<title></title>\n\t</head>\n\t\n\t<body bgcolor="#FFFFFF">\n\t\n\n\t<script language="javascript">\n\t\t<!-- \n\t\t\ttop.location.href="https://xyz.xyz.com/portal/workspace/IN%20AWP%20ABRL/Reports%20&%20Analysis%20Library/CDI%20Reports/CDI_SM_Mar\'20.xlsx";\t\n\t\t-->\n\t</script>\n\t</body>\n</html>'
text = content.decode()
print(text)
print('\n---\n')
# Extract the URL between `href="` and `";`
start = text.find('href="') + len('href="')
end = text.find('";', start)
url = text[start:end]
print('url:', url)

# Fetch the file with the same logged-in session
response = s.get(url)
Results:
<html>
<head>
<title></title>
</head>
<body bgcolor="#FFFFFF">
<script language="javascript">
<!--
top.location.href="https://xyz.xyz.com/portal/workspace/IN%20AWP%20ABRL/Reports%20&%20Analysis%20Library/CDI%20Reports/CDI_SM_Mar'20.xlsx";
-->
</script>
</body>
</html>
---
url: https://xyz.xyz.com/portal/workspace/IN%20AWP%20ABRL/Reports%20&%20Analysis%20Library/CDI%20Reports/CDI_SM_Mar'20.xlsx
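With the extracted URL, the same logged-in session can then fetch the spreadsheet and load it into pandas, much like the commented-out line in the original code:
import pandas as pd
from io import BytesIO

# Follow the extracted redirect URL within the authenticated session
r2 = s.get(url, verify=False, allow_redirects=True)

# Load the downloaded Excel file straight into a DataFrame
df = pd.read_excel(BytesIO(r2.content))
print(df.head())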
I am trying to scrape a website, and when I do I get the output below.
Is there a way I can scrape this website?
url = "https://www.mustang6g.com/forums/threads/pre-collision-alert-system.132807/"
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
print(soup)
Output of the above code is as follows
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>403 Forbidden</title>
</head><body>
<h1>Forbidden</h1>
<p>You don't have permission to access this resource.</p>
</body></html>
The web server expects a User-Agent header to be passed:
import requests
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) '\
'AppleWebKit/537.36 (KHTML, like Gecko) '\
'Chrome/75.0.3770.80 Safari/537.36'}
URL = 'https://www.mustang6g.com/forums/threads/pre-collision-alert-system.132807/'
httpx = requests.get(URL, headers=headers)
print(httpx.text)
By passing the header, we told the server that we are Mozilla. :)
I am trying to scrape an HTTP website through proxies, and when I try to extract the text it shows "Web Page Blocked". How can I avoid this error?
My code is as follows
url = "http://campanulaceae.myspecies.info/"
proxy_dict = {
'http' : "174.138.54.49:8080",
'https' : "174.138.54.49:8080"
}
page = requests.get(url, proxies=proxy_dict)
soup = BeautifulSoup(page.text,'html.parser')
print(soup)
I get the output below when I try to print the text from the website.
<html>
<head>
<title>Web Page Blocked</title>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="NO-CACHE" http-equiv="PRAGMA"/>
<meta content="initial-scale=1.0" name="viewport"/>
........
<body bgcolor="#e7e8e9">
<div id="content">
<h1>Web Page Blocked</h1>
<p>Access to the web page you were trying to visit has been blocked in accordance with company policy. Please contact your system administrator if you believe this is in error.</p>
That's because you did not specify a user-agent in the request headers.
Quite often, sites block requests that come from robot-like sources.
Try it like this:
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.106 Safari/537.36'}
page = requests.get(url, headers=headers, proxies=proxy_dict)
I'm attempting to scrape listings from Autotrader.com using the following code:
import requests
session = requests.Session()
url = 'https://www.autotrader.com/cars-for-sale/Burlingame+CA-94010?searchRadius=10&zip=94010&marketExtension=include&isNewSearch=true&sortBy=relevance&numRecords=25&firstRecord=0'
homepage = session.get(url)
It looks like the connection was successfully established:
In[115]: homepage
Out[115]: <Response [200]>
However, accessing the homepage content shows an error message and nothing resembling the content accessible via browser:
In[121]: homepage.content
Out[121]:
<!DOCTYPE html>
<html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Autotrader - page unavailable</title>
(...)
<h1>We're sorry for any inconvenience, but the site is currently unavailable.</h1>
(...)
I've tried adding a different user agent in headers using user_agent:
headers = {'User-Agent': generate_user_agent()}
homepage = session.get(url, headers=headers)
But I get the same result: page unavailable.
I also tried pointing to a security certificate (the root one?) that I downloaded from Chrome:
certificate = './certificate/root.cer'
homepage = session.get(url, headers=headers, verify=certificate)
but I see the error:
File "/Users/michaelboles/Applications/anaconda3/lib/python3.7/site-packages/OpenSSL/_util.py", line 54, in exception_from_error_queue
raise exception_type(errors)
Error: [('x509 certificate routines', 'X509_load_cert_crl_file', 'no certificate or crl found')]
So I may not be doing that last part correctly.
Can anyone offer any help on retrieving Autotrader webpage content as it is displayed in the browser?
I don't know what User-Agent this user_agent module generates, but when I run:
import requests
url = 'https://www.autotrader.com/cars-for-sale/Burlingame+CA-94010?searchRadius=10&zip=94010&marketExtension=include&isNewSearch=true&sortBy=relevance&numRecords=25&firstRecord=0'
headers = {'User-Agent':'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:72.0) Gecko/20100101 Firefox/72.0'} # <-- try this header
print( requests.get(url, headers=headers).text )
I get the normal page:
<!DOCTYPE html>
<html>
<head><script>
window.NREUM||(NREUM={}),__nr_require=function(t,n,e){function r(e){if(!n[e]){var o=n[e]={exports:{}};t[e][0].call(o.exports,function(n){var o=t[e][1][n];return r(o||n)},o,o.exports)}return n[e].exports}if("function"==typeof __nr_require)return __nr_require;for(var o=0;o<e.length;o++)r(e[o]);return r}({1:[function(t,n,e){function r(t){try{s.console&&console.log(t)}catch(n){}}var o,i=t("ee"),a=t(15),s={};try{o=localStorage.getItem("__nr_flags").split(","),console&&"function"==typeof console.log&&(s.console=!0,o.indexOf("dev")!==-1&&(s.dev=!0),o.indexOf("nr_dev")!==-1&&(s.nrDev=!0))}catch(c){}s.nrDev&&i.on("internal-error",func
...