I am facing a strange problem that I don't fully understand, owing to my limited knowledge of HTML. I want to download an Excel file from a website after logging in.
The file_url is:
file_url="https://xyz.xyz.com/portal/workspace/IN AWP ABRL/Reports & Analysis Library/CDI Reports/CDI_SM_Mar'20.xlsx"
There is a share button for the file which gives the link2 (For the same file):
file_url2='http://xyz.xyz.com/portal/traffic/4a8367bfd0fae3046d45cd83085072a0'
When I use requests.get to fetch link 2 (after logging in to a session), I am able to read the Excel file into pandas. However, link 2 does not serve my purpose, as I can't schedule my report on it on a periodic basis (by changing Mar'20 to Apr'20, etc.). Link 1 suits my purpose, but r = requests.get on it returns the following in r.content:
b'\n\n\n\n\n\n\n\n\n\n<html>\n\t<head>\n\t\t<title></title>\n\t</head>\n\t\n\t<body bgcolor="#FFFFFF">\n\t\n\n\t<script language="javascript">\n\t\t<!-- \n\t\t\ttop.location.href="https://xyz.xyz.com/portal/workspace/IN%20AWP%20ABRL/Reports%20&%20Analysis%20Library/CDI%20Reports/CDI_SM_Mar\'20.xlsx";\t\n\t\t-->\n\t</script>\n\t</body>\n</html>'
I have tried every URL encoding and decoding I could think of, but I can't make sense of this alphanumeric URL (link 2).
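For what it's worth, the %20 runs in the redirect URL are just percent-encoded spaces, and the stdlib can reproduce the encoding from the raw path. A small sketch (keeping "/", "&" and "'" literal is an assumption based on the URL the server emits):

```python
from urllib.parse import quote

# The raw path from file_url, with literal spaces.
raw_path = "IN AWP ABRL/Reports & Analysis Library/CDI Reports/CDI_SM_Mar'20.xlsx"

# Percent-encode the path; keep "/", "&" and "'" unescaped so the result
# matches the URL seen in the JavaScript redirect.
encoded = quote(raw_path, safe="/&'")
print(encoded)
```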
My (working) Python code is:
import requests
url = 'http://xyz.xyz.com/portal/site'
username=''
password=''
s = requests.Session()
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'}
r = s.get(url,auth=(username, password),verify=False,headers=headers)
r2 = s.get(file_url,verify=False,allow_redirects=True)
r2.content
# df=pd.read_excel(BytesIO(r2.content))
You get HTML with JavaScript which redirects the browser to a new URL, but requests can't run JavaScript. It is a simple method for blocking simple scripts/bots.
Still, HTML is only a string, so you can use string functions to extract the URL from it and then fetch the file with requests.
content = b'\n\n\n\n\n\n\n\n\n\n<html>\n\t<head>\n\t\t<title></title>\n\t</head>\n\t\n\t<body bgcolor="#FFFFFF">\n\t\n\n\t<script language="javascript">\n\t\t<!-- \n\t\t\ttop.location.href="https://xyz.xyz.com/portal/workspace/IN%20AWP%20ABRL/Reports%20&%20Analysis%20Library/CDI%20Reports/CDI_SM_Mar\'20.xlsx";\t\n\t\t-->\n\t</script>\n\t</body>\n</html>'
text = content.decode()
print(text)
print('\n---\n')
start = text.find('href="') + len('href="')
end = text.find('";', start)
url = text[start:end]
print('url:', url)
response = s.get(url)
Results:
<html>
<head>
<title></title>
</head>
<body bgcolor="#FFFFFF">
<script language="javascript">
<!--
top.location.href="https://xyz.xyz.com/portal/workspace/IN%20AWP%20ABRL/Reports%20&%20Analysis%20Library/CDI%20Reports/CDI_SM_Mar'20.xlsx";
-->
</script>
</body>
</html>
---
url: https://xyz.xyz.com/portal/workspace/IN%20AWP%20ABRL/Reports%20&%20Analysis%20Library/CDI%20Reports/CDI_SM_Mar'20.xlsx
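If the markup ever shifts, a regular expression may be slightly more robust than chained find() calls for pulling the target out of top.location.href. A small sketch against an abridged copy of the HTML above:

```python
import re

# The HTML body returned for link 1 (abridged from the question).
content = (b'<script language="javascript">\n'
           b'top.location.href="https://xyz.xyz.com/portal/workspace/'
           b"IN%20AWP%20ABRL/Reports%20&%20Analysis%20Library/CDI%20Reports/CDI_SM_Mar'20.xlsx"
           b'";\n</script>')

# Capture everything between href=" and the next double quote.
match = re.search(r'href="([^"]+)"', content.decode())
url = match.group(1) if match else None
print(url)
```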
I need to get the HTML code of a website, but I only get a 403 error (403 status_code).
import urllib.request
req = urllib.request.Request(url)
req.add_header('User-Agent', 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7')
response = urllib.request.urlopen(req)
data = response.read() # a `bytes` object
html = data.decode('utf-8') # a `str`; this step can't be used if data is binary
print(html)
It gives this error:
urllib.error.HTTPError: HTTP Error 403: Forbidden
I also tried the following (verify=False doesn't help either):
import requests
res = requests.get('https://santehnika-online.ru/cart-link/b15caf63da6313698633f41747b9d9eb/', headers={'user-agent':'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'})
print(res.text, res.status_code)
I got it: it is a browser-check page, which doesn't appear if I open this website in a browser.
<!DOCTYPE html>
<html lang="ru">
<head>
<title>Проверка браузера</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta http-equiv="X-UA-Compatible" content="IE=edge" />
<meta name="viewport" content="width=device-width,initial-scale=1.0" />
<link rel="preconnect" href="https://fonts.googleapis.com">
<link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
</header>
<div class="message">
<div class="wrapper wrapper_message">
<div class="message-content">
<div class="message-title">Проверяем браузер</div>
<div>Сайт скоро откроется</div>
</div>
<div class="loader"></div>
</div>
</div>
<div class="captcha">
<form id="challenge-form" class="challenge-form" action="/cart-link/b15caf63da6313698633f41747b9d9eb/?__cf_chl_f_tk=iRm3UPbqu59isE.16Z5X1Rx8TYQzALW_hvQcP9ji_Rc-1676103711-0-gaNycGzNCXs" method="POST" enctype="application/x-www-form-urlencoded">
<div id="cf-please-wait">
<div id="spinner">
Please help me get a 200 status_code and proper HTML code.
If you're getting a 403 status code, then (per the definition Google surfaces):
An HTTP 403 response code means that a client is forbidden from accessing a valid URL. The server understands the request, but it can't fulfill the request because of client-side issues.
So the server is deliberately refusing this client, not failing to find the page.
The best way to get a website's HTML (in my opinion) is with the BeautifulSoup library.
A simple example:
import urllib.request
from bs4 import BeautifulSoup
# Fetch the HTML page
response = urllib.request.urlopen('http://tutorialspoint.com/python/python_overview.htm')
html_doc = response.read()
# Parse the HTML
soup = BeautifulSoup(html_doc, 'html.parser')
# Format the parsed HTML
strhtm = soup.prettify()
# Print the first few characters
print(strhtm[:225])
The best part of this library is usage like the following, which extracts tag values:
import urllib.request
from bs4 import BeautifulSoup

response = urllib.request.urlopen('http://tutorialspoint.com/python/python_overview.htm')
html_doc = response.read()
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.title)
print(soup.title.string)
print(soup.a.string)
print(soup.b.string)
As for the error: your IP may be being blocked, in which case you can use a proxy or different agents.
But I think the website may be relying on cookies. You can watch the Network tab in the browser's inspector to check whether you receive any suspicious cookies after visiting the site, then add those cookies to your request and see the results.
The site is protected by Cloudflare, which is currently one of the best such technologies. Cloudflare checks whether the client can run JavaScript (this is the biggest difference between Python requests and a browser).
Use these projects that are designed to pass through cloudflare's protection:
cloudscraper
undetected-chromedriver (Selenium)
Thanks in advance for looking into this query.
I am trying to extract data from an Angular response that is not visible in the HTML code when using Chrome's inspect function.
I researched, looked for solutions, and was able to find the data under Network (tab) > Fetch/XHR > Response (screenshots), and I also wrote code based on the knowledge I gained researching this topic.
Response
In order to read the response, I am trying the code below, passing the parameters and cookies grabbed from the main URL
into the request (see the code segment from the main code shared further below). The parameters were created based on information I found under Network (tab) > Fetch/XHR > Headers.
http = urllib3.PoolManager()
r = http.request('GET',
                 'https://www.barchart.com/proxies/core-api/v1/quotes/get?' + urlencode(params),
                 headers=headers)
QUESTIONS
Please help confirm: what am I missing or doing wrong? I want to read and store the JSON response; what should I be doing?
JSON to be extracted
Also, is there a way to read the params using a function, instead of assigning them as I have done below? What I mean is: similar to what I did for cookies (headers = x.cookies.get_dict()), is there a way to read and assign the parameters?
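On the second question: the stdlib can parse the parameters straight out of a request URL, rather than typing the dict by hand. A sketch using urllib.parse on the page URL from the question (the API endpoint's URL from the Network tab would be parsed the same way):

```python
from urllib.parse import urlparse, parse_qsl

url = ('https://www.barchart.com/etfs-funds/performance/percent-change/advances'
       '?viewName=main&timeFrame=5d&orderBy=percentChange5d&orderDir=desc')

# Split the URL and turn its query string into a dict of parameters.
params = dict(parse_qsl(urlparse(url).query))
print(params)
```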
Below is the full code I am using.
import requests
import urllib3
from urllib.parse import urlencode
url = 'https://www.barchart.com/etfs-funds/performance/percent-change/advances?viewName=main&timeFrame=5d&orderBy=percentChange5d&orderDir=desc'
header = {'accept': 'application/json', 'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36'}
s = requests.Session()
x = s.get(url, headers=header)
headers = x.cookies.get_dict()
params = {'lists': 'etfs.us.percent.advances.unleveraged.5d',
          'orderDir': 'desc',
          'fields': 'symbol,symbolName,lastPrice,weightedAlpha,percentChangeYtd,percentChange1m,percentChange3m,percentChange1y,symbolCode,symbolType,hasOptions',
          'orderBy': 'percentChange',
          'meta': 'field.shortName,field.type,field.description,lists.lastUpdate',
          'hasOptions': 'true',
          'page': '1',
          'limit': '100',
          'in(leverage%2C(1x))': '',
          'raw': '1'}
http = urllib3.PoolManager()
r = http.request('GET',
                 'https://www.barchart.com/proxies/core-api/v1/quotes/get?' + urlencode(params),
                 headers=headers)
r.data
The r.data response is below; it returns an error.
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">\n<HTML><HEAD><META
HTTP-EQUIV="Content-Type" CONTENT="text/html;
charset=iso-8859-1">\n<TITLE>ERROR: The request could not be
satisfied</TITLE>\n</HEAD><BODY>\n<H1>403 ERROR</H1>\n<H2>The request
could not be satisfied.</H2>\n<HR noshade size="1px">\nRequest
blocked.\nWe can\'t connect to the server for this app or website at
this time. There might be too much traffic or a configuration error.
Try again later, or contact the app or website owner.\n<BR
clear="all">\nIf you provide content to customers through CloudFront,
you can find steps to troubleshoot and help prevent this error by
reviewing the CloudFront documentation.\n<BR clear="all">\n<HR noshade
size="1px">\n<PRE>\nGenerated by cloudfront (CloudFront)\nRequest ID:
vcjzkFEpvdtf6ihDpy4dVkYx1_lI8SUu3go8mLqJ8MQXR-KRpCvkng==\n</PRE>\n<ADDRESS>\n</ADDRESS>\n</BODY></HTML>
You can get the response by name; in your screenshot, get?lists=etfs.us is the name you need. You also need to install playwright.
There is a guide here: https://www.zenrows.com/blog/web-scraping-intercepting-xhr-requests#use-case-nseindiacom
from playwright.sync_api import sync_playwright

url = "https://www.barchart.com/etfs-funds/performance/percent-change/advances?viewName=main&timeFrame=5d&orderBy=percentChange5d&orderDir=desc"

with sync_playwright() as p:
    def handle_response(response):
        # the endpoint we are interested in
        if "get?lists=etfs.us" in response.url:
            print(response.json())

    browser = p.chromium.launch()
    page = browser.new_page()
    page.on("response", handle_response)
    page.goto(url, wait_until="networkidle")
    page.context.close()
    browser.close()
I am trying to get data from a page. I've read the posts of other people who had the same problem; making a GET request first to get cookies, setting headers, none of it works. When I examine the output of print(soup.title.get_text()), I still get "Log In" as the returned title. The login_data has the same key names as the HTML <input> elements, e.g. <input name="ctl00$cphMain$logIn$UserName" ...> for username and <input name="ctl00$cphMain$logIn$Password" ...> for password. Not sure what to do next. I can't use Selenium, as I have to execute this script on an EC2 instance that's running a Splunk server.
import requests
from bs4 import BeautifulSoup
link = "****"
login_URL = "https://erecruit.elwoodstaffing.com/Login.aspx"
login_data = {
    "ctl00$cphMain$logIn$UserName": "****",
    "ctl00$cphMain$logIn$Password": "****"
}

with requests.Session() as session:
    z = session.get(login_URL)
    session.headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.63 Safari/537.36',
        'Content-Type': 'application/json;charset=UTF-8',
    }
    post = session.post(login_URL, data=login_data)
    response = session.get(link)
    html = response.text
    soup = BeautifulSoup(html, "html.parser")
    print(soup.title.get_text())
I actually found the answer.
You can basically just go to the Network tab in Chrome and copy the request as a cURL statement. Then use a website or tool to convert the cURL statement into its programming-language equivalent (Python, Node, Java, and so forth).
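A plausible reason the copied request works while the hand-built login_data does not is that ASP.NET login forms also expect hidden fields such as __VIEWSTATE and __EVENTVALIDATION, which a copied cURL request happens to include. A minimal stdlib sketch of collecting hidden inputs before the POST (the HTML fragment below is a made-up stand-in for the real login page):

```python
from html.parser import HTMLParser

class HiddenInputParser(HTMLParser):
    """Collects name/value pairs from <input type="hidden"> tags."""
    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == 'input' and a.get('type') == 'hidden':
            self.fields[a.get('name')] = a.get('value', '')

# Hypothetical login-page fragment; in practice this would come from
# session.get(login_URL).text
sample = '''
<form>
  <input type="hidden" name="__VIEWSTATE" value="abc123" />
  <input type="hidden" name="__EVENTVALIDATION" value="xyz789" />
  <input name="ctl00$cphMain$logIn$UserName" />
</form>
'''

parser = HiddenInputParser()
parser.feed(sample)
print(parser.fields)  # merge these into login_data before the POST
```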
I'm attempting to scrape listings from Autotrader.com using the following code:
import requests
session = requests.Session()
url = 'https://www.autotrader.com/cars-for-sale/Burlingame+CA-94010?searchRadius=10&zip=94010&marketExtension=include&isNewSearch=true&sortBy=relevance&numRecords=25&firstRecord=0'
homepage = session.get(url)
It looks like the connection was successfully established:
In[115]: homepage
Out[115]: <Response [200]>
However, accessing the homepage content shows an error message and nothing resembling the content accessible via browser:
In[121]: homepage.content
Out[121]:
<!DOCTYPE html>
<html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Autotrader - page unavailable</title>
(...)
<h1>We're sorry for any inconvenience, but the site is currently unavailable.</h1>
(...)
I've tried adding a different user agent to the headers using the user_agent module:
headers = {'User-Agent': generate_user_agent()}
homepage = session.get(url, headers=headers)
But I get the same result: page unavailable.
I also tried pointing to a security certificate (the root one?) that I downloaded from Chrome:
certificate = './certificate/root.cer'
homepage = session.get(url, headers=headers, verify=certificate)
but I see the error:
File "/Users/michaelboles/Applications/anaconda3/lib/python3.7/site-packages/OpenSSL/_util.py", line 54, in exception_from_error_queue
raise exception_type(errors)
Error: [('x509 certificate routines', 'X509_load_cert_crl_file', 'no certificate or crl found')]
So I may not be doing that last part correctly.
Can anyone offer any help on retrieving Autotrader webpage content as it is displayed in the browser?
I don't know what User-Agent this user_agent module generates, but when I run:
import requests
url = 'https://www.autotrader.com/cars-for-sale/Burlingame+CA-94010?searchRadius=10&zip=94010&marketExtension=include&isNewSearch=true&sortBy=relevance&numRecords=25&firstRecord=0'
headers = {'User-Agent':'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:72.0) Gecko/20100101 Firefox/72.0'} # <-- try this header
print( requests.get(url, headers=headers).text )
I get the normal page:
<!DOCTYPE html>
<html>
<head><script>
window.NREUM||(NREUM={}),__nr_require=function(t,n,e){function r(e){if(!n[e]){var o=n[e]={exports:{}};t[e][0].call(o.exports,function(n){var o=t[e][1][n];return r(o||n)},o,o.exports)}return n[e].exports}if("function"==typeof __nr_require)return __nr_require;for(var o=0;o<e.length;o++)r(e[o]);return r}({1:[function(t,n,e){function r(t){try{s.console&&console.log(t)}catch(n){}}var o,i=t("ee"),a=t(15),s={};try{o=localStorage.getItem("__nr_flags").split(","),console&&"function"==typeof console.log&&(s.console=!0,o.indexOf("dev")!==-1&&(s.dev=!0),o.indexOf("nr_dev")!==-1&&(s.nrDev=!0))}catch(c){}s.nrDev&&i.on("internal-error",func
...
I am trying to pull the ingredients list from the following webpage:
https://skinsalvationsf.com/2012/08/updated-comedogenic-ingredients-list/
So the first ingredient I want to pull would be Acetylated Lanolin, and the last ingredient would be Octyl Palmitate.
Looking at the page source for this URL, I learn that the pattern for the ingredients list looks like this:
<td valign="top" width="33%">Acetylated Lanolin <sup>5</sup></td>
So I wrote some code to pull the list, but it gives me zero results. The code is below.
import requests
from bs4 import BeautifulSoup

r = requests.get('https://skinsalvationsf.com/2012/08/updated-comedogenic-ingredients-list/')
soup = BeautifulSoup(r.text, 'html.parser')
results = soup.find_all('td', attrs={'valign': 'top'})
When I try len(results), it gives me a zero.
What am I doing wrong? Why am I not able to pull the list as intended? I am a beginner to web scrapers.
Your web scraping code is working as intended. However, your request did not work. If you check the status code of your request, you can see that you get a 403 status.
r = requests.get('https://skinsalvationsf.com/2012/08/updated-comedogenic-ingredients-list/')
print(r.status_code) # 403
What happens is that the server does not allow non-browser requests. To make it work, you need to send a header similar to what a browser would send:
import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': ('Mozilla/5.0 (Windows NT 6.1; WOW64) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/56.0.2924.76 Safari/537.36')
}
r = requests.get('https://skinsalvationsf.com/2012/08/updated-comedogenic-ingredients-list/', headers=headers)
soup = BeautifulSoup(r.text, 'html.parser')
results = soup.find_all('td', attrs={'valign': 'top'})
print(len(results))
Your soup request is forbidden, hence you cannot crawl it. It seems the website is blocking scraping.
print(soup)
<html>
<head><title>403 Forbidden</title></head>
<body bgcolor="white">
<center><h1>403 Forbidden</h1></center>
<hr/><center>nginx</center>
</body>
</html>