I'm writing some tests with Selenium and noticed that the Referer header is missing from the requests. I wrote the following minimal example to test this against https://httpbin.org/headers:
import selenium.webdriver
import selenium.webdriver.support.ui  # needed for WebDriverWait

options = selenium.webdriver.FirefoxOptions()
options.add_argument('--headless')
profile = selenium.webdriver.FirefoxProfile()
# disable the JSON viewer so page_source contains the raw response
profile.set_preference('devtools.jsonview.enabled', False)
driver = selenium.webdriver.Firefox(firefox_options=options, firefox_profile=profile)
wait = selenium.webdriver.support.ui.WebDriverWait(driver, 10)

driver.get('http://www.python.org')
assert 'Python' in driver.title

url = 'https://httpbin.org/headers'
driver.execute_script('window.location.href = "{}";'.format(url))
wait.until(lambda driver: driver.current_url == url)
print(driver.page_source)
driver.close()
Which prints:
<html><head><link rel="alternate stylesheet" type="text/css" href="resource://content-accessible/plaintext.css" title="Wrap Long Lines"></head><body><pre>{
  "headers": {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "en-US,en;q=0.5",
    "Connection": "close",
    "Host": "httpbin.org",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:64.0) Gecko/20100101 Firefox/64.0"
  }
}
</pre></body></html>
So there is no Referer. However, if I browse to any page and manually execute
window.location.href = "https://httpbin.org/headers"
in the Firefox console, Referer does appear as expected.
As pointed out in the comments below, when using
driver.get("javascript: window.location.href = '{}'".format(url))
instead of
driver.execute_script("window.location.href = '{}';".format(url))
the request does include Referer. Also, when using Chrome instead of Firefox, both methods include Referer.
So the main question still stands: Why is Referer missing in the request when sent with Firefox as described above?
Referer as per the MDN documentation
The Referer request header contains the address of the previous web page from which a link to the currently requested page was followed. The Referer header allows servers to identify where people are visiting them from and may use that data for analytics, logging, or optimized caching, for example.
Important: Although this header has many innocent uses it can have undesirable consequences for user security and privacy.
Source: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Referer
However:
A Referer header is not sent by browsers if:
The referring resource is a local "file" or "data" URI.
An unsecured HTTP request is used and the referring page was received with a secure protocol (HTTPS).
Source: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Referer
Privacy and security concerns
There are some privacy and security risks associated with the Referer HTTP header:
The Referer header contains the address of the previous web page from which a link to the currently requested page was followed, which can be further used for analytics, logging, or optimized caching.
Source: https://developer.mozilla.org/en-US/docs/Web/Security/Referer_header:_privacy_and_security_concerns#The_referrer_problem
Addressing the security concerns
From the Referer header perspective, the majority of security risks can be mitigated by following these steps (a minimal server-side sketch follows the sources below):
Referrer-Policy: Using the Referrer-Policy header on your server to control what information is sent through the Referer header. Again, a directive of no-referrer would omit the Referer header entirely.
The referrerpolicy attribute on HTML elements that are in danger of leaking such information (such as <img> and <a>). This can for example be set to no-referrer to stop the Referer header being sent altogether.
The rel attribute set to noreferrer on HTML elements that are in danger of leaking such information (such as <img> and <a>).
The Exit Page Redirect technique: this is currently the only method that works without flaw. The idea is to have an exit page that you don't mind appearing inside the Referer header. Many websites, including Google and Facebook, implement this method. Instead of the referrer data exposing private information, it only shows the website the user came from, if implemented correctly. Instead of the referrer data appearing as http://example.com/user/foobar, the new referrer data will appear as http://example.com/exit?url=http%3A%2F%2Fexample.com. The method works by having all external links on your website go to an intermediary page that then redirects to the final page, with the full destination URL URL-encoded into the url parameter of that exit page.
Sources:
https://developer.mozilla.org/en-US/docs/Web/Security/Referer_header:_privacy_and_security_concerns#How_can_we_fix_this
https://geekthis.net/post/hide-http-referer-headers/#exit-page-redirect
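As a minimal server-side sketch of the first mitigation (assuming a Flask app purely for illustration; the framework choice is not from the sources):

from flask import Flask

app = Flask(__name__)

@app.route('/')
def index():
    return '<a href="https://example.com">external link</a>'

@app.after_request
def set_referrer_policy(response):
    # 'no-referrer' tells browsers to omit the Referer header entirely
    # for navigations away from pages carrying this response header
    response.headers['Referrer-Policy'] = 'no-referrer'
    return response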
This use case
I have executed your code through both the GeckoDriver/Firefox and ChromeDriver/Chrome combinations:
Code Block:
from selenium.webdriver.support.ui import WebDriverWait

driver.get('http://www.python.org')
assert 'Python' in driver.title
url = 'https://httpbin.org/headers'
driver.execute_script('window.location.href = "{}";'.format(url))
WebDriverWait(driver, 10).until(lambda driver: driver.current_url == url)
print(driver.page_source)
Observation:
Using GeckoDriver/Firefox, the Referer: "https://www.python.org/" header was missing, as follows:
{
  "headers": {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "en-US,en;q=0.5",
    "Host": "httpbin.org",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Windows NT 6.2; Win64; x64; rv:67.0) Gecko/20100101 Firefox/67.0"
  }
}
Using ChromeDriver/Chrome, the Referer: "https://www.python.org/" header was present, as follows:
{
  "headers": {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "en-US,en;q=0.9",
    "Host": "httpbin.org",
    "Referer": "https://www.python.org/",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.80 Safari/537.36"
  }
}
Conclusion:
It seems to be an issue with GeckoDriver/Firefox in handling the Referer header.
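Until that changes, the observation from the question itself suggests a workaround for Firefox: navigating via a javascript: URL passed to driver.get() did send the Referer, while execute_script() did not. A minimal sketch:

url = 'https://httpbin.org/headers'
# driver.get() with a javascript: URL included the Referer in the tests
# above, unlike driver.execute_script()
driver.get("javascript: window.location.href = '{}'".format(url))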
Outro
Referrer Policy
I am trying to send an HTTP POST request to a website with these headers:
headers = {
    "content-type": "application/x-www-form-urlencoded; charset=UTF-8",
    "cookie": "__gpi=UID=00000625243f2b12:T=1654153135:RT=1654342443:S=ALNI_MbdFxSgua2dONohDTz9bEGks8vnoQ; __gads=ID=05dae5d77dbc463f:T=1654153135:S=ALNI_MbLIzKIHhP022gtr7bRBqu9PSxNtQ; PHPSESSID=8a932c5bbe4d667513dfdc3a0051ed37",
    "origin": "https://www.dcode.fr",
    "pragma": "no-cache",
    "referer": "https://www.dcode.fr/cipher-identifier",
    "sec-fetch-site": "same-origin",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.67 Safari/537.36 OPR/87.0.4390.45",
    "x-requested-with": "XMLHttpRequest"
}
At first it works perfectly, but after some time it stops working. I think it is because the cookies expire.
Erroneous output:
{"captcha":"<script>$.getScript('https:\/\/www.google.com\/recaptcha\/api.js').done(function( script, textStatus ) {\n $('#captcha').addClass('g-recaptcha').attr({'data-sitekey':'6LeoCVQaAAAAALADLNorGItVJxP40YUjD1Q3S0zp','data-callback':'recaptcha_callback'});\n });\n<\/script>\n<div id='captcha'><\/div>"}
Expected output:
{"caption":"dCode's analyzer suggests to investigate:","results":{"<a href=\"\/rot-13-cipher\" target=\"_blank\">ROT-13 Cipher<\/a>":"\u25a0\u25a0","<a href=\"\/base-58-cipher\" target=\"_blank\">Base 58<\/a>":"\u25a0","<a href=\"\/playfair-cipher\" target=\"_blank\">PlayFair Cipher<\/a>":"\u25a0","<a href=\"\/base-64-encoding\" target=\"_blank\">Base64 Coding<\/a>":"\u25a0","<a href=\"\/substitution-cipher\" target=\"_blank\">Substitution Cipher<\/a>":"\u25aa","<a href=\"\/rot-cipher\" target=\"_blank\">ROT Cipher<\/a>":"\u25aa","<a href=\"\/caesar-cipher\" target=\"_blank\">Caesar Cipher<\/a>":"\u25aa","<a href=\"\/shift-cipher\" target=\"_blank\">Shift Cipher<\/a>":"\u25aa","<a href=\"\/hill-cipher\" target=\"_blank\">Hill Cipher<\/a>":"\u25aa","<a href=\"\/affine-cipher\" target=\"_blank\">Affine Cipher<\/a>":"\u25aa","<a href=\"\/keyboard-change-cipher\" target=\"_blank\">Keyboard Change Cipher<\/a>":"\u25ab","<a href=\"\/vigenere-cipher\" target=\"_blank\">Vigenere Cipher<\/a>":"\u25ab","<a href=\"\/homophonic-cipher\" target=\"_blank\">Homophonic Cipher<\/a>":"\u25ab","<a href=\"\/autoclave-cipher\" target=\"_blank\">Autoclave Cipher<\/a>":"\u25ab","<a href=\"\/beaufort-cipher\" target=\"_blank\">Beaufort Cipher<\/a>":"\u25ab","<a href=\"\/burrows-wheeler-transform\" target=\"_blank\">Burrows\u2013Wheeler Transform<\/a>":"\u25ab"}
If I copy the cookie from the captured requests in the browser's developer tools and paste it into my code, it works again for a short time.
How can I bypass this recaptcha error?
The website or API is running some kind of JS authentication to block anything that is not a browser. To bypass this you have two options:
Either reverse the JS, understand how the cookies are constructed, and replicate them in Python (this is very hard and might take weeks of reverse engineering),
or create a Selenium instance that visits the site, waits for the cookies to be present, and then passes them to requests, as sketched below. You will have to do this each time the captcha is presented (this is the easier option, but it will make your script slower).
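A minimal sketch of the second option (the endpoint URL, wait step, and payload are illustrative placeholders, not the site's confirmed API):

import requests
from selenium import webdriver

driver = webdriver.Firefox()
driver.get('https://www.dcode.fr/cipher-identifier')
# ... wait here until the page has fully loaded and its cookies exist ...

session = requests.Session()
for cookie in driver.get_cookies():
    # copy the browser-built cookies into the requests session
    session.cookies.set(cookie['name'], cookie['value'])
driver.quit()

endpoint = 'https://www.dcode.fr/api/'  # placeholder, not the confirmed endpoint
payload = {}                            # the form data you were POSTing
response = session.post(endpoint, data=payload)
print(response.text)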
This is not necessarily because the cookies expired; take a look at your output: it's a recaptcha. You need to solve the captcha first.
In addition to that, make sure you are changing requests' default user agent.
Consider using requests.Session if you are not using it already, or alternatively Selenium if possible.
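For example, a sketch combining both suggestions (the user agent below is simply the one from your own captured headers):

import requests

session = requests.Session()
# replace requests' default User-Agent (python-requests/x.y.z)
session.headers.update({
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.67 Safari/537.36 OPR/87.0.4390.45",
})
# cookies set by earlier responses in this session are sent automatically
response = session.get('https://www.dcode.fr/cipher-identifier')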
I see there are some URLs fetching metadata (JSON) for the browser to render the website, i.e. when I hit example.com, in Firefox's developer view -> Network tab there is a URL like https://example.com/server/getdata?cmd=showResults.
So, my question is: I can access that URL in a new tab in the same Firefox window (and get the expected JSON data), but I can't access the same URL in another Firefox window (it returns empty JSON). It is maintaining some kind of session (maybe with cookies?). I copied the exact same HTTP header values from the developer view and created a Python script with requests at that moment to test, but the Python script returns empty JSON.
Example Screenshot
Python Code
parameter = {
    "Accept": "application/json, text/javascript, */*; q=0.01",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "en-US,en;q=0.5",
    "Cache-Control": "no-store, max-age=0",
    "Connection": "keep-alive",
    "Content-Length": "13175",
    "Content-Type": "application/x-www-form-urlencoded",
    "DNT": "1",
    "Host": "in.example.com",
    "Cookie": '__cfduid=xxxxxxxxxxxxxxxxx; __cfruid=xxxxxxx-1520022406; mqttuid=1.361660689',
    "Referer": "https://in.example.com/page1/page2",
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:58.0) Gecko/20100101 Firefox/58.0",
    "X-Requested-With": "XMLHttpRequest"
}
response = requests.post(url="https://in.example.com/serv/getData?cmd=XXXX&type=XX&XXXX=1&_=1520022652009", data=parameter)
#print(dir(response))
print(response.headers)
print(response.json())
How can I simulate the session and directly hit the URL without hitting the root website?
PS: The site is a static website.
UPDATE 1
Changed to headers=parameter:
response = requests.get(url="https://in.example.com/server1/getallData?cmd=xxxx&_=1520097652234", headers=parameter)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='in.example.com', port=443): Max retries exceeded with url: /server1/getallData?cmd=xxxxx&_=1520097652934 (Caused by <class 'ConnectionResetError'>: [Errno 104] Connection reset by peer)
Getting a Connection Reset exception. Looks like Cloudflare is doing something? Any ideas?
You are passing the headers as data? You should use the headers parameter instead:
import requests

headers = {'User-Agent': 'Bot'}
requests.get('https://example.com/params', headers=headers)
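A fuller sketch for this case, assuming the endpoint depends on cookies the site sets while you browse it (the flow is a guess, not confirmed):

import requests

# strip entries that requests should manage itself or that are stale
clean_headers = {k: v for k, v in parameter.items()
                 if k not in ("Cookie", "Content-Length", "Host")}

session = requests.Session()
# visit the referring page first so any server-set cookies land in the session
session.get("https://in.example.com/page1/page2", headers=clean_headers)
response = session.get(
    "https://in.example.com/server1/getallData?cmd=xxxx&_=1520097652234",
    headers=clean_headers,
)
print(response.json())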
I'm trying to automate the recovery of data from this website (the one I want is "BVBG.086.01 PriceReport"). Checking with Firefox, I found out that the request URL to which the POST is made is "http://www.bmf.com.br/arquivos1/lum-download_ipn.asp", and the parameters are:
hdnStatus: "ativo"
chkArquivoDownload_ativo: "28"
txtDataDownload_ativo: "09/02/2018"
imgSubmeter: "Download"
txtDataDownload_externo_ativo: ["25/08/2017", "25/08/2017", "25/08/2017"]
So, if I use hurl.it to make the request, the response is the correct 302 redirect (pointing to an FTP URL where the requested files are, something like "Location: /FTP/Temp/10981738/Download.ex_"). (Example of the request here.)
So I've tried doing the same with the following code (using Python's requests library; I have tried both versions of request_body, putting each into the data parameter of the post method):
from bs4 import BeautifulSoup
from requests import post

request_url = "http://www.bmf.com.br/arquivos1/lum-download_ipn.asp"
request_headers = {
    "Host": "www.bmf.com.br",
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:54.0) Gecko/20100101 Firefox/54.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    "Accept-Encoding": "gzip, deflate",
    "Referer": "http://www.bmf.com.br/arquivos1/lum-arquivos_ipn.asp?idioma=pt-BR&status=ativo",
    "Content-Type": "application/x-www-form-urlencoded",
    "Content-Length": "236",
    "Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1"
}
# request_body = "hdnStatus=ativo&chkArquivoDownload_ativo=28&txtDataDownload_ativo=09/02/2018&imgSubmeter=Download&txtDataDownload_externo_ativo=25/08/2017&txtDataDownload_externo_ativo=25/08/2017&txtDataDownload_externo_ativo=25/08/2017"
request_body = {
    "hdnStatus": "ativo",
    "chkArquivoDownload_ativo": "28",
    "txtDataDownload_ativo": "09/02/2018",
    "imgSubmeter": "Download",
    "txtDataDownload_externo_ativo": ["25/08/2017", "25/08/2017", "25/08/2017"]
}
result_query = post(request_url, request_body, headers=request_headers)
# result_query = post(request_url, data=request_body, headers=request_headers)
for red in result_query.history:
    print(BeautifulSoup(red.content, "lxml"))
    print()
print(result_query.url)
And what I get is the following response:
<html><head><title>Object moved</title></head>
<body><h1>Object Moved</h1>This object may be found here.</body>
</html>
<html><head><title>Object moved</title></head>
<body><h1>Object Moved</h1>This object may be found here.</body>
</html>
<html><head><title>Object moved</title></head>
<body><h1>Object Moved</h1>This object may be found here.</body>
</html>
http://www.bmf.com.br/arquivos1/lum-arquivos_ipn.asp?idioma=pt-BR&status=ativo
And not the one I wanted (which should point to the location of the file). What am I doing wrong here?
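One way to debug this is to stop requests from following redirects and inspect each Location header directly (a sketch for inspection only, assuming the same request_url, request_body, and request_headers as above):

result_query = post(request_url, data=request_body,
                    headers=request_headers, allow_redirects=False)
print(result_query.status_code)              # expect 302 on success
print(result_query.headers.get('Location'))  # the redirect target, if any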
I am a beginner in Python. I just wrote a very simple web crawler, and it causes high memory usage when I run it. Not sure what's wrong in my code; I spent quite some time on it but can't resolve it.
I intend to use it to capture some job info from the following link: http://search.51job.com/jobsearch/search_result.php?fromJs=1&jobarea=070200%2C00&district=000000&funtype=0000&industrytype=00&issuedate=9&providesalary=06%2C07%2C08%2C09%2C10&keywordtype=2&curr_page=1&lang=c&stype=1&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&lonlat=0%2C0&radius=-1&ord_field=0&list_type=0&dibiaoid=0&confirmdate=9
The crawler extracts the links of each job and generates the id of each job from the links. Then it reads the job title from each link through XPath and prints all the info out at the end. Even though there are only 50 links, it makes my computer nearly unresponsive every time before it finishes printing. Below is my code.
I just added the header; this is needed to parse the link of each job. My environment is Ubuntu 16.04, Python 3.5, PyCharm.
import requests
from lxml import etree
import re

headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate",
    "Accept-Language": "en-US,en;q=0.5",
    "Connection": "keep-alive",
    "Host": "jobs.51job.com",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0"
}

def generate_info(url):
    html = requests.get(url, headers=headers)
    html.encoding = 'GBK'
    select = etree.HTML(html.text.encode('utf-8'))
    job_id = re.sub('[^0-9]', '', url)  # the job id is the digits in the link
    job_title = select.xpath('/html/body//h1/text()')
    print(job_id, job_title)

sum_page = 'http://search.51job.com/jobsearch/search_result.php?fromJs=1&jobarea=070200%2C00&district=000000&funtype=0000&industrytype=00&issuedate=9&providesalary=06%2C07%2C08%2C09%2C10&keywordtype=2&curr_page=1&lang=c&stype=1&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&lonlat=0%2C0&radius=-1&ord_field=0&list_type=0&dibiaoid=0&confirmdate=9'
sum_html = requests.get(sum_page)
sum_select = etree.HTML(sum_html.text.encode('utf-8'))
urls = sum_select.xpath('//*[@id="resultList"]/div/p/span/a/@href')
for url in urls:
    generate_info(url)
I have a script where I am trying to search a Google page via Selenium to test something. Whenever I open up the WebDriver, I get a captcha form:
from selenium import webdriver

fp = webdriver.FirefoxProfile()
driver = webdriver.Firefox(firefox_profile=fp)
driver.get('https://www.google.com/search?q=asdf')
However, if I open the exact same page, https://www.google.com/search?q=asdf, in a browser, it works fine. Why does Google raise the captcha, and what parameters can I send with webdriver such that it 'looks' like a normal browser and the captcha isn't raised?
Note, I have tried adding my user agent, and it still raises the same error:
fp = webdriver.FirefoxProfile()
fp.set_preference("general.useragent.override","Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:32.0) Gecko/20100101 Firefox/32.0")
driver = webdriver.Firefox(firefox_profile=fp)
Here is an example of my Request headers from the normal browser:
You need to set the user agent. See this SO answer on using set_preference.
Pass all the headers using requests:
headers = {
    "Host": "www.google.com",
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:32.0) Gecko/20100101 Firefox/32.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    "Accept-Encoding": "gzip, deflate",
    "Cookie": "PREF=ID=0df7e6fbda0c09d3:U=bfc47b624b57a0e9:FF=0:TM=1414961297:LM=1414961298:S=2FtJad1BEeJ0M5XS; NID=67=t5zTrFVtG4cLZH2kVmsQEbqDRFJisM86z1s27zx0A6vTR0MWqg69DaY39muso6fIEgqnli7IaEv1Rge1ZxBG0Nr1_3KH1aLu_z1-Ar48oiVDFFSVX4KDRgWnHQWjUfHC",
    "Connection": "keep-alive",
    "Cache-Control": "max-age=0",
}