Missing headers on POST request with python requests - python

I'm currently using the python requests module to perform automated HTTP tasks on a website.
The problem is that I don't get the same results on my console as on my browser.
This is what I get when making the POST request in my browser: the Response headers include the "Location" header I'm after.
This is what I get when making the same POST request through the Python requests module and reading .headers on the response:
{
'Date': 'Fri, 14 Jul 2017 15:19:22 GMT',
'Content-Type': 'text/html; charset=utf-8',
'Transfer-Encoding': 'chunked',
'Connection': 'keep-alive',
'Cache-Control': 'private',
'Location': '/cart/view',
'Set-Cookie': 'png.notice=9Hz8GWQ38JQZqTrqcsnn1J5nfgIZt71orHtf71mI+rwqFpQg4RnV7BqZni/GgIS/SmUnC4jgnhjQuDhZNW2adxeLctG+bToT0wTTbgxe40t5RmbVv1viuH2gkL1eH2xN3IavOUBhVXm+JlQrmVnHLocqjgvWi8wAClLYmrShY1U2ege9; expires=Fri, 14-Jul-2017 15:34:03 GMT; path=/; HttpOnly',
'X-Powered-By': 'ASP.NET',
'X-UA-Compatible': 'IE=Edge,chrome=1',
'Server': 'cloudflare-nginx',
'CF-RAY': '37e575befbf43c35-CDG'
}
Notice how the two results are completely different.
I'm trying to get the "Location" header inside the Response headers (the one beginning with "https://live.adyen.com/hpp...").
What am I doing wrong here?
EDIT: This is my source code:
request = session.post('https://www.nakedcph.com/cart/process', data=user_info)
request.url
# outputs 'https://www.nakedcph.com/cart/view' (probably the issue)
request.headers
# outputs the headers (but not all of them?)
PS: After making the POST request, the website redirects to the URL inside the "Location" header from the Response headers.

I figured it out. I was missing some parameters in the POST request. My bad.
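Worth noting for anyone hitting the same symptom: requests follows redirects by default, so request.headers shows the headers of the final page (/cart/view), not the intermediate response that carried the "Location" header. A minimal sketch of how to see that intermediate response, reusing the session and user_info names from the question:

# Don't follow the redirect, so the 302/303 response is returned as-is
r = session.post('https://www.nakedcph.com/cart/process',
                 data=user_info, allow_redirects=False)
print(r.status_code)                  # e.g. 302
print(r.headers.get('Location'))      # the redirect target

# Or let requests follow the redirect and inspect the chain afterwards
r = session.post('https://www.nakedcph.com/cart/process', data=user_info)
for hop in r.history:
    print(hop.status_code, hop.headers.get('Location'))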

Related

Why am I getting a different response from server with URL passed from list to requests.get() vs. hard-coded string?

I'm trying to scrape pre-rendered JSON from multiple URLS from a particular server.
When I use requests.get() with a hardcoded URL, or a string variable containing a hard-coded URL, I get the JSON I want from the server.
requests.get("https://example.url/example.cgi/example")
the .headers property on the response object returns:
{'Server': 'nginx', 'Date': 'Wed, 28 Oct 2020 00:49:22 GMT', 'Content-Type': 'application/json; charset=utf-8', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'X-Confex-Backend': 'es-director', 'Content-Encoding': 'gzip'}
But, when I pass the exact same url from a list to requests.get():
url_list = ['https://example.url/example.cgi/example']
for url in url_list:
    requests.get(url)
I do not get the JSON response from the server. I get HTML instead, with none of the JSON I want, as shown by the headers of the response object (I can't post the contents or the server URL here):
{'Server': 'nginx', 'Date': 'Wed, 28 Oct 2020 00:49:22 GMT', 'Content-Type': 'text/html; charset=UTF-8', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Pragma': 'no-cache', 'Cache-Control': 'no-cache,no-store', 'Expires': 'Thu, 01 Jan 1970 00:00:00 GMT', 'X-Confex-Backend': 'weba13', 'Content-Encoding': 'gzip'}
I'm stumped. I've tried converting the list item to a string, re-encoding it, etc. I've even tried switching the order of the GET requests in my testing, and I get the same results. What is going on with requests.get() such that a URL passed as a list item gets a very different response from the server than the same URL hard-coded into the call (or into a variable passed to the call)? What am I missing? Suffice to say, it would be great to be able to iterate requests.get() over a list of URLs for this purpose...
I figured it out. -Facepalm-
So, in this particular case I had compiled the URLs in the list from scraping HTML on the same site. It turns out those URLs are one character off from the hardcoded URLs I was seeing XHR GET requests use to fetch the JSON: app.cgi vs. api.cgi in the middle of the URL. Now I know to run checks on that sort of thing in the future.
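One way to catch this kind of mismatch early is to check each response's Content-Type before trying to parse it, since the two header dumps above differ exactly there. A small sketch along those lines (the URL is a placeholder, as in the question):

import requests

url_list = ['https://example.url/example.cgi/example']  # URLs scraped from the page
for url in url_list:
    r = requests.get(url)
    content_type = r.headers.get('Content-Type', '')
    if 'application/json' in content_type:
        data = r.json()    # this is the JSON endpoint
    else:
        # HTML came back instead of JSON: the scraped URL is probably not the API endpoint
        print('Unexpected Content-Type %r for %s' % (content_type, url))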

How can I request (get) and read an xml file using python?

I tried requesting an RSS feed on Treasury Direct using Python. In the past I've used the urllib or requests libraries for this purpose and they've worked fine. This time, however, I keep getting a 406 status error, which I understand is the page's way of telling me it doesn't accept the header details of my request. I've tried altering them, to no avail.
This is what I've tried:
import requests
url = 'https://www.treasurydirect.gov/TA_WS/securities/announced/rss'
user_agent = {'User-agent': 'Mozilla/5.0'}
response = requests.get(url, headers = user_agent)
print(response.text)
Environments: Python 2.7 and 3.4.
I also tried accessing via curl with the same exact error.
I believe this to be page specific, but can't figure out how to appropriately frame a request to read this page.
I found an API on the page which I can read the same data in json so this issue is now more of a curiosity to me than a true problem.
Any answers would be greatly appreciated!
Header Details
{'surrogate-control': 'content="ESI/1.0",no-store', 'content-language': 'en-US', 'x-content-type-options': 'nosniff', 'x-powered-by': 'Servlet/3.0', 'transfer-encoding': 'chunked', 'set-cookie': 'BIGipServerpl_www.treasurydirect.gov_443=3221581322.47873.0000; path=/; Httponly; Secure, TS01598982=016b0e6f4634928e3e7e689fa438848df043a46cb4aa96f235b0190439b1d07550484963354d8ef442c9a3eb647175602535b52f3823e209341b1cba0236e4845955f0cdcf; Path=/', 'strict-transport-security': 'max-age=31536000; includeSubDomains', 'keep-alive': 'timeout=10, max=100', 'connection': 'Keep-Alive', 'cache-control': 'no-store', 'date': 'Sun, 23 Apr 2017 04:13:00 GMT', 'x-frame-options': 'SAMEORIGIN', '$wsep': '', 'content-type': 'text/html;charset=ISO-8859-1'}
You need to add an Accept header to the request:
import requests
url = 'https://www.treasurydirect.gov/TA_WS/securities/announced/rss'
headers = {'accept': 'application/xml;q=0.9, */*;q=0.8'}
response = requests.get(url, headers=headers)
print(response.text)
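Once the feed comes back as XML, it can be parsed with the standard library, e.g. xml.etree.ElementTree; a small sketch assuming the usual RSS channel/item layout:

import xml.etree.ElementTree as ET

root = ET.fromstring(response.content)  # parse bytes so the feed's encoding declaration is honoured
# RSS nests entries under <channel><item>; print each item's title
for item in root.iter('item'):
    print(item.findtext('title'))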

Failed to upload file to server using Python Requests

I use Requests with Python 2.7 in order to fill in some values in a form and upload (submit) an image to the server.
I actually execute the script from the server pointing to the file I need to upload. The file is located in the /home directory and I have made sure it has full permissions.
Although I get a 200 Response, nothing is uploaded. This is part of my code:
import requests

try:
    headers = {
        "Referer": 'url_for_upload_form',
        "sessionid": sessionid  # retrieved earlier
    }
    files = {'file': ('doc_file', open('/home/test.png', 'rb'))}
    payload = {
        'title': 'test',
        'source': 'source',
        'date': '2016-10-26 02:13',
        'csrfmiddlewaretoken': csrftoken  # retrieved earlier
    }
    r = c.post(upload_url, files=files, data=payload, headers=headers)
    print r.headers
    print r.status_code
except:
    print "Error uploading file"
As I said I get a 200 Response and the headers returned are:
{'Content-Language': 'en', 'Transfer-Encoding': 'chunked', 'Set-Cookie': 'csrftoken=fNfJU8vrvOLAnJ5h7QriPIQ7RkI755VQ; expires=Tue, 17-Oct-2017 08:04:58 GMT; Max-Age=31449600; Path=/', 'Vary': 'Accept-Language, Cookie', 'Server': 'nginx/1.6.0', 'Connection': 'keep-alive', 'Date': 'Tue, 18 Oct 2016 08:04:58 GMT', 'X-Frame-Options': 'SAMEORIGIN', 'Content-Type': 'text/html; charset=utf-8'}
Does anyone have any idea what I am doing wrong? Am I missing something basic here?
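Side note: a 200 by itself doesn't prove the upload was accepted; many frameworks re-render the form with validation errors and still return 200. A quick diagnostic sketch, reusing the names from the snippet above:

r = c.post(upload_url, files=files, data=payload, headers=headers)
print r.status_code                # 200 can still mean "form re-rendered with errors"
print r.url                        # did the server redirect somewhere, or stay on the form?
print r.history                    # any intermediate redirects
print 'error' in r.text.lower()    # crude check for validation messages in the returned HTML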

Python code to do GET request from pipedrive API

I am using python-pipedrive to wrap Pipedrive's API, though it doesn't quite work out of the box on Python 3 (which I'm using), so I modified it. I'm having trouble with just the HTTP requests portion.
This is what taught me how to use Httplib2: https://github.com/jcgregorio/httplib2/wiki/Examples-Python3
Basically, I just want to send a GET request to this:
https://api.pipedrive.com/v1/persons/123?api_token=1234abcd1234abcd
This works:
from httplib2 import Http
from urllib.parse import urlencode

http = Http()
PIPEDRIVE_API_URL = "https://api.pipedrive.com/v1/persons/123?api_token=1234abcd1234abcd"
response, data = http.request(PIPEDRIVE_API_URL, method='GET',
                              headers={'Content-Type': 'application/x-www-form-urlencoded'})
However, Pipedrive returns an error 401 with 'You need to be authorized to make this request.' if I do this:
PIPEDRIVE_API_URL = "https://api.pipedrive.com/v1/"
parameters = 'persons/123'
api_token = '1234abcd1234abcd'
response, data = http.request(PIPEDRIVE_API_URL + parameters,
                              method='GET', body=urlencode(api_token),
                              headers={'Content-Type': 'application/x-www-form-urlencoded'})
The actual response is:
response =
{'server': 'nginx',
'status': '401',
'connection': 'keep-alive',
'set-cookie': 'pipe-session=7b6ddadbc67abdadb6a67dbadcb; path=/; domain=.pipedrive.com; secure; httponly',
'date': 'Sat, 11 Jun 2016 06:50:13 GMT',
'transfer-encoding': 'chunked',
'x-frame-options': 'SAMEORIGIN',
'content-type': 'application/json, charset=UTF-8',
'x-xss-protection': '1; mode=block'}
data =
{'success': False,
'error': 'You need to be authorized to make this request.'}
How do I properly provide the api_token as a parameter (body) to the GET request? Anyone know what I'm doing wrong?
You need to provide the api_token as a query parameter. Concatenate the strings like this:
PIPEDRIVE_API_URL = "https://api.pipedrive.com/v1/"
route = 'persons/123'
api_token = '1234abcd1234abcd'
response, data = http.request(PIPEDRIVE_API_URL + route + '?api_token=' + api_token,
                              method='GET',
                              headers={'Content-Type': 'application/x-www-form-urlencoded'})
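If you'd rather not build the query string by hand, the requests library (used in the other questions on this page) will encode and append it for you via the params argument; a small sketch with the same placeholder token:

import requests

PIPEDRIVE_API_URL = 'https://api.pipedrive.com/v1/'
route = 'persons/123'
api_token = '1234abcd1234abcd'

# requests URL-encodes the dict and appends it as ?api_token=...
response = requests.get(PIPEDRIVE_API_URL + route, params={'api_token': api_token})
print(response.status_code)
print(response.json())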

How to test if a link target is a pdf file if the link does not contain .pdf

I'm using selenium to scrape a bunch of files which are provided in a mix of formats and styles - trying to handle both html and pdf - and I've come across an issue when the target of a link is a pdf file but the link itself does not contain '.pdf' (note that of my two example links, one automatically downloads and one just displays the file - at least in Chrome - so there may need to be a test for two different types of pdf targets as well?)
Is there a way to tell programmatically if the target of a link is pdf that is more intelligent than just checking if it ends in .pdf?
I can't just download the file no matter the content, because I have distinct handling for the html files, where I want to follow secondary links and see if I can find pdfs, which won't work if the target is a pdf directly.
ETA: The accepted answer worked perfectly - the linked potential dupe is about testing files on the file system, not downloads, so I don't think that's valid, and the answer below is certainly better for this situation.
Selenium (or Chrome) checks the 'Content-Type' header and chooses what to do. You can also check the 'Content-Type' of a URL yourself using requests, like below:
>>> r = requests.head('https://resus.org.au/?wpfb_dl=17')
>>> pprint.pprint(dict(r.headers))
{'Accept-Ranges': 'bytes',
'Age': '8518',
'Cache-Control': 'no-cache, must-revalidate, max-age=0',
'Connection': 'keep-alive',
'Content-Description': 'File Transfer',
'Content-Disposition': 'attachment; '
'filename="anzcor-guideline-6-compressions-apr-2021.pdf"',
'Content-Length': '535677',
'Content-Md5': '90AUQUZu0vFGJ7cBPvRxcg==',
'Content-Security-Policy': 'upgrade-insecure-requests',
'Content-Type': 'application/pdf',
'Date': 'Wed, 19 Jan 2022 11:20:06 GMT',
'Expires': 'Wed, 11 Jan 1984 05:00:00 GMT',
'Last-Modified': 'Wed, 19 Jan 2022 08:58:08 GMT',
'Pragma': 'no-cache',
'Server': 'openresty',
'Strict-Transport-Security': 'max-age=300, max-age=31536000; '
'includeSubDomains',
'Vary': 'User-Agent',
'X-Backend': 'local',
'X-Cache': 'cached',
'X-Cache-Hit': 'HIT',
'X-Cacheable': 'YES:Forced',
'X-Content-Type-Options': 'nosniff',
'X-Xss-Protection': '1; mode=block'}
As you can see, the 'Content-Type' of your two links is 'application/pdf' in both cases:
>>> r.headers['Content-Type']
'application/pdf'
So you can just check the output of requests.head(link).headers['Content-Type'], and do whatever you need.
At the moment (Jan 19, 2022), the first link in your question redirects me to a 404 page. The second one is still accessible, but it needs to be requested over HTTPS, i.e. by changing the start of the link from http:// to https://.
But anyway, as long as the URL doesn't redirect you to another page, this answer isn't out of date. If it does redirect, request the newest URL after checking whether the status_code is a 301:
>>> r = requests.head('http://resus.org.au/?wpfb_dl=17')
>>> r.status_code
301
>>> r = requests.head('https://resus.org.au/?wpfb_dl=17')
>>> r.status_code
200
>>>
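Putting the two checks together, a small helper along these lines could decide the handling for each link (a sketch; allow_redirects=True is passed explicitly because requests.head() does not follow redirects by default, which covers the http:// to https:// 301 noted above):

import requests

def is_pdf(url):
    # HEAD request; follow redirects (e.g. the http:// -> https:// 301 above)
    r = requests.head(url, allow_redirects=True)
    return 'application/pdf' in r.headers.get('Content-Type', '')

if is_pdf('https://resus.org.au/?wpfb_dl=17'):
    print('pdf target: download it')
else:
    print('html target: follow secondary links and look for pdfs')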
