I'm trying to scrape a website using requests. However, a post method that I need to use requires the headers below. I can fill everything apart from the JSESSION ID. The only way I can get this post method to work is if I manually go into the browser, start a session and inspect the page to retrieve the JSESSIONID.
I am looking for a way to retrieve this JSESSIONID using the requests package in python. I saw some suggestions for using a session. However, the requests session does not grab the JSESSIONID, which is the only thing I need. How should I go about a possible solution?
Host:
Connection:
Content-Length:
Accept:
X-Requested-With:
User-Agent:
Content-Type:
Sec-GPC:
Origin:
Sec-Fetch-Site:
Sec-Fetch-Mode:
Sec-Fetch-Dest:
Referer:
Accept-Encoding:
Accept-Language:
Cookie: _1aa19=; JSESSIONID=;
What I currently tried is use a session from the requests package, which should store the cookies of the session. However, After I use a .get method requests.cookies does not have the JSESSIONID stored
query = 'Example%20query'
s = requests.Session()
suggest = s.get(f'https://www.examplewebsite.nl/api_route/suggest?query={query}').json()
s.cookies
JSESSIONID is generated when you go to https://www.examplewebsite.nl page first.
import requests
query = 'Example%20query'
s = requests.Session()
s.get('https://www.examplewebsite.nl')
suggest = s.get(f'https://www.examplewebsite.nl/api_route/suggest?query={query}').json()
print(s.cookies.get("JSESSIONID"))
Related
Im trying to login to this website, seeking.com/login through scrapy shell. i also installed burp suite to analyze its url and headers, etc.
from scrapy.http import FormRequest
frmdata = {"captcha":"","email":"MYEMAIL.com","password":"MY_PASSWORD","is_rememberme":"0","locale":"en_US","auth_type":"bearer_token","date":"2018-12-13T09:56:22.957Z"}
url = "https://www.seeking.com/v3/auth/login"
r = FormRequest(url, formdata=frmdata)
fetch(r)
with this code i get a HTTP 401 Error, as far as i can tell essentially an authentication error.
I forwarded the calls through burpsuite and got the following intercept.
POST /v3/auth/login HTTP/1.1
Host: www.seeking.com
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:63.0)
Gecko/20100101 Firefox/63.0
Accept: application/json, text/plain, */*
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Referer: https://www.seeking.com/login?fromLogout=1
Content-Type: application/json;charset=utf-8
Web-Version: 3.59
Authorization: Basic NTI2ZTIwYzExMDI0NDYzNTk5OTI5MzUwZThiNWEzMTI6bHN0emd4ZzpSRzRzS3VmdEJMRTQxMm92TnMxbDR6L0ZkZ1dESHZuM2wwZWxtYWhyMGtnPQ==
Content-Length: 166
Connection: close
Cookie: __cfduid=dcf9fd66583d55382f362c18a83d904ca1544519479;
_gcl_au=1.1.2035701377.1544519485; _ga=GA1.2.1740241044.1544519486;
com.silverpop.iMAWebCookie=e88c45d1-3c24-11c6-089e-e287aae2c678;
__cfruid=3eebbdc1e401ed560c23a7c474c41e59b2e93018-1544520179;
device_cookie=1; __gads=ID=a1e437c03ddad1b3:T=1544519579:S=ALNI_MYb30xY4z76J4NniCK_ZtOyOdPMKA;_lb_user=gfpuzje6kg; seeking_session=eyJpdiI6Im4yMTNJNVNRZjkxbnZzMmNpYnQ4dkE9PSIsInZhbHVlIjoiVGhGVUJDejc1dElJbEwxekh5d2hXUnhjeDlpVWR2dW9IWWJqeDZvRmI3VU9Pc1lpZXZGWGJxejQ1alNXbGVXUGJqaEpORU9LNFJITVh0N3IwR1E0bUE9PSIsIm1hYyI6IjUyODU3MWIxYjM3MGU3M2E0YjI1YzM2MzNmNDc5ZDMzZDdjYTg1ZWMxYWU2ODJjY2JlMTJmZWJlNmUyZDkyNWMifQ%3D%3D {"captcha":"","email":"MYEMAIL","password":"MYPASS","is_rememberme":0,"locale":"en_US","auth_type":"bearer_token","date":"2018-12-14T09:15:56.016Z"}
I am completely new to this, and have spent 2 days trying to figure out what i need to pass to this POST to login.
My question is
1) based on this intercept what should my request via FormRequest look like?
2) I see there are cookies/authorization (Authorization token, that changes with each POST, session cookies, etc) tokens that are being passed in to the post... Where do they come from? How do i get them when i am scraping so that i can successfully login?
3) Do i need to store these session variables when scraping other pages on the site after login? Anything special i need to do to stay logged in to access other pages?
It looks like the login page is expecting to be passed soon data, and not a url-encoded string (which is what FormRequest will create).
Something like this should work:
r = scrapy.Request(
url=url,
method='POST',
body=json.dumps(frmdata),
headers={'Content-Type': 'application/json'},
)
The tokens, cookies, etc. are probably created when you initially request the login page, so you might need to request the login page before trying to log in.
It is possible that some of it is generated with javascript (haven't checked), so you might need to dig through the js code to figure out what's going on, or even execute the js yourself (e.g. using a browser).
Scrapy will keep track of your session for you, so there's nothing you need to do to stay logged in.
I am trying to do a post request on python utilizing the requests library, when I set my custom headers which are the following:
User-Agent: MBAM-C
Content-Type: application/json
Authorization: True
Content-Length: 619
Connection: Close
However when It send the request with the custom headers it adds its own headers which give a bad request response from the server..
User-Agent: MBAM-C
Accept-Encoding: gzip, deflate
Accept: */*
Connection: Close
Content-Type: application/json
Authorization: True
Content-Length: 559
It is due to the design goals of the requests project.
This behavior is documented here. You may want to use a lower level library, if it is problematic for the library to be correcting content length or adding desirable headers. Requests bills itself as: "an elegant and simple HTTP library for Python, built for human beings." and a part of that is advertising that it can accept compressed content and all mime types.
Note: Custom headers are given less precedence than more specific sources of information. For instance:
Authorization headers set with headers= will be overridden if credentials are specified in .netrc, which in turn will be overridden by the auth= parameter.
Authorization headers will be removed if you get redirected off-host.
Proxy-Authorization headers will be overridden by proxy credentials provided in the URL.
Content-Length headers will be overridden when we can determine the length of the content.
I know there are tons of ways to add headers or cookies something like this. But what I want to do is to add "\r\n" on the top of the request so as to look like the following body.
Request Body >>
\r\n <-- technically invisible..
GET /path/ HTTP/1.1
Host: www.website.com
Connection: keep-alive
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.22 (KHTML, like Gecko) Chrome/25.0.1364.97 Safari/537.22
Referer: https://www.google.com/
Accept-Encoding: gzip,deflate,sdch
Accept-Language: en-US,en;q=0.8,ko;q=0.6
Accept-Charset: windows-949,utf-8;q=0.7,*;q=0.3
\r\n is added on the first line of the GET request as you can see.
It's like adding an empty line.
How can I do this in Python?
I've spent hours on this topic but couldn't find any useful resources.
===================== ADD ============================================
It's about hacking.
In South Korea, Government restricts some sites, but the filters preventing users from connecting to the sites can easily be evaded by just adding "\r\n" on the top of the request body.
httplib2, httplib, urllib, urllib2, etc.. etc..
Whatever library to be used, I just need to add "\r\n" to the request body.
You could do this by monkeypatching the httplib.HTTPConnection class; urllib, urllib2, requests etc. all use that class to handle the low-level HTTP conversation.
The easiest is to patch the HTTPConnection._output() method to insert the extra characters before a HTTP version message:
from httplib import HTTPConnection, _CS_REQ_STARTED
orig_output = HTTPConnection._output
def add_initial_newline_output(self, s):
if (self._HTTPConnection__state == _CS_REQ_STARTED and
s.endswith(self._http_vsn_str) and not self._buffer):
self._buffer.append('') # will insert extra \r\n
orig_output(self, s)
HTTPConnection._output = add_initial_newline_output
This will only insert the extra starting empty line when the connection is in the correct state (request started), the line ends with the current HTTP version string, and the buffer is still empty.
so I have made a request to a server with pythons request library. The code looks like this (it uses an adapter so it needs to match a certain pattern)
def getRequest(self, url, header):
"""
implementation of a get request
"""
conn = requests.get(url, headers=header)
newBody = conn.content
newHeader = conn.headers
newHeader['status'] = conn.status_code
response = {"Headers" : newHeader, "Body" : newBody.decode('utf-8')}
self._huddleErrors.handleResponseError(response)
return response
the header parameter I am parsing in is this
{'Authorization': 'OAuth2 handsOffMyToken', 'Accept': 'application/vnd.huddle.data+json'}
however I am getting an xml response back from the server. After checking fiddler I see the request being sent is:
Accept-Encoding: identity
Accept: */*
Host: api.huddle.dev
Authorization: OAuth2 HandsOffMyToken
Accept: application/vnd.huddle.data+json
Accept-Encoding: gzip, deflate, compress
User-Agent: python-requests/1.2.3 CPython/3.3.2 Windows/2008ServerR2
As we can see there are 2 Accept Headers! The requests library is adding in this Accept:* / * header which is throwing off the server. Does anyone know how I can stop this?
As stated in comments it seems this is a problem with the requests library in 3.3. In requests there are default headers (which can be found in the utils folder). When you don't specify your own headers these default headers are used. However if you specify your own headers instead requests tries to merge the headers together to make sure you have all the headers you need.
The problem shows its self in def request() method in sessions.py. Instead of merging all the headers it puts in its headers and then chucks in yours. For now I have just done the dirty hack of removing the accept header from the default headers found in util
I am trying to fetch data from a webpage using urllib2. The page is visible on the browser but through the script I keep getting HTTPError: HTTP Error 403: Forbidden
I also tried mimicking a browser request by changing the user-agent string but no success.
Any ideas on this?
I tried with tamper data and firefox to send only user agent, and I get 403.
Try to add other headers:
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.5
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 115
Connection: keep-alive
I tried, and this should work.
The site is checking your User-Agent just set it to Internet Explorer:
request.add_header('User-Agent', 'Internet Explorer')
I confirmed that this works with wget, and you get 403 unless you set your user agent to Internet Explorer.
:) Am trying to get quotes from NSE too ! like pythonFoo says you need additional headers. Hower only Accept is sufficient.
The user-agent can say python ( stay true ! )