I have a use case where I need to fetch data from a specific site, which requires a session for every request. I have created the session in Python, and the cookies containing my login details are set.
I am currently hosting my script in a data center, but the account keeps getting blocked. I am considering requesting the data via a proxy, but if my session is created from a different machine and a proxy is then used to fetch data through that session, what are the chances that the proxy IP will also be blacklisted?
What are the possible solutions to address this kind of problem?
Behavior of a Python Requests session can differ by region.
For example, in some countries certain headers are not permitted in requests due to local laws.
Since you are already creating a logged-in session, your User-Agent and other headers are being set according to the response from the login request.
One possible solution is to use a proxy in a country that does not have strict rules against data extraction from that platform.
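Whichever proxy you choose, the key point for your blacklisting concern is to route *both* the login and every subsequent session request through the same proxy, so the session never appears to originate from two machines. A minimal stdlib sketch (the proxy address is a hypothetical placeholder):

```python
import urllib.request
from http.cookiejar import CookieJar

# Hypothetical proxy address; substitute a proxy located in the target region.
PROXY = "http://203.0.113.10:8080"

# An opener that keeps session cookies AND routes all traffic through the
# proxy, so the login and the later data requests appear to come from one IP.
# Building the opener performs no network I/O.
cookie_jar = CookieJar()
opener = urllib.request.build_opener(
    urllib.request.HTTPCookieProcessor(cookie_jar),
    urllib.request.ProxyHandler({"http": PROXY, "https": PROXY}),
)

# opener.open("https://example.com/login", data=...)  # would go through PROXY
```

With the requests library, the equivalent is setting the `proxies` attribute on the `Session` object before logging in, so every request in that session uses the proxy.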
Related
According to the Flask Documentation, I should be able to set a cookie with a domain path for domains besides my own like this:
resp = make_response(render_template('index.html'))
resp.set_cookie('cookiekey', 'cookievalue', domain='notmydomain.example.com')
I was able to set cookies for my domain with just
resp.set_cookie('cookiekey', 'cookievalue')
and they were accepted by the browser (Chrome). However, when I try to set the domain, they don't appear in the browser. Further, testing with Postman reveals that the Set-Cookie headers are sent and are correct.
Does this mean the browser is simply ignoring my request, and if so how can I get it to accept my Set-Cookie headers?
TL;DR: you can't set cookies for domains completely separate from your current domain.
Setting cookies for domains outside of your control would pose an immense security risk. The domain attribute only allows you to set cookies for either the whole domain or a subdomain. This is how, for example, a system can log you in via a subdomain such as "auth.example.org" then redirect you to "example.org".
In practice, "unified" sign-in systems are complicated: challenges are used and data might be exchanged through a backend, not relying on the browser to properly allow other subdomains to access the original cookie.
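To see what the browser is actually rejecting, it helps to look at the raw Set-Cookie header that Flask's `set_cookie()` ultimately emits. A small stdlib sketch (the names mirror the question; `example.org` is a placeholder):

```python
from http.cookies import SimpleCookie

# What Flask's set_cookie() ultimately produces is a Set-Cookie header.
# SimpleCookie from the stdlib lets us inspect that header directly.
cookie = SimpleCookie()
cookie["cookiekey"] = "cookievalue"

# Without a Domain attribute: a host-only cookie, valid for the serving host.
print(cookie.output())  # Set-Cookie: cookiekey=cookievalue

# With Domain=example.org, a response from auth.example.org may set a cookie
# for the whole registrable domain -- but the browser silently discards it
# if example.org is not a parent of the host that sent the response.
cookie["cookiekey"]["domain"] = "example.org"
print(cookie.output())
```

So the header you saw in Postman is well-formed; the rejection happens entirely on the browser side when the Domain value does not match the responding host.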
I am fairly proficient in Python and have started exploring the requests library to formulate simple HTTP requests. I have also taken a look at Sessions objects that allow me to login to a website and -using the session key- continue to interact with the website through my account.
Here comes my problem: I am trying to build a simple API in Python to perform certain actions that I can already do via the website. However, I do not know what certain HTTP requests need to look like in order to implement them via the requests library.
In general, when I know how to perform a task via the website, how can I identify:
the type of HTTP request (GET or POST will suffice in my case)
the URL, i.e where the resource is located on the server
the body parameters that I need to specify for the request to be successful
This has nothing to do with Python itself, but you can use an intercepting proxy to examine your requests:
Download a proxy such as Burp Suite
Set up your browser to route all traffic through Burp Suite (the default is localhost:8080)
Deactivate packet interception (in the Proxy tab)
Browse to your target website normally
Examine the request history in Burp Suite. You will find all the information you need there.
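Once the proxy has shown you the method, URL, and body of a request, reproducing it in Python is mechanical. A minimal stdlib sketch (the endpoint and form fields below are hypothetical placeholders for values you would copy out of the request history):

```python
import urllib.parse
import urllib.request

# Values you would copy out of the proxy's request history:
url = "https://example.com/api/action"            # hypothetical endpoint
body = urllib.parse.urlencode({"item": "42"})     # hypothetical form fields
headers = {"Content-Type": "application/x-www-form-urlencoded"}

# Passing data= makes this a POST; omit it for a GET.
req = urllib.request.Request(url, data=body.encode(), headers=headers)
print(req.get_method())  # POST
# urllib.request.urlopen(req)  # would actually send it
```

With the requests library, the equivalent one-liner is `requests.post(url, data={"item": "42"})`.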
So I'm currently learning the python requests module but I'm a bit confused and was wondering if someone could steer me in the right direction. I've seen some people post headers when they want to log into the website, but where do they get these headers from and when do you need them? I've also seen some people say you need an authentication token, but I've seen some other solutions not even use headers or an authentication token at all. This is supposedly the authentication token but I'm not sure where to go from here after I post my username and password.
<input type="hidden" name="lt" value="LT-970332-9KawhPFuLomjRV3UQOBWs7NMUQAQX7" />
Although your question is a bit vague, I'll try to help you.
Authentication
A web browser (client) can authenticate on the target server by providing data, usually a login/password pair, typically encoded for security reasons.
This data can be passed from client to server using the following parts of HTTP request:
URL parameters (http://httpbin.org/get?foo=bar)
headers
body (this is where POST parameters from HTML forms usually go)
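As an illustration, here is how each of those three locations looks when building a request with the stdlib (all names and values below are made up for the example):

```python
import urllib.parse
import urllib.request

# 1. URL parameters: encoded straight into the query string.
query = urllib.parse.urlencode({"foo": "bar"})
url = "http://httpbin.org/get?" + query

# 2. Headers: sent as key/value pairs alongside the request line.
headers = {"Authorization": "Bearer not-a-real-token"}

# 3. Body: where POST parameters from HTML forms usually go.
body = urllib.parse.urlencode({"login": "alice", "password": "secret"}).encode()

req = urllib.request.Request(url, data=body, headers=headers)
print(req.full_url)      # http://httpbin.org/get?foo=bar
print(req.get_method())  # POST (implied by the presence of a body)
```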
Tokens
After successful authentication, the server generates a unique token and sends it to the client. If the server wants the client to store the token as a cookie, it includes a Set-Cookie header in its response.
A token usually represents a unique identifier of a user session. In most cases a token has an expiration date for security reasons.
Web browsers usually store tokens as cookies in internal cookie storage and use them in all subsequent requests to the corresponding website. A single website can use multiple tokens and other cookies for a single user.
Research
Every website has its own authentication format, rules, and restrictions, so the first thing you need to do is a little research on the target website. You need to find out how the client sends authentication information to the server, what the server replies, and where the session data is stored (usually you can find it in the client's request headers).
In order to do that, you may use a proxy (Burp for example) to intercept browser traffic. It can help you to get the data passed from client to server and back.
Try to authenticate and then browse some pages on the target site using your web browser with the proxy. Then, using the proxy, examine which parts of the HTTP requests and responses the client and server use to store information about sessions and authentication.
After that you can finally use python and requests to do what you want.
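Putting that research together, a typical flow looks like the sketch below. Everything here is hypothetical except the `lt` hidden-input token quoted in the question; the login URL, field names, and page structure all depend on the target site, and a real page is better parsed with an HTML parser than a regex:

```python
import re

def extract_hidden_token(html, name):
    """Pull the value of a hidden <input> (e.g. the 'lt' token) out of a login page."""
    match = re.search(r'name="%s"\s+value="([^"]*)"' % re.escape(name), html)
    return match.group(1) if match else None

# The hidden input quoted in the question:
page = '<input type="hidden" name="lt" value="LT-970332-9KawhPFuLomjRV3UQOBWs7NMUQAQX7" />'
token = extract_hidden_token(page, "lt")
print(token)  # LT-970332-9KawhPFuLomjRV3UQOBWs7NMUQAQX7

# With the token in hand, the login POST would look something like this
# (a requests.Session keeps the resulting session cookie for you):
#
#   session = requests.Session()
#   resp = session.post("https://example.com/login",
#                       data={"username": "...", "password": "...", "lt": token})
```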
I am using google app engine's urlfetch feature to remotely log into another web service. Everything works fine on development, but when I move to production the login procedure fails. Do you have any suggestions on how to debug production URL fetch?
I am using cookies and other headers in my URL fetch (I manually set up the cookies within the header). One of the cookies is a session cookie.
There is no error or exception. On production, posting login credentials to the URL returns the session cookies, but when you request a page using those session cookies, they are ignored and you are prompted for login information again. On development, once you get the session cookies you can access the internal pages just fine. I thought the problem was related to saving the cookies, but they look correct, as the requests are nearly identical.
This is how I call it:
fetchresp = urlfetch.fetch(url=req.get_full_url(),
                           payload=req.get_data(),
                           method=method,
                           headers=all_headers,
                           allow_truncated=False,
                           follow_redirects=False,
                           deadline=10)
Here are some guesses as to the problem:
The distributed nature of google's url fetch implementation is messing things up.
On production, headers are sent in a different order than in development, perhaps confusing the server.
Some of google's servers are blacklisted by the target server.
Here are some hypotheses that I've ruled out:
Google caching is too aggressive. But I still get the problem after turning off cache by using the header Cache-Control: no-store.
Google's urlfetch is too fast for the target server. But I still get the problem after inserting delays between calls.
Google appends some data to the User-Agent header. But I have added that header to development and I don't get the problem.
What other differences are there between the production URL fetch and the development URL fetch? Do you have any ideas for debugging this?
UPDATE 2
(First update was incorporated above)
I don't know if it was something I did (maybe adding delays or disabling caches mentioned above) but now the production environment works about 50% of the time. This definitely looks like a race condition. Unfortunately, I have no idea if the problem is in my code, google's code, or the target server's code.
As others have mentioned, the key differences between dev and prod are the originating IP, and how some of the request headers are handled. See here for a list of restricted headers. I don't know if this is documented, but in prod, your app ID is appended to the end of your user agent. I had an issue once where requests in prod only were getting detected as a search engine spider because my app ID contained the string "bot".
You mentioned that you're setting up cookies manually, including the session cookie. Does this mean that you established a session in Dev, and then you're trying to re-use it in prod? Is it possible that the remote server is logging the source IP that establishes a session, and requiring that subsequent requests come from the same IP?
You said that it doesn't work, but you don't get an exception. What exactly does this mean? You get an HTTP 200 and an empty response body? Another HTTP status? Your best bet may be to contact the owners of the remote service and see if they can tell you more specifically what was wrong with your request. Anything else is just speculation.
Check your server's logs to see if GAE is chopping any headers off. I've noticed that GAE (though I think I've seen it on the dev server as well) will strip headers it doesn't like.
Depending on the web service you're calling, it might also be less tolerant of requests from GAE than of requests from your local machine.
I ran across this while building a webapp with an analogous problem. Looking at urlfetch's documentation, it turns out that the maximum timeout for a fetch call is 60 seconds, but it defaults to 5 seconds.
5 seconds on my local machine was long enough to request URLs on my local machine, but on GAE it was only consistently completing its task in 5 seconds about 20% of the time.
I included the parameter deadline=60 and it has been working fine since.
Hope this helps others!
I have a series of different domain names that I would like to all point (via URL forwarding from my domain host) to a Google App Engine application that reads what the forwarding URL is. So if the domain typed in was originally XYZ.com, then when I am forwarded to my application, I can return what that original domain name was. I'm using the Python variant. How best can I do this without coding for each and every variant?
Generally, the target of a 301/302 redirect can't determine what URL issued the redirect. If the user is redirected by client-side code, the referring page should be present in the "Referer" request header. For server-side redirects, I don't believe it's standard for user agents to populate (or override) the Referer header.
If you want to point multiple domains to your App Engine app, try configuring them as custom domains rather than forwards. With this route, the custom domain would stay in the user's address bar, and you can simply check the host header to see which custom domain the visitor is using.
If by forwarding you mean HTTP redirection, you can check the Referer header.
If you mean DNS resolution (e.g. distinguishing between your application being invoked via your own domain and the .appspot.com one), there is a SERVER_NAME environment variable (os.environ["SERVER_NAME"]) that stores the domain (e.g. www.example.com) used to issue the request.
If you are using a JavaScript redirect, you can just check the referrer:
http://en.wikipedia.org/wiki/HTTP_referrer
but this is not a 100% solution.
If you have multiple domains parked on one appengine instance and you just want to know which one the user is viewing, you can check the host name in the request object.
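For the "multiple domains parked on one instance" case, reading the host out of the request is enough, and it works the same for every domain you add later. A minimal framework-agnostic WSGI sketch (the domain names are placeholders; GAE's request object exposes the same Host header):

```python
def app(environ, start_response):
    # HTTP_HOST is the Host header the browser sent; SERVER_NAME is the
    # fallback value the server was configured with.
    host = environ.get("HTTP_HOST") or environ.get("SERVER_NAME", "")
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [("You reached us via: " + host).encode()]

# Simulated request -- no real server needed to see the branching work.
fake_environ = {"HTTP_HOST": "www.example.com"}
body = app(fake_environ, lambda status, headers: None)
print(body[0].decode())  # You reached us via: www.example.com
```

Because the host is read per-request, no per-domain code is needed: every parked domain hits the same handler and identifies itself.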