I have a Python script that needs to download a CSV file from:
https://myasx.asx.com.au/home/watchlist/download.do
The issue I have is that you have to log in to the website first; it uses cookie-based authentication (an HTML form login).
So far I have looked at urllib2 and Requests and haven't had much luck.
The requests library should do what you want. You can use Session objects to persist the authentication.
To quote the Requests docs:
The Session object allows you to persist certain parameters across requests. It also persists cookies across all requests made from the Session instance.
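For example, here is a rough sketch of a Session-based login followed by the CSV download. The login URL and form field names are guesses; check the site's actual login form in your browser's dev tools:

import requests

# Hypothetical login URL and form field names -- inspect the real login form first.
LOGIN_URL = 'https://myasx.asx.com.au/login.do'
CSV_URL = 'https://myasx.asx.com.au/home/watchlist/download.do'

with requests.Session() as session:
    # The session keeps the authentication cookies set by the login response.
    session.post(LOGIN_URL, data={'username': 'your_user', 'password': 'your_password'})

    # Later requests reuse those cookies automatically.
    response = session.get(CSV_URL)
    response.raise_for_status()

    with open('watchlist.csv', 'wb') as f:
        f.write(response.content)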
Post your code if you are still experiencing problems.
I am confused on this particular topic. I built a bot for two different websites using Python's requests module to manually simulate sending HTTP POST and GET requests.
I implemented SOCKS proxies and also used user agents in my requests, as well as referrer URLs when necessary (I verified the actual requests sent by a browser on these sites using Burp Suite), in order to make them look genuine.
However, any accounts I run through my bots keep getting suspended. It got me wondering what I'm doing wrong. A friend suggested that maybe I should use one of those headless solutions (PhantomJS), and I am leaning towards that route, but I am still confused and would like to know what the difference is between using the HTTP requests module and using a headless browser like PhantomJS.
I am not sure if there is any need to paste my source code here. Just looking for some direction on this project. Thank you for taking your time to read such a long wall of text :)
Probably, you have to set cookies.
To make your requests look more genuine, you should set other headers such as Host and Referer. However, the Cookie header should change every time. You can get the cookies this way:
from requests import Session

with Session() as session:
    # Send a first request to get the cookies the site sets.
    response = session.get('your_url', headers=your_headers, proxies=proxies)  # optionally add the params keyword
    cookies = response.cookies.get_dict()
    # Send those cookies back on the next request.
    response = session.get('your_url', headers=your_headers, cookies=cookies, proxies=proxies)
Or maybe the site is detecting bots in some other way.
In this case, you could try adding a delay between requests with time.sleep(). You can see the timings in the Dev Tools of your browser. Alternatively, you could emulate all the requests your browser sends when it connects to the site, such as AJAX calls, etc.
In my experience, using requests or using Selenium webdrivers doesn't make much difference in terms of detection, because with plain Selenium you can't access the headers, or even the request and response data. Also, note that PhantomJS is no longer maintained; it's preferable to use headless Chrome instead.
If none of the requests-based approaches work, I suggest using selenium-wire or Mobilenium, modified versions of Selenium that allow access to request and response data.
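For example, a minimal selenium-wire sketch (this assumes Chrome and chromedriver are installed, and the URL is just a placeholder):

from seleniumwire import webdriver  # pip install selenium-wire

driver = webdriver.Chrome()
try:
    driver.get('https://example.com')

    # Unlike plain Selenium, selenium-wire records the requests the browser made,
    # including headers and response data.
    for request in driver.requests:
        if request.response:
            print(request.url, request.response.status_code)
finally:
    driver.quit()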
Hope it helps.
I want to access the browser name and version in Python by sending out a request. Is this the ideal method, or is there any other way? Because all the methods which provide user agents give PythonUserlib2.7 as the user agent, and I want my actual user agent.
I'll assume you're familiar with HTTP requests and their structure; if not, here's a link to the RFC documentation for HTTP/1.1 requests, and at the bottom of the page there is a list of links to the header fields.
The user-agent is a field in the HTTP request header that identifies the entity that sends the request; by entity I mean the program you used to send the request, running on your machine. Things like the browser type, version, and operating system are sent in the user-agent field.
So, when you use urllib.request to send a request, urllib fills the HTTP request headers with the values you provide to it, otherwise, default values are used. That's why you get PythonUserLib2.7 as a user-agent.
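For example, a small sketch of overriding that default (the target URL and the User-Agent string are just illustrations; httpbin.org/user-agent echoes back the header it received):

import urllib.request  # on Python 2.7 the same idea works with urllib2

req = urllib.request.Request(
    'https://httpbin.org/user-agent',
    headers={'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64)'},  # illustrative value only
)
print(urllib.request.urlopen(req).read().decode())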
If you need the user-agent of a specific browser, you need to send the request using that browser. You can do that in Python by using a browser automation tool, like Selenium WebDriver, which you can use to launch an instance of your browser and go to websites.
I've worked only with Selenium WebDriver, and it doesn't have the capability to inspect sent/received packets/requests; in other words, you can't get the HTTP requests/responses directly from Selenium.
As a workaround, you can use Selenium (or any other automation tool) to launch your browser, then go to a website that will show you your user-agent, and may even parse it for you.
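For example, a minimal Selenium sketch (assumes Chrome and chromedriver are installed; httpbin.org/user-agent is just one site that reports the header back):

from selenium import webdriver

driver = webdriver.Chrome()
try:
    # Visit a page that echoes back the User-Agent header the browser sent.
    driver.get('https://httpbin.org/user-agent')
    print(driver.page_source)

    # Or ask the browser itself, without relying on an external site.
    print(driver.execute_script('return navigator.userAgent'))
finally:
    driver.quit()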
Here's a link to selenium documentation, it explains how to get started with selenium and how to download the required packages.
If you search Google for your user agent, Google will tell you what your user agent is.
I am developing a WSGI middleware application (Python 2.7) using Werkzeug. This app works within a SAML SSO environment and needs a SAML token to be accessed.
The middleware also performs requests to other applications in the same SAML environment, acting on behalf of the logged in user. In order to do that without the need of user feedback, I need to forward the SAML session cookie that I can get from the WSGI environment to requests that I am performing using the Requests library.
My issue is that the cookies that I get from WSGI/Werkzeug can only be parsed as http.cookies.SimpleCookie, while Requests accepts cookielib.CookieJar instances.
I have not found a way to cleanly forward these session cookies without resorting to shameful hacks such as parsing the raw content of the set-cookie headers.
Any suggestions?
Thanks,
gm
Cookies are just HTTP headers. Just pull the cookie value from http.cookies.SimpleCookie and add it to your requests session's cookie jar.
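A rough sketch of that idea (the raw Cookie header is hard-coded here for illustration; in the middleware it would come from environ['HTTP_COOKIE'] or the Werkzeug request, and the cookie names are hypothetical):

from http.cookies import SimpleCookie  # the Cookie module on Python 2.7
import requests

# In the WSGI app this would be environ['HTTP_COOKIE']; hard-coded for illustration.
raw_cookie_header = 'SAMLSessionID=abc123; other=value'
parsed = SimpleCookie(raw_cookie_header)

session = requests.Session()
for name, morsel in parsed.items():
    # Copy each cookie into the requests session's cookie jar.
    session.cookies.set(name, morsel.value)

# session now sends those cookies with every request it makes on the user's behalf.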
Not a hack. :)
In addition to sending cookies with my python requests request I would also like to send a localstorage key value pair.
I tried looking at the requests docs and it does not look like it is capable of doing this, is it?
Is there a way for me to do this without using a headless browser?
Local storage is not sent with requests; it is only accessible via JavaScript in client-side code. This website has more information about the differences between local storage and cookies.
I have an application with many users, some of these users have an account on an external website with data I want to scrape.
This external site has a members area protected with a email/password form. This sets some cookies when submitted (a couple of ASP ones). You can then pull up the needed page and grab the data the external site holds for the user that just logged in.
The external site has no API.
I envisage my application asking users for their credentials to the external site, logging in on their behalf and grabbing the data we want.
How would I go about this in Python, i.e. do I need to run a GUI web browser on the server that Python prods to handle the cookies (I'd rather not)?
Find the call the page makes to the backend by inspecting the format of the login request in your browser's inspector.
Make the same request, after using either getpass to get the user's credentials from the terminal or via a GUI. You can use urllib2 to make the requests (see the sketch after these steps).
Save all the cookies from the response in a cookiejar.
Reuse the cookies in subsequent requests and fetch data.
Then, profit.
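A rough sketch of those steps with urllib2 and cookielib (the URLs and form field names are placeholders; check what the login request actually looks like in your browser's inspector first):

import cookielib
import getpass
import urllib
import urllib2

# Placeholder URLs and field names -- replace with what the inspector shows.
LOGIN_URL = 'https://members.example.com/login.aspx'
DATA_URL = 'https://members.example.com/account/data'

cookie_jar = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie_jar))

# Replicate the login POST; the ASP session cookies end up in cookie_jar.
credentials = urllib.urlencode({
    'email': raw_input('Email: '),
    'password': getpass.getpass('Password: '),
})
opener.open(LOGIN_URL, credentials)

# The same opener sends those cookies with every subsequent request.
page = opener.open(DATA_URL)
print(page.read())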
Usually, this is done with a session.
I recommend using the requests library (http://docs.python-requests.org/en/latest/) to do that.
You can use the Session feature (http://docs.python-requests.org/en/latest/user/advanced/#session-objects). Simply perform an authentication HTTP request (the URL and parameters depend on the website you want to query), and then perform a request for the resource you want to scrape.
Without further information, we cannot help you more.