I'd like to retrieve data from a specific webpage using the urllib library.
The problem is that in order to open this page, some data has to be sent to
the server first. If I do it in IE, I first need to tick some checkboxes and
then press the "display data" button, which opens the desired page.
Looking at the source code, I see that pressing "display data" submits some kind of
form; there is no specific URL address there. I cannot figure out from the code
what parameters are sent to the server.
I think the simpler way to do this would be to analyze the communication
between IE and the web server after pressing the "display data" button.
If I could see explicitly what IE does, I could mimic it with urllib.
What is the easiest way to do that?
An HTTP debugging proxy would be the best tool to use in this situation. Since you're using IE, I recommend Fiddler: it is developed by Microsoft and integrates with Internet Explorer automatically through a plugin. I use Fiddler all the time myself, as I'm building an app that mimics a user's browsing session with a website, and it is a really helpful tool. Fiddler offers very good inspection of request parameters and responses, and can even decrypt HTTPS traffic.
You can use a web debugging proxy (e.g. Fiddler, Charles), a browser add-on (e.g. HttpFox, TamperData), or a packet sniffer (e.g. Wireshark).
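Once the proxy or sniffer shows you the target URL and the form fields, replaying the submission with urllib is straightforward. Here is a minimal sketch; the endpoint and field names are hypothetical stand-ins for whatever you capture:

import urllib.parse
import urllib.request

# Hypothetical values -- substitute the URL and form fields you saw in the proxy.
url = 'http://example.com/display_data'
form_fields = {
    'checkbox1': 'on',       # the checkboxes ticked in IE
    'checkbox2': 'on',
    'action': 'display data',
}

# Encode the fields and send them; supplying data makes this a POST request.
data = urllib.parse.urlencode(form_fields).encode('ascii')
request = urllib.request.Request(url, data=data)
with urllib.request.urlopen(request) as response:
    print(response.read().decode('utf-8', errors='replace'))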
I have been playing with the requests module in Python for a while as part of studying HTTP requests and responses, and I think I have grasped most of the fundamentals of the topic. As a naive analogy, it basically works on a ping-pong principle: you send a request packet to the server, and it sends another packet back to you. For instance, logging in to a site is simply sending a POST request to the server, and I managed to do that. What I am having trouble with, however, is clicking buttons through an HTTP POST request. I searched for it here and there, but I could not find a valid answer to my inquiry other than using the selenium module, which I would rather avoid if there is a way to do it with requests as well. I am aware that selenium was created for exactly this kind of thing.
QUESTIONS:
1) What parameters do I have to take into account to be able to click buttons or links from the account I accessed through HTTP requests? For instance, when I watch the network activity with my browser's built-in inspect tool, I see a great many request headers sent by the browser, e.g. sec-fetch-dest, sec-fetch-mode, etc.
2) Is this too complicated for a beginner, or is there so much advanced stuff going on behind the scenes that selenium was created precisely for this reason?
Theoretically, you could write a program to do this with requests, but you would be duplicating much of the functionality that is already built and optimized in other tools and APIs. The general process would be:
Fetch the HTML that is normally rendered in your browser using a GET request.
Process the HTML to find the button in question.
Then, if it's a simple form:
Determine the request method the button triggers (e.g. the form's method attribute or the button's formmethod attribute), as sketched in the example after this list.
Perform the specified request with the required information in your request packet.
If it's a complex page (i.e. it uses JavaScript):
Find the button's unique identifier.
Process the JavaScript code to determine what action is performed when the button is clicked.
If possible, perform the JavaScript action using requests (e.g. following a link or something like that). I say if possible because JavaScript can do many things that, to my knowledge, simple HTTP requests cannot, like changing rendered CSS in order to change the background color of a <div> when a button is clicked.
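For the simple-form case, here is a minimal sketch of the whole process with requests and Beautiful Soup. The page URL is hypothetical, and the sketch assumes the page contains a single form:

import requests
from bs4 import BeautifulSoup

# Hypothetical URL -- replace with the page that contains the button.
page_url = 'http://example.com/form_page'

# Step 1: fetch the HTML that the browser would render.
html = requests.get(page_url).text

# Step 2: locate the form that the button submits.
soup = BeautifulSoup(html, 'html.parser')
form = soup.find('form')
action = form.get('action', page_url)       # where the form submits to
method = form.get('method', 'get').lower()  # GET unless stated otherwise

# Step 3: collect the form's named inputs (hidden fields often carry
# required tokens, so include them all).
payload = {
    inp['name']: inp.get('value', '')
    for inp in form.find_all('input')
    if inp.get('name')
}

# Step 4: perform the request the button would have triggered.
target = requests.compat.urljoin(page_url, action)
if method == 'post':
    response = requests.post(target, data=payload)
else:
    response = requests.get(target, params=payload)
print(response.status_code)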
You are much better off using a tool like selenium or Beautiful Soup, as they provide APIs that do a lot of the above for you. If you've used the requests library to learn about the basic HTTP request types and how they work, awesome. Now move on to the plethora of excellent tools that wrap requests up into a more functional and robust API.
I want to write a Python script for a website that requires a login to enable some features, and I want to find out what I need to put in the headers of my script's requests (e.g. an authentication token and other parameters) so that they are executed the same way as requests from the browser.
Does Wireshark help with this if the website uses HTTPS?
Or is my only option executing a browser script with Selenium after a manual login?
For anyone else with the same issue: you don't need to track your traffic from outside the browser. Just...
use Google Chrome
open the developer tools
click on the Network tab
clear the data
and make a request in the tab the dev tools are open for
You should see the initial request at the top, followed by subsequent ones (advertising, external image servers, etc.). You can right-click the initial request, save it as a .har file, and use something like https://toolbox.googleapps.com/apps/har_analyzer/ to extract the headers of both the request and the response.
Now you know what parameters (key and value) you need in your headers, and you can even use submitted values like tokens and cookies in your Python script.
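A minimal sketch of what that looks like in a Python script; the header and cookie values below are hypothetical placeholders for whatever the HAR analyzer showed you:

import requests

# Hypothetical values copied out of the HAR analyzer -- yours will differ.
headers = {
    'User-Agent': 'Mozilla/5.0',             # copy the browser's real value
    'Authorization': 'Bearer <your-token>',  # token seen in the captured request
}
cookies = {
    'session_id': '<value-from-har>',
}

response = requests.get('https://example.com/protected-page',
                        headers=headers, cookies=cookies)
print(response.status_code)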
From this question, the last responder seems to think that it is possible to use Python to open a webpage, let me sign in manually, go through a bunch of menus, and then let Python parse the page once I get where I want. The website has a weird sign-in procedure, so using requests and passing a username and password will not be sufficient.
However, it seems from this question that it's not a possibility.
So the question is: is it possible? If so, do you know of some example code out there?
The way to approach this problem is: when you log in normally, have the developer tools open next to you and see what the request is sending.
When logging in to Bandcamp, you can watch the XHR request that is sent in the Network panel.
From that request's response you can see that an identity cookie is being set. That is probably how they identify that you are logged in, so once you have that cookie set you are authorized to view logged-in pages.
So in your program you could log in normally using requests, save the cookie, and then attach it to further requests; a requests.Session object will do this for you automatically.
Of course, login procedures and authorization mechanisms differ from site to site, but that's the general gist of it.
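A minimal sketch of that flow; the login endpoint and form field names are hypothetical, so inspect the real login request in the Network panel to find yours:

import requests

# Hypothetical endpoint and field names.
LOGIN_URL = 'https://example.com/login'

session = requests.Session()  # persists cookies across requests automatically
session.post(LOGIN_URL, data={'username': 'me', 'password': 'secret'})

# The identity cookie set by the login response now lives in the session,
# so subsequent requests are treated as logged in.
page = session.get('https://example.com/account')
print(page.status_code)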
So when do you actually need selenium? You need it if a lot of the page is rendered by JavaScript. requests only fetches the raw HTML, so if the menus and similar elements are rendered with JavaScript, you will never be able to see that information using requests.
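For comparison, here is a minimal selenium sketch for that JavaScript-heavy case. The URL and element id are hypothetical; the point is that selenium drives a real browser, so JavaScript-rendered content is present in the DOM:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # assumes Chrome and a matching driver are available
driver.get('https://example.com')  # hypothetical URL

# Elements rendered by JavaScript exist in the live DOM, unlike in the
# raw HTML that requests would fetch.
button = driver.find_element(By.ID, 'menu-button')  # hypothetical id
button.click()

print(driver.page_source[:200])
driver.quit()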
I am programming an application in Python that, among other functions, will print PDF files via a Xerox printer.
I am facing two options right now:
First one: find a way to communicate with the printer driver, so I could easily send instructions to the printer and do whatever I wanted with it.
Second one: since the first one seems a bit tricky and I don't know of any Python API that does something like that, I had the idea of using the service that Xerox provides. Basically, there is an IP address that redirects me to an administration page where I can get information about the status of the printer and... an option to select files to print (and set the number of pages, the tray where the pages will exit, etc.).
I think the best way is to follow the second option, but I don't know if that's doable.
Basically, I want to be able to modify that webpage's form in order to fill in, for example, the text boxes, and in the end "press" the submit button.
I don't know if this is possible, but if it is, can anyone point me in the right path, please?
Or if you have another idea, I would like to hear it.
So far I have only managed to get the page's source code; I still don't know how to submit the form after I change it.
import requests

# Fetch the printer's admin page and dump its raw HTML.
# The URL is a placeholder.
url = 'http://www.example.com'
response = requests.get(url)
print(response.content)
Unless Xerox has a Python API or library, your second option is the best choice.
When you visit the administration page and submit files for printing, try doing the following:
When you load the admin page, open Chrome's developer tools (right-click -> Inspect Element).
Open the "Network" tab in the developer console.
Try submitting some files for printing through the online form. Watch the Network panel for any activity. If a new row appears, click on it and view the request data.
Try to replicate the request's query parameters and headers with Python's requests.
If you need any help replicating the exact request, feel free to start a new question with the request data and what you have tried.
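As an illustration, if the Network panel shows the form posting a file to some endpoint, replicating it with requests might look like the sketch below. The URL and form fields are hypothetical; copy the real ones from the captured request:

import requests

# Hypothetical endpoint and fields -- take the real values from the
# request captured in the Network panel.
PRINT_URL = 'http://192.168.1.50/print/submit'

with open('document.pdf', 'rb') as f:
    files = {'file': ('document.pdf', f, 'application/pdf')}
    data = {'copies': '1', 'tray': 'main'}  # hypothetical form fields
    response = requests.post(PRINT_URL, files=files, data=data)

print(response.status_code)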
Is it possible for my Python web app to provide an option for the user to automatically send jobs to the locally connected printer? Or will the user always have to use the browser to manually print out everything?
If your Python web app is running inside a browser on the client machine, I don't see any way for the user to do this other than manually.
Some workarounds you might want to investigate:
if your web app is installed on the client machine, you will be able to connect directly to the printer, as you have access to the underlying OS (see the sketch after this list).
you could potentially create a browser plugin that, once installed, does this for the user, but I have no clue how that works technically.
what is it that you want to print? You could generate a PDF that contains everything the user needs to print, in one go.
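For the first workaround, a minimal sketch of printing directly through the OS. It assumes a Unix-like system where the lp command (CUPS) is available; on Windows you would need a different mechanism, e.g. the win32print module:

import subprocess

def print_file(path, printer=None):
    # Send a file to the default (or a named) printer via the lp command.
    cmd = ['lp', path] if printer is None else ['lp', '-d', printer, path]
    subprocess.run(cmd, check=True)

print_file('report.pdf')  # hypothetical file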
You can serve the user's browser a webpage that includes the JavaScript code necessary to perform the printing when the user clicks to request it, as shown for example here (a pretty dated article, but the key idea of using JavaScript to call window.print has not changed, and the article has some useful suggestions, e.g. on making a printer-friendly page; you can locate lots of other articles mentioning window.print with a web search, if you wish).
Calling window.print (from the JavaScript part of the page that your Python server-side code serves) will actually (in all browsers/OSs I know of) bring up a print dialog, so the user gets system-appropriate options (picking a printer if they have several, maybe saving as PDF instead of doing an actual print if their system supports that, etc.).
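A minimal sketch of the server-side half, using only Python's standard library; the page content is illustrative:

from http.server import BaseHTTPRequestHandler, HTTPServer

# Illustrative page: a button whose onclick handler calls window.print(),
# which opens the browser's native print dialog on the client machine.
PAGE = b'''<!DOCTYPE html>
<html>
  <body>
    <h1>Report</h1>
    <p>Content the user may want to print.</p>
    <button onclick="window.print()">Print this page</button>
  </body>
</html>'''

class PrintPageHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header('Content-Type', 'text/html')
        self.end_headers()
        self.wfile.write(PAGE)

if __name__ == '__main__':
    HTTPServer(('localhost', 8000), PrintPageHandler).serve_forever()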