I am programming an application in Python that, among other functions, will print PDF files via a Xerox printer.
I am facing two options right now:
First one: find a way to comunicate with the printer driver so I could easily send instructions to the printer and do whatever I wanted with it.
Second one: since the first one seems to be a bit tricky and I don't know any API for Python that does something like that, I had the idea of using the service that Xerox provides. Basically there is an IP address that redirects me to an administration page where I can get informations about the status of the printer and... an option to select files to print (and set the number of pages, the tray where the pages will exit, etc...).
I think the best way is to follow the second option, but I don't know if that's doable.
Basically I want to be able to change that webpage source code in order to change, for example the textboxes and in the end "press" the submit button.
I don't know if this is possible, but if it is, can anyone point me in the right path, please?
Or if you have another idea, I would like to hear it.
By now I only managed to get the page source code, I still don't know how to submit it after I change it.
import requests
url = 'http://www.example.com'
response = requests.get(url)
print(response.content)
Unless Xerox has a Python API or library, your second option is the best choice.
When you visit the administration page and submit files for printing, try doing the following:
When you load the admin page, open Chromes developer tools (right click -> Inspect Element)
Open the "Network" tab in the developer console.
Try submitting some files for printing through the online form. Watch the Network panel for any activity. If a new row appears, click on it and view the request data.
Try to replicate the request's query parameters and HEAD parameters with Python's requests.
If you need any help replicating the exact request, feel free to start a new question with the request data and what you have tried.
Related
I am trying to extract the data from the website https://shop.nordstrom.com/ for all the products (like shirt, t-shirt and so on). The page is dynamically loaded. I know I can use selenium with headless browser, but that is also a time consuming process and looking up on the elements, having strange ID and class names, that is also not so promising.
So I thought of looking up on the Network tool, if I can find the path to the API, from where the data is being loaded (XHR Request) . But I could not find any thing helpful. So is there a way to get the data from the website ?
If you don't want to use selenium then the alternative is to use a web parser like bs4 or use simply the request module.
You are on the right path in finding the call to the API. XHR requests can be seen under the network tab but the multitude of resources that appears makes it intricate to understand the requests being made. A simple way around this is to use the following method:
Instead of Network tab go to the console tab. There click on the settings icon, and then tick just the option Log XMLHTTPRequests.
Now refresh the page and scroll down to initiate dynamic calls. You will now be able to see the logs of all XHR in a more clear way.
For example
(index):29 Fetch finished loading: GET "**https://shop.nordstrom.com/api/recs?page_type=home&placement=HP_SALE%2CHP_TOP_RECS%2CHP_CUST_HIS%2CHP_AFF_BRAND%2CHP_FTR&channel=web&bound=24%2C24%2C24%2C24%2C6&apikey=9df15975b8cb98f775942f3b0d614157&session_id=0&shopper_id=df0fdb2bb2cf4965a344452cb42ce560&country_code=US&experiment_id=945b2363-c75d-4950-b255-194803a3ee2a&category_id=2375500&style_id=0%2C0%2C0%2C0&ts=1593768329863&url=https%3A%2F%2Fshop.nordstrom.com%2F&zip_code=null**".
Making a get request to that URL gives a bunch of Json objects. You can now use this url and others that you can derive to make the request straight to the URL.
See the answer here on how you can integrate the url with a request module to fetch data.
I have been using mechanize to fill in a form from a website but this now has changed and some of the required fields seem to be hidden and cannot be accessed using mechanize any longer - when printing all available forms.
I assume it has been modified to use more current methods (application/x-www-form-urlencoded) but I have not found a way to update my script to continue using this form programmatically.
From what I have read, I should be able to send a dict (key/value pair) to the submit button directly rather than filling the form in the first place - please correct me if I am wrong.
BUT I have not been able to find a way to obtain what keys are required...
I would massively appreciate it if someone could point me in the right direction or put me straight in case this is no longer possible.
You cannot, in all circumstances, extract all fields a server expects.
The post target, the code handling the POST, is a black box. You cannot look inside the code that the server runs. The best information you have about what it expects is what the original form tells your browser to post. That original form consists not only of the HTML, but also of the headers that were sent with it (cookies for example) and any JavaScript code that is run by the browser.
In many cases, parsing the HTML sent for the form is enough; that's what Mechanize (or a recent more modern framework like robobrowser) does, plus a little cookie handling and making sure typical headers such as the referrer are included. But if any JavaScript code manipulated the HTML or intercepts the form submission to add or remove data then Mechanize or other Python form parsers cannot replicate that step.
Your options then are to:
Reverse engineer what the Javascript code does and replicate that in Python code. The development tools of your browser can help here; observe what is being posted on the network tab, for example, or use the debugger to step through the JavaScript code to see what it does.
Use an actual browser, controlled from Python. Selenium could do this for you; it can drive a desktop browser (Chrome, Firefox, etc.) or it can be used to drive a headless browser implementation such as PhantomJS. This is heavier on the resources, but will actually run the JavaScript code and let you post a form just as your browser would, in each and every way.
I have been using mechanize to fill in a form from a website but this now has changed and some of the required fields seem to be hidden and cannot be accessed using mechanize any longer - when printing all available forms.
I assume it has been modified to use more current methods (application/x-www-form-urlencoded) but I have not found a way to update my script to continue using this form programmatically.
From what I have read, I should be able to send a dict (key/value pair) to the submit button directly rather than filling the form in the first place - please correct me if I am wrong.
BUT I have not been able to find a way to obtain what keys are required...
I would massively appreciate it if someone could point me in the right direction or put me straight in case this is no longer possible.
You cannot, in all circumstances, extract all fields a server expects.
The post target, the code handling the POST, is a black box. You cannot look inside the code that the server runs. The best information you have about what it expects is what the original form tells your browser to post. That original form consists not only of the HTML, but also of the headers that were sent with it (cookies for example) and any JavaScript code that is run by the browser.
In many cases, parsing the HTML sent for the form is enough; that's what Mechanize (or a recent more modern framework like robobrowser) does, plus a little cookie handling and making sure typical headers such as the referrer are included. But if any JavaScript code manipulated the HTML or intercepts the form submission to add or remove data then Mechanize or other Python form parsers cannot replicate that step.
Your options then are to:
Reverse engineer what the Javascript code does and replicate that in Python code. The development tools of your browser can help here; observe what is being posted on the network tab, for example, or use the debugger to step through the JavaScript code to see what it does.
Use an actual browser, controlled from Python. Selenium could do this for you; it can drive a desktop browser (Chrome, Firefox, etc.) or it can be used to drive a headless browser implementation such as PhantomJS. This is heavier on the resources, but will actually run the JavaScript code and let you post a form just as your browser would, in each and every way.
I want to be able to access the elements of a webpage with python. I use Python 2.5 with Windows XP. Currently, I am using pywinauto to try to control IE, but I could switch to a different browser or module if it is needed. I need to be able to click the buttons the page has and type in text boxes. So far the best I've gotten is I was able to bring IE to the page I wanted. If it is possible without actually clicking the coords of the buttons and text boxes on the page, please tell me. Thanks!
I think for interacting with webserver better is to use cUrl. All webservers function are responses for GET or POST request (or both). in order to call them, just call the urls that buttons are linked to and/or send POST data attaching that data to appropiate request obj before calling send method. cUrl is able to retrieve and do some processing on webserver responce (web site code) without displaying it, what delivers knowledge about url adresses contained in the web site, which are called when clicking certain buttons. Also possible to know html fields which carry POST data to get their names.
Python has curl lib, which I hope is so powerful as PHP curl, tool what I used and presented.
Hope you are on the track now.
BR
Marek
Tommy ,
What i Have observer till now if u have a web application and if u are able to identify the object in that page using winspy (tool for spyup the controls in the page). you can automate that. Else you cant.
Like using following code
app = Application.Application();
app.window_(title_re = 'Internet Explorer. *').OKButton.Click()
I'd like to retrieve data from a specific webpage by using urllib library.
The problem is that in order to open this page some data should be sent to
the server before. If I do it with IE, i need to update first some checkboxes and
then press "display data" button, which opens the desired page.
Looking into the source code, I see that pressing "display data" submits some kind of
form - there is no specific url address there. I cannot figure out by looking
at the code what paramaters are sent to the server...
I think that maybe the simpler way to do that would be to analyze the communication
between the IE and the webserver after pressing the "display data" button.
If I could see explicitly what IE does, I could mimic it with urllib.
What is the easiest way to do that?
An HTML debugging proxy would be the best tool to use in this situation. As you're using IE, I recommend Fiddler, as it is developed by Microsoft and automatically integrates with Internet Explorer through a plugin. I personally use Fiddler all the time, and it is a really helpful tool, as I'm building an app that mimics a user's browsing session with a website. Fiddler has really good debugging of request parameters, responses, and can even decode encrypted packets.
You can use a web debugging proxy (e.g. Fiddler, Charles) or a browser addon (e.g. HttpFox, TamperData) or a packet sniffer (e.g. Wireshark).