Need to input text into a POST website using Python

I need to input text into the text box on this website:
http://www.link.cs.cmu.edu/link/submit-sentence-4.html
I then need the HTML of the returned page.
I have looked at other solutions, but I am aware that there is no one solution that fits every site.
I have seen Selenium, but I do not understand its documentation or how I can apply it.
Please help me out, thanks.
BTW, I have some experience with BeautifulSoup, if that helps.

Check out the requests module. It is super easy to use for making any kind of HTTP request, and it gives you complete control over any extra headers or form payload data you need in order to POST to the website.
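For example, here's a minimal sketch (the "Sentence" field name is an assumption; inspect the page's <form> tag to find the real field name and action URL before using this):

    import requests

    # NOTE: the URL and field name below are guesses; check the page's
    # <form> tag for the real values.
    url = "http://www.link.cs.cmu.edu/link/submit-sentence-4.html"
    payload = {"Sentence": "This is a test sentence."}

    response = requests.post(url, data=payload)
    print(response.text)  # the HTML of the returned page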
P.S. If all else fails, make the request in a web browser and copy it as a cURL command using the browser's inspector ("Copy as cURL"). Then you can run that command from a Python script via subprocess (you may need to install curl on your system if you don't have it), keeping the parameters from the copied request.
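A rough sketch of that fallback (the curl arguments below are placeholders for whatever "Copy as cURL" gives you):

    import subprocess

    # Placeholder command: replace with the actual "Copy as cURL" output.
    result = subprocess.run(
        ["curl", "-X", "POST", "http://example.com/submit", "-d", "field=value"],
        capture_output=True,
        text=True,
    )
    print(result.stdout)  # the response body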

Related

Using python to parse a webpage that is already open

From this question, the last responder seems to think that it is possible to use Python to open a webpage, let me sign in manually and go through a bunch of menus, and then let Python parse the page once I get where I want. The website has a weird sign-in procedure, so using requests and passing a username and password will not be sufficient.
However, it seems from this question that it's not a possibility.
So the question is: is it possible? If so, do you know of some example code out there?
The way to approach this problem is to log in normally with the developer tools open next to you and watch what the request sends.
When logging in to Bandcamp, for example, you can see the XHR request that is sent, and in its response an identity cookie is set. That is probably how the site identifies that you are logged in, so once you have that cookie set, you are authorized to view logged-in pages.
So in your program you can log in normally using requests, save the cookie, and then send that cookie along with further requests.
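A minimal sketch using requests.Session, which stores and resends cookies automatically (the login URL and field names below are placeholders; copy the real ones from the XHR request you observed):

    import requests

    session = requests.Session()  # persists cookies across requests

    # Placeholder URL and field names; take the real ones from the
    # login request you saw in the developer tools.
    session.post("https://bandcamp.com/login",
                 data={"username": "you@example.com", "password": "secret"})

    # The identity cookie is now in session.cookies and is sent
    # automatically with every further request.
    page = session.get("https://bandcamp.com/settings")
    print(page.text)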
Of course login procedures and how this authorization mechanism works may differ, but that's the general gist of it.
So when do you actually need Selenium? You need it if a lot of the page is rendered by JavaScript. requests can only fetch the raw HTML, so if the menus and such are rendered with JavaScript, you will never be able to see that information using requests.
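If you do end up needing it, the basic Selenium usage is short (a sketch; it assumes a Firefox driver such as geckodriver is installed, and the URL is a placeholder):

    from selenium import webdriver

    driver = webdriver.Firefox()  # requires geckodriver on your PATH
    driver.get("http://example.com")

    # page_source contains the HTML *after* JavaScript has run.
    html = driver.page_source
    driver.quit()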

Python Webscraper Breaking, Not sure Why

I am trying to access a third-party ticketing site's API through a web scraper.
I know this is vague, but I am new to Python and I am not exactly sure how to work out my error below.
My code breaks on this line:

    roken_response = r.json

I get a JSONDecodeError. Can anyone tell me why exactly my code is breaking?
Using the requests library (which you seem to be using), .json is a convenience attribute that decodes the response as JSON (in current versions of requests it is a method, r.json()). If the response was not valid JSON, then you will get a JSONDecodeError, as you show in your screenshot.
So the web server probably answered your request with HTML or an error page instead of JSON.
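A defensive sketch (the URL is a placeholder) that checks what actually came back before trying to decode it:

    import requests

    r = requests.get("https://example.com/api/tickets")  # placeholder URL

    # Only try to decode JSON if the server says it sent JSON.
    if "application/json" in r.headers.get("Content-Type", ""):
        data = r.json()
    else:
        print("Server did not return JSON; first 500 chars of the body:")
        print(r.text[:500])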
Also it sounds like you are violating the ToS of that poor ticketing site :(

Direct link to comments that are being loaded asynchronously?

I am playing around with change.org and trying to download a couple of comments on a petition. For this, I would like to know where the comments are pulled from when the user clicks on "load more reasons". For an example, look here:
http://www.change.org/petitions/tell-usda-to-stop-using-pink-slime-in-school-food
Looking at the XHR requests in Chrome, I see requests being sent to http://www.change.org/petitions/tell-usda-to-stop-using-pink-slime-in-school-food/opinions?page=2&role=comments. The page number, of course, varies with the number of times comments have been loaded.
However, this link leads to a blank page when I try it in a browser. Is this because of some missing data in the URL, or is it the result of some authentication step within the JavaScript that makes the request in the first place?
Any pointers will be appreciated. Thanks!
EDIT: Thanks to the first response, I see that the data is received when I use the console. How do I receive the same data when making the request from a Python script? Do I have to mimic the browser, or is there a way to just use urllib?
They must be validating the source of the request. If you go to the site, open the console, and run this:

    $.get('http://www.change.org/petitions/tell-usda-to-stop-using-pink-slime-in-school-food/opinions?page=2&role=comments', {}, function(data){ console.log(data); });

you will see the data come back.
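To get the same data from Python, sending the header that jQuery adds to every XHR call may be all that is needed (whether change.org checks this header, cookies, or both is an assumption here). A sketch using requests:

    import requests

    url = ("http://www.change.org/petitions/"
           "tell-usda-to-stop-using-pink-slime-in-school-food/opinions")

    # jQuery adds this header to $.get()/XHR calls; some servers return
    # a blank page when it is missing.
    headers = {"X-Requested-With": "XMLHttpRequest"}
    params = {"page": 2, "role": "comments"}

    response = requests.get(url, params=params, headers=headers)
    print(response.text)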

Parse and interact with obfuscated javascript

I'm trying to interact with an HTML 4.0 website which uses heavily obfuscated JavaScript to hide the regular HTML elements. What I want to do is fill out a form and read the returned results, and this is proving harder to do than expected.
When I read the page using Firebug, it gave me the source code deobfuscated, and I can use that to accomplish what I want. The Firebug output showed all the regular elements of a website, such as form tags and the like, which were hidden in the original source.
I've written the rest of my application in Python, using mechanize to interact with other web services, so I'd rather use an existing Python module for this if possible. The problem is not only how to read the source code in a way mechanize can understand, but also how to generate a response the web server can interpret. Could I use regular mechanize controls even though the HTML code is obfuscated?
At the beginning of my project I used pywebkitgtk instead of mechanize, but ditched it because it isn't implemented that well in Python; most functions are missing. Would it perhaps be a sensible approach to start up a WebKit browser, read the HTML from it, and use that with mechanize?
Any help would be greatly appreciated, I'm really in a bind here. Thanks!
Edit: I tried dumping the HTML fetched by mechanize and opening it with pywebkitgtk, using load_html_string, and then evaluating the HTML that way. Unfortunately, since the document I'm trying to parse loads more resources dynamically, the script just stalls waiting for those resources to load. Note that I can't use WebKit to load the document itself, since I use mechanize's CookieJar to log in first.
I also tried dumping the HTML from WebKit, which for some reason dumped only the obfuscated JavaScript, while displaying the website perfectly fine. If WebKit could dump the deobfuscated code the way Firebug does, I could work with that and form a request according to the clean code.
Rather than trying to process the page, how about just using Firebug to figure out the names of the form fields, and then using httplib or similar to send a request with the necessary fields and settings?
If the form is submitted using Ajax, you should be able to determine the values being sent to the server in Firebug as well.
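A rough sketch of that approach, using requests instead of httplib for brevity (the URL and field names are placeholders for whatever Firebug shows you):

    import requests

    # Placeholder endpoint and field names; read the real ones out of the
    # deobfuscated form that Firebug displays.
    response = requests.post(
        "http://example.com/form-handler",
        data={"name": "John", "query": "test"},
    )
    print(response.text)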

Retrieve cookie created using javascript in python

I've had a look at many tutorials on cookiejar, but my problem is that the webpage I want to scrape creates the cookie using JavaScript, and I can't seem to retrieve the cookie. Does anybody have a solution to this problem?
If all pages use the same JavaScript, then maybe you could parse the HTML to find that piece of code, and from it derive the value the cookie would be set to.
That would make your scraping quite vulnerable to changes on the third-party website, but that's most often the case with scraping. (Please bear in mind that the third-party website owner may not like that you're getting the content this way.)
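A sketch of the idea (the cookie-setting pattern is an assumption; adapt the regular expression to the script the site actually serves):

    import re
    import requests

    html = requests.get("http://example.com/page").text  # placeholder URL

    # Assumes the script contains something like:
    #   document.cookie = "session=abc123";
    match = re.search(r'document\.cookie\s*=\s*"([^=]+)=([^";]+)', html)
    if match:
        name, value = match.group(1), match.group(2)
        # Send the reconstructed cookie with the next request.
        page = requests.get("http://example.com/data",
                            cookies={name: value})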
I responded to your other question as well: take a look at mechanize. It's probably the most fully featured scraping module I know: if the cookie is sent, then I'm sure you can get to it with this module.
Maybe you can execute the JavaScript code in a JavaScript engine with Python bindings (like python-spidermonkey or PyV8) and then retrieve the cookie. Or, since the JavaScript code is executed client-side anyway, you may be able to port the cookie-generating code to Python.
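With PyV8 the idea looks roughly like this (a sketch; it assumes you have already extracted the cookie-generating snippet from the page and rewritten it so its last expression is the cookie value):

    import PyV8

    # Placeholder snippet standing in for the extracted cookie code.
    js_code = '"session=" + (1234 * 2);'

    ctxt = PyV8.JSContext()
    ctxt.enter()
    cookie = ctxt.eval(js_code)  # returns the value of the last expression
    ctxt.leave()
    print(cookie)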
You could access the page using a real browser, via PAMIE, win32com or similar, then the JavaScript will be running in its native environment.
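A sketch of the win32com route (Windows only; assumes Internet Explorer and pywin32 are installed, and the URL is a placeholder):

    import time
    import win32com.client

    ie = win32com.client.Dispatch("InternetExplorer.Application")
    ie.Visible = False
    ie.Navigate("http://example.com/page")  # placeholder URL

    while ie.Busy:  # wait for the page (and its JavaScript) to finish
        time.sleep(0.5)

    # document.cookie as the browser sees it, after the script has run.
    print(ie.Document.cookie)
    ie.Quit()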
