Based on some quick examples found on SO and other sources, I am trying to use Python urllib/urllib2 to submit a form in the following manner:
>>> import urllib, urllib2
>>> url = 'http://example.com'
>>> r_params = {'a':'test','b':'hooray'}
>>> e_params = urllib.urlencode(r_params)
>>> user_agent = 'some browser and such'
>>> headers = {'User-Agent': user_agent}
>>> req = urllib2.Request(url, e_params, headers)
>>> response = urllib2.urlopen(req)
>>> data = response.read()
I've gotten this to work; however, the particular form I'm targeting has two buttons of type "submit", e.g.:
<b><input type="submit" name="ButtonA" value="SUBMIT"></b>
<b><input type="submit" name="ButtonB" value="LINK"></b>
I believe the problem I'm having results from the current code choosing the wrong one. How do I get a response by submitting ButtonB rather than ButtonA? Some of the stuff I've read seems to indicate that I could try using mechanize, but I was hoping to keep this simple without having to read up and learn mechanize. Is there an easy way to do this, or do I need to suck it up and actually take the time to learn and understand what I'm doing?
It should be fairly simple; in that case, you should look into what exactly you're doing. Specifically, you're sending a POST request (urllib2.urlopen sends a POST automatically when the data argument is supplied) with the data that would normally be supplied by the form element itself. When a form has multiple "submit" inputs, the name and value of the submit input that was activated are sent along with the rest of the form data.
So, that's all you have to do - include "ButtonB":"LINK" as data.
A quick reference so you can see how HTML does all the stuff it does:
http://www.w3.org/TR/html401/interact/forms.html#submit-format
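For example, here is a minimal sketch reusing the setup from the question (same URL, fields, and headers), with ButtonB's name/value simply added to the form data:
import urllib, urllib2

url = 'http://example.com'
# Including a submit input's name/value "presses" that button;
# here we pick ButtonB instead of ButtonA.
r_params = {'a': 'test', 'b': 'hooray', 'ButtonB': 'LINK'}
e_params = urllib.urlencode(r_params)
headers = {'User-Agent': 'some browser and such'}
req = urllib2.Request(url, e_params, headers)
response = urllib2.urlopen(req)
data = response.read()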
I recommend using a tool like TamperData for Firefox to discover precisely how the site's POSTs are formed. Activate TamperData just before you're ready to click one of the buttons. When it's up, go ahead and click one. The POST will be recorded in TamperData. Find it and click on it.
Find the POSTDATA row below and double-click it. Select the "Decoded" radio button to remove the HTML escapes. Now you have a 1:1 reference you should copy when making your "r_params" dictionary. For instance, if the POSTDATA looked like this:
Name | Value
--------------------
QueryString | test
Page |
Search | blah
then you will create your dictionary like this:
r_params = {'QueryString': 'test',
            'Page': '',
            'Search': 'blah'}
After you've found out what the POSTDATA looks like for each separate submit event, you'll know how to create the right dictionary to send along. Also, be sure to confirm you are POSTing to the correct URL. Good luck!
Here is my example form: https://docs.google.com/forms/d/e/1FAIpQLSfVXZ1721ZRrHetp1qUak9T-o-MwKA9G3q01rLAFI2OJhZjUw/viewform
I want to send a response to it with Python, but I don't know how to fill in the "text box", so I can't even get started. Can you help me, please?
To submit data to a Google Form, you first need to replace viewform with formResponse in your URL.
You are going to POST the submission to the form response URL.
You need to keep two things in mind.
Get the form response URL. It can be found by inserting your form ID into the following:
https://docs.google.com/forms/d/<form_id>/formResponse
Assemble the submission. This will be a dictionary whose keys are the IDs of the form questions and whose values are what you'd like to submit. To get the IDs, go to your live form and inspect the HTML (Right Click -> Inspect Element) around the fields where you would normally input your information. You should find a group of input elements with name attributes like:
name="entry.<id>"
A simple program to send response would be:
import requests
url ="https://docs.google.com/forms/d/e/1FAIpQLSfVXZ1721ZRrHetp1qUak9T-o-MwKA9G3q01rLAFI2OJhZjUw/formResponse"
data_to_send = 'DATA' # Assign Data to be sent
requests.post(url, {"entry.685250623":data_to_send}) # Found the entry Id viewing your form
Hope this answers your question!!!
I'm a Python novice, thanks for your patience.
I retrieved a web page, using the requests module. I used Beautiful Soup to harvest a few hundred href objects (links). I used uritools to create an array of full URLs for the target pages I want to download.
I don't want everybody who reads this note to bombard the web server with requests, so I'll show a hypothetical example that is realistic for just 2 hrefs. The array looks like this:
hrefs2 = ['http://ku.edu/pls/WP040?PT001F01=910&pf7331=11',
'http://ku.edu/pls/WP040?PT001F01=910&pf7331=12']
If I were typing these into 100s of lines of code, I understand what to do in order to retrieve each page:
from lxml import html
import requests
url = 'http://ku.edu/pls/WP040/'
payload = {'PT001F01': '910', 'pf7331': '11'}
r = requests.get(url, params = payload)
Then get the second page
payload = {'PT001F01': '910', 'pf7331': '12'}
r = requests.get(url, params = payload)
And keep typing in payload objects. Not all of the hrefs I'm dealing with are sequential, and not all of the payloads differ only in the last integer.
I want to automate this and I don't see how to create the payloads from the hrefs2 array.
While fiddling with uritools, I found urisplit, which can give me the part I need to parse into a payload:
[urisplit(x)[3] for x in hrefs2]
['PT001F01=910&pf7331=11',
'PT001F01=910&pf7331=12']
Each one of those has to be turned into a payload object and I don't understand what to do.
I'm using Python 3, and I used uritools because it appears to be the standards-compliant replacement for urltools.
I fell back on a shell script to get the pages with wget, which does work, but it is so un-Pythonic that I'm asking here what to do. I mean, this does work:
import subprocess
for i in hrefs2:
    subprocess.call(["wget", i])
You can pass the full url to requests.get() without splitting up the parameters.
>>> requests.get('http://ku.edu/pls/WP040?PT001F01=910&pf7331=12')
<Response [200]>
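Applied to your case, that means you can just loop over the hrefs2 array directly; a minimal sketch:
import requests

hrefs2 = ['http://ku.edu/pls/WP040?PT001F01=910&pf7331=11',
          'http://ku.edu/pls/WP040?PT001F01=910&pf7331=12']

for href in hrefs2:
    r = requests.get(href)  # the full URL already carries its query string
    # ... do something with r.text ...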
If for some reason you don't want to do that, you'll need to split up the parameters somehow. I'm sure there are better ways to do it, but the first thing that comes to mind is:
a = ['PT001F01=910&pf7331=11',
     'PT001F01=910&pf7331=12']

# list to store all url parameters after they're converted to dicts
urldata = []

# iterate over the list of params
for param in a:
    data = {}
    # split the string into key=value pairs
    for kv in param.split('&'):
        # split each pair up
        b = kv.split('=')
        # first part is the key, second is the value
        data[b[0]] = b[1]
    # after converting every kv pair, add the resulting dict to the list
    urldata.append(data)
You could do this with less code but I wanted to be clear what was going on. I'm sure there is already a module somewhere out there that does this for you too.
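In fact the standard library already handles this: urllib.parse can do the splitting for you. A sketch (Python 3, which the question is using):
from urllib.parse import urlsplit, parse_qs

hrefs2 = ['http://ku.edu/pls/WP040?PT001F01=910&pf7331=11',
          'http://ku.edu/pls/WP040?PT001F01=910&pf7331=12']

urldata = []
for href in hrefs2:
    qs = urlsplit(href).query  # e.g. 'PT001F01=910&pf7331=11'
    # parse_qs maps each key to a list of values; take the first of each
    urldata.append({k: v[0] for k, v in parse_qs(qs).items()})

# urldata == [{'PT001F01': '910', 'pf7331': '11'},
#             {'PT001F01': '910', 'pf7331': '12'}]
Note that requests will also accept the parse_qs output directly as params, since it supports list values.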
As you can probably tell from the nature of my question, I'm a little new to this. I have read similar posts on this subject, but most of them went right past my head and did not feel 100% applicable to the circumstance I was facing, so I thought I'd ask the question in a simplified way.
The question:
Let's say I'm running the HTML form below and a user submits it to my views.py as shown in the views section below. I would be able to store the value of the user's selection with: car_selection = request.POST.get('car').
My question is: how would I be able to capture the HTML5 data attribute, i.e. data-car-type="premium"?
I know there are gurus out there, but please do not explode my head. I really need simplified help.
Thanks for helping.
Example HTML Form:
<select name="carlist">
    <option data-car-type="premium" name="car" value="audi">Audi</option>
</select>
Example Django View Function
def getcar(request):
    ...
    if request.method == 'POST':
        ...
        selected_car = request.POST.get('car')
Well, it actually is possible. Say your view looks like this:
def getcar(request):
    ...
    if request.method == 'POST':
        myform = MyForm(request.POST)
        ...
myform includes the uncleaned form as HTML. Then you can use BeautifulSoup to extract the data. Something like this:
from bs4 import BeautifulSoup

test = BeautifulSoup(str(myform), "html.parser")
data_values = [item["data-car-type"] for item in test.find_all() if "data-car-type" in item.attrs]
This will extract values from data-car-type attributes.
That being said, this does seem like a bad design. I surely would never go to such length to get the "car type" data. It's probably written somewhere in your database. Get it from there.
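For instance, if the car types lived in a hypothetical Car model (the model and field names below are assumptions, not something from the question), the view could simply look the type up:
# Sketch -- assumes a Car model with `value` and `car_type` fields.
selected_car = request.POST.get('car')
car_type = Car.objects.get(value=selected_car).car_type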
I know this question is four years old, but I came across this page when a similar question arose in my mind.
Short answer
It's not possible, unless you use JavaScript on the front-end side as a workaround. The accepted answer is wrong.
Explanation
Indeed, in the example above, try print(request.POST) and you'll see that the QueryDict object request.POST received by the view does not contain any reference to the HTML5 data attribute you want to fetch. It's basically a kind of Python dictionary (with a few distinctive features, cf. the documentation). Admittedly, if you print(myform) in the same example, you'll see some HTML code. But this code is regenerated by Django when you bind the data to the form; it is not the HTML the user actually submitted, so BeautifulSoup will never be able to find what you're looking for. From the Django documentation:
If the form is submitted using a POST request, the view will [...] create a form instance and populate it with data from the request: form = NameForm(request.POST). This is called "binding data to the form" (it is now a bound form).
Workaround
What I've done on my side, and what I would suggest you do, is use some JavaScript on the client side to add the missing information to the form when it's submitted. For instance, it could look like this:
document.querySelector("#form_id").addEventListener("submit", function(e) {
    // read the data attribute off the relevant element...
    const your_data_attribute = document.getElementById("XXX").dataset.yourInfo;
    // ...and copy it into a hidden input so it is POSTed with the form
    const another_hidden_field = document.getElementById("YYY");
    another_hidden_field.value = your_data_attribute;
});
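On the Django side, the view then reads the hidden field like any other POST parameter; a sketch (the field name is an assumption and must match the name attribute of the #YYY input):
# 'hidden_field_name' must match the hidden input's name attribute
if request.method == 'POST':
    car_type = request.POST.get('hidden_field_name')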
On Facebook, I want to find fb_dtsg so that I can post a status:
import urllib, urllib2, cookielib
jar = cookielib.CookieJar()
cookie = urllib2.HTTPCookieProcessor(jar)
opener = urllib2.build_opener(cookie)
data = urllib.urlencode({'email':"email",'pass':"password", "Log+In":"Log+In"})
req = urllib2.Request('http://www.facebook.com/login.php')
opener.open(req, data)
opener.open(req, data) #Needs to be opened twice to log on.
req2 = urllib2.Request("http://www.facebook.com/")
page = opener.open(req2)
fb_dtsg = page[page.find('name="fb_dtsg"') + 22:page.find('name="fb_dtsg"') + 33] #This just finds the value of "fb_dtsg".
Yes, this does find a value, and a value that looks like fb_dtsg should look, but the value changes every time I open the page again, and when I use it to post a status it does not work. When I record what happens in Google Chrome while posting a status normally, I get a working fb_dtsg value that does not change (for a long session) and that does work when I use it to post a status. Please, please show me how to fix this without using the API.
The search slice for fb_dtsg truncates the last digit, so change 33 to 34:
fb_dtsg = page[page.find('name="fb_dtsg"') + 22:page.find('name="fb_dtsg"') + 34]
Anyway, a better way to search for fb_dtsg is to use re:
import re
re.findall('fb_dtsg.+?value="([^"]+)"', page)
As I answered in one of your earlier posts, it may also require other hidden variables.
If this still doesn't work, can you provide the code where you are making the post, including all the POST form data?
BTW, sorry for not looking at all your previous posts with the same content :P
To give an overview of the problem: I have a list of Twitter users' screen_names and I want to verify whether they are suspended or not. I don't want to use the Twitter search API, to avoid the rate-limit problem (the list is quite big). Therefore, I am trying to use a cluster of computers to label my dataset (whether an account in my database is suspended or not).
If an account is suspended by Twitter and you try to access it through the link https://twitter.com/screen_name, you get redirected to https://twitter.com/account/suspended.
I tried to capture this behaviour using Python 2.7 with urllib, using the geturl() method. It works but is not reliable (I don't get the same results for the same link). I tested it on the same account, and yet sometimes it returns https://twitter.com/account/suspended and other times it returns https://twitter.com/screen_name.
The same problem occurs with requests.
My code:
import urllib

import requests
from lxml import html

screen_name = 'IaMaGuyGetIt'
account_url = "https://twitter.com/" + screen_name
url = requests.get(account_url)
print url.url

req = urllib.urlopen(url.url).read()
page = html.fromstring(req)
for heading in page.xpath("//h1"):
    if heading.text == 'Account suspended':
        print True
The Twitter server only serves you the 302 redirect once; after that it assumes your browser has cached the redirect.
The body of the page does contain a pointer though, so even if you were not redirected you can see that there is still the link there:
>>> r = requests.get(account_url)
>>> r.url
u'https://twitter.com/IaMaGuyGetIt'
>>> r.text
u'<html><body>You are being redirected.</body></html>'
Look for that exact text.
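Alternatively, you can tell requests not to follow redirects at all and inspect the response yourself; a sketch combining both checks:
import requests

screen_name = 'IaMaGuyGetIt'
r = requests.get("https://twitter.com/" + screen_name, allow_redirects=False)

# A fresh request for a suspended account should answer with a redirect;
# otherwise fall back to searching the body for the redirect text.
suspended = (r.status_code in (301, 302)
             and 'suspended' in r.headers.get('location', '')) \
            or 'You are being redirected.' in r.text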