I'm trying to scrape a webpage by sending a POST request to fill in a form. I generally use Selenium to scrape pages with Python, but I recently read that sending a POST request is a better way to scrape results. I followed some instructions to write my code, but when I post my data I get the same page back with the form filled in (the POST doesn't submit the form). What am I doing wrong? The same page also has a second form to fill in after the first, so even if I manage to submit the first form, I don't know how to carry that response forward to get the final response. If someone can help with some ideas, I would really appreciate it! Thanks. I include my code and the page whose final quotation I'm trying to scrape:
https://www.santander.cl/cotizador-web/
import requests, lxml.html
import time
s = requests.session()
login = s.get('https://www.santander.cl/cotizador-web/cotizador/pasosSolicitud.xhtml')
login_html = lxml.html.fromstring(login.text)
hidden_inputs = login_html.xpath(r'//form//input[@type="hidden"]')
form = {x.attrib["name"]: x.attrib["value"] for x in hidden_inputs}
form['pasosForm:marcas']='27'
form['pasosForm:modelos']='1978'
form['pasosForm:ano']='2015'
form['pasosForm:uso']='1'
form['pasosForm:j_id93373712_1a32e354_input']='on'
form['formDialogCotiSelec:j_id216370348_64c01a10_active'] = '1'
form['javax.faces.partial.execute']='pasosForm pasosForm:siguiente1'
response = s.post('https://www.santander.cl/cotizador-web/cotizador/pasosSolicitud.xhtml', data=form)
print(response.text)
I see that every form on the page has a hidden field like this:
<input type="hidden" name="javax.faces.ViewState"
id="javax.faces.ViewState" value="zDmSF7aJ4QSdyqjY5D4dGbfEaQr5OiS6WorNARY6pfHWSXIe/APb5e
/wcHsiGvPVaXW4IFpVHFyFHNSSJMPdHt2mhaYm4TQ9WPo+TQgWFTB1ZRE1wwiJtXQfmKuwE2+R+iRmONBAmZCR9E8x" />
This is the JSF view state, which also acts like a CSRF token: it is generated for the current session. You need to visit the form page first (which creates the session) and send this value back when you make the POST request.
More info here:
https://www.owasp.org/index.php/Cross-Site_Request_Forgery_%28CSRF%29_Prevention_Cheat_Sheet
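As a rough sketch of that flow: load the form page in a session, copy every hidden input (including `javax.faces.ViewState`) into the payload, add your own values, and post back through the same session. The names `HiddenInputParser`, `hidden_fields` and `post_with_view_state` are made up for this illustration, and the parsing uses the standard library's `html.parser` instead of lxml so that it runs without extra packages. The `session` argument is assumed to behave like `requests.Session()`.

```python
from html.parser import HTMLParser


class HiddenInputParser(HTMLParser):
    """Collect name/value pairs of every <input type="hidden"> in a page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input" and (attrs.get("type") or "").lower() == "hidden":
            name = attrs.get("name")
            if name:
                # A hidden input without a value attribute is sent as "".
                self.fields[name] = attrs.get("value") or ""


def hidden_fields(html_text):
    """Return a dict of all hidden form fields found in html_text."""
    parser = HiddenInputParser()
    parser.feed(html_text)
    return parser.fields


def post_with_view_state(session, url, extra_fields):
    """GET the form page, then POST the hidden fields plus our own values.

    `session` is assumed to expose get/post like requests.Session, so the
    cookies set by the GET are sent again with the POST.
    """
    page = session.get(url)
    payload = hidden_fields(page.text)
    payload.update(extra_fields)
    return session.post(url, data=payload)
```

With requests installed, this could be called as `post_with_view_state(requests.Session(), url, {'pasosForm:ano': '2015'})`. Whether the server accepts the post still depends on the other fields it expects, which are best checked in the browser's network tab.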
Can anyone help me grab the data from this website?
I want to get the data from the new page that loads after I click "Submit Query". That requires a POST request because a form is being submitted:
https://henke.lbl.gov/optical_constants/pert_form.html
I tried several POST-request approaches I found online, but they all failed and I don't know why.
Many thanks!
If you want to grab the text content of the page, for example, try:
import requests
r = requests.get('https://henke.lbl.gov/optical_constants/pert_form.html')
print(r.text)
For more go to https://docs.python-requests.org/en/master/
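Fetching the page with a GET only returns the form itself. To post to it, you first need to know where the form submits and which field names it sends. A minimal sketch of how to read that from the downloaded HTML follows; `FormInspector` and `inspect_forms` are hypothetical helper names, and the field names on the real page are not assumed here, they are whatever the parser finds.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FormInspector(HTMLParser):
    """Record the action, method and field names of each <form> on a page."""

    def __init__(self):
        super().__init__()
        self.forms = []
        self._current = None  # the form we are currently inside, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = {
                "action": attrs.get("action") or "",
                "method": (attrs.get("method") or "get").lower(),
                "fields": [],
            }
            self.forms.append(self._current)
        elif tag in ("input", "select", "textarea") and self._current is not None:
            name = attrs.get("name")
            if name:
                self._current["fields"].append(name)

    def handle_endtag(self, tag):
        if tag == "form":
            self._current = None


def inspect_forms(html_text, page_url):
    """Return one dict per form, with the action resolved to an absolute URL."""
    parser = FormInspector()
    parser.feed(html_text)
    for form in parser.forms:
        # An empty action means "submit to the page's own URL".
        form["action"] = urljoin(page_url, form["action"])
    return parser.forms
```

Calling `inspect_forms(r.text, r.url)` on the response above would list the target URL and field names, which can then be passed as `requests.post(action, data={...})`.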
I'm trying to use requests to query a website for the response to a form post, but the HTML I get back in response.text when posting with requests differs from what I see when I use the site manually (filling out the form and clicking the submit button).
When posting the form manually, the site redirects back to the form page, with new text (some new <h#> and <ul> objects) showing below the form. However, with requests.post, my response.text object just gives me the content of the page as if I were doing a requests.get, suggesting to me the redirect when using the site manually is different from the redirect I get from requests.
Any idea how I can get my response.text to match up with what I see using the site manually? Or maybe the response object isn't even what I should use to get that text? My thoughts are maybe the website manually redirects to the same form page as a POST, and requests forces redirect as a GET, and I need to override this feature somehow?
Here's code I'm using:
import requests
get_resp = requests.get(url="https://example-site.com")
# It's a Django site so I need to get the csrftoken
csrf_token = get_resp.cookies['csrftoken']
post_resp = requests.post(
    url="https://example-site.com",
    data={"key1": "value1",
          "key2": "value2"},
    headers={"X-CSRFToken": csrf_token},
)
print(post_resp.text)
Thank you for the help!
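One thing that may be worth trying, sketched below under assumptions: Django compares the token in the header with the `csrftoken` cookie, so the GET and the POST should go through the same session, and on HTTPS it also checks the `Referer` header. requests, like browsers, follows a 302 or 303 after a POST with a GET, and `response.history` shows which redirects happened. The helper names `post_django_form` and `describe_redirects` are invented for this example, and `session` is assumed to behave like `requests.Session()`.

```python
def post_django_form(session, url, data):
    """GET the form page for its CSRF cookie, then POST through the same session.

    `session` is assumed to behave like requests.Session.
    """
    session.get(url)  # the response sets the csrftoken cookie on the session
    token = session.cookies.get("csrftoken")
    headers = {
        "X-CSRFToken": token,
        # Django checks the Referer header on HTTPS requests.
        "Referer": url,
    }
    return session.post(url, data=data, headers=headers)


def describe_redirects(response):
    """Return (status_code, url) for every hop, ending with the final response."""
    hops = list(response.history) + [response]
    return [(r.status_code, r.url) for r in hops]
```

If `describe_redirects(post_resp)` shows a redirect back to the form page, the text that appears in the browser may be stored server-side (for example in the session) and only rendered when the cookies from the POST are sent along, which a shared session takes care of.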
I am trying to create pages on my wordpress site automatically with a Python script. With mechanize, I log in and fill the corresponding form on my dashboard but can't seem to publish. When submitting the form, I get an error page that I can't access manually. Its url is http://example.com/wordpress/wp-admin/post.php?post=443&action=edit&message=10. However, the page seems to have been created since it is in the drafts with the correct title, the correct content and Wordpress tells me that it is currently being modified by my bot's account. Here is the code I am using to fill and submit my form:
import mechanize

br = mechanize.Browser()
br.open("http://example.com/wordpress/wp-admin/post-new.php?post_type=page")
br.select_form(name = "post")
br.form.set_value(title, name = "post_title")
br.form.set_value(content, name = "content")
br.submit()
Thank you for your help,
Ryunos
Basically I want to send a POST request for the following form.
<form method="post" action="">
449 * 803 - 433 * 406 = <input size=6 type="text" name="answer" />
<input type="submit" name="submitbtn" value="Submit" />
</form>
What I basically want to do is read through the page, find out the equation in the form, calculate the answer, enter the answer as parameter to send with the POST request, but without opening a new URL for the page, as a new equation comes up every time the page is opened, hence the previously obtained result becomes obsolete. Finally I want to obtain the page that comes up as a result of sending the POST request. I'm stuck at the part where I have to send a POST request without opening a new URL instance. Also, I would appreciate help on how to read through the page again after the POST request. (would calling read() suffice?)
The python code I have currently looks something like this.
import urllib, urllib2
link = "http://www.websitetoaccess.com"
f = urllib2.urlopen(link)
line = f.readline().strip()
equation = ''
result = ''
file1 = open ('firstPage.html' , 'w')
file2 = open ('FinalPage.html', 'w')
for line in f:
    if 'name="answer"' in line:
        result = getResult(line)
    file1.write(line)
file1.close()
raw_params = {'answer': str(result), 'submit': 'Submit'}
params = urllib.urlencode(raw_params)
request = urllib2.Request(link, params)
page = urllib2.urlopen(request)
file2.write(page.read())
file2.close()
Yeah, that last link really helped, turns out I just needed to create a new session from requests like so:
s = requests.session()
res1 = s.get(url)
And add this as the post request after
res2 = s.post(url, data=post_params)
I believe this achieves the result of storing the cookies from the get request and sending them with the post request, thus maintaining the same question as the previous get request. Many thanks for your help and assistance in this problem Loknar.
I'm a bit puzzled, the POST request will always be a new separate request so I don't understand what you mean by "without opening a new URL instance" ... have you tried taking a look at what happens when you do what you're trying to do in this script manually? Like open developer console in Chrome, go to the network tab, toggle preserve log to on, delete history, and do what you're trying to do manually? Then replicate that in python? Also I recommend you try out the requests module, it makes things simpler than using urllib. Simply pip install requests (and pip install lxml).
import requests
from lxml import etree
url = 'http://www.websitetoaccess.com'
res1 = requests.get(url)
# do something with res1.content
# you could try parsing the html page with lxml
root = etree.fromstring(res1.content, etree.HTMLParser())
# do something with root, find question and calc answer?
post_params = {'answer': str(42), 'submit': 'Submit'}
res2 = requests.post(url, data=post_params)
# check res2 for success or content?
edit:
You're possibly experiencing some header issue or cookies issue. You might be receiving some session ID which enables the server to determine what question you received in the previous GET request. The POST request is a separate request from the previous GET request, it can't be combined to one single request. You should check the headers received from the previous GET request and/or try setting up session/cookies handling (easy to do if using requests, see https://requests.readthedocs.io/en/master/user/advanced/).
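Putting the pieces together, here is a hedged sketch of the whole round trip for the form shown in the question: read the equation from the page, evaluate it without `eval`, and post the answer through the same session so the server can match it to the question it issued. `solve_equation` and `answer_challenge` are illustrative names, the regular expression assumes the equation sits directly before the `answer` input as in the posted HTML, and the submit button is sent under its actual name `submitbtn`.

```python
import ast
import operator
import re

# Only these operators are accepted; anything else raises ValueError.
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul}

_EQUATION = re.compile(r'([\d\s*+\-]+)=\s*<input[^>]*name="answer"')


def _evaluate(node):
    """Evaluate an AST made only of integers and +, -, * operators."""
    if isinstance(node, ast.Constant) and isinstance(node.value, int):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_evaluate(node.left), _evaluate(node.right))
    raise ValueError("unsupported expression")


def solve_equation(html_text):
    """Find the arithmetic expression before the answer field and compute it."""
    match = _EQUATION.search(html_text)
    if match is None:
        raise ValueError("no equation found before the answer field")
    expression = match.group(1).strip()
    return _evaluate(ast.parse(expression, mode="eval").body)


def answer_challenge(session, url):
    """GET the question and POST the answer through the same session.

    `session` is assumed to behave like requests.Session, so cookies from
    the GET are reused and the server sees both requests as one visit.
    """
    question_page = session.get(url)
    answer = solve_equation(question_page.text)
    return session.post(url, data={"answer": str(answer), "submitbtn": "Submit"})
```

With requests installed, `answer_challenge(requests.Session(), url).text` would hold the page returned after the POST, so no separate `read()` call is needed.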
I need to solve this problem. I have to send a form using post request, but I have to send it to the page where the form is, and in all the solutions that I find to send the request you have to send it to the page where the results come, and to me this doesn't work because the page that I need to send the request to has a specific SID to each search you enter. Is there any way I could do this?
Example of code that doesn't work for my problem:
import urllib, urllib2
post_data = [("doesn't","work")]
result = urllib2.urlopen('http://www.for.my.problem', urllib.urlencode(post_data))
content = result.read()
thank you
If the request needs to be sent to the same page the form is on, why not just call the response method directly? Since they are on the same page, the method can be accessed.