I'm sure that it's pretty easy, but I really can understand it.
I need to write a script with python. The script has to take a link and send this it to http://archive.org/web/, to be more precise the script has to put this link to form "Save page now":
<form id='wwmform_save' name="wwmform_save" method="get" action="#" onsubmit="if (''==$('#web_save_url').val()){$('#web_save_url').attr('placeholder', 'enter a web address')} else {document.location.href='//web.archive.org/save/'+$('#web_save_url').val();} return false;" style="display:inline;">
<input id='web_save_url' class="web_input web_text" type="text" name="url" placeholder="http://" />
<button id='web_save_button' type="submit" class="web_button web_text">SAVE PAGE</button>
</form>
And get an achieved link.
I would like to use the "Requests" library, but can't understan how.
Should I make a request first?
I think I use request.post but I don't realize what parameters I have to use.
Edited:
I did like n1c9 has written, it words and saves links but also I need the link where page was saved. When send request to http://web.archive.org/save/(link) it's loadind few seconds and then redirect to the needed link.
url = 'urlyouwanttoarchive.com'
archive = 'http://web.archive.org/save/'
requests.get(archive + url)
and if you want the URL to the newly archived page:
print(archive + url)
edit: if you had a list of URLs you wanted to archive, this would work too:
urls = ['url1.com','url2.com','url3.com','url4.com']
for i in urls:
requests.get(archive + i)
Related
I'm a newbie trying to extract files from this web page http://www.cnmv.es/ipps/ (Spanish companies information)
The problem is that I have to fill first a few fields (company, semester, year) and then click on download. Using the browser, it starts a download of a .zip file that contains one or more .xbrl files, but I can't find a way to do it in python via requests or similar (there is no URL in the download button), getting the file content to a variable and save a file in a path.
What I tried is what I could find in the web about similar issues, I read something about ajax, json, beautifulsoup... but no results. My actual script is wrong because the only thing I get is the response but not the target file, and I need your help, please.
Here you can find a draft of what i have in mind, this is similar to my actual script.
from requests import Session
s = Session()
Company = [''] #Companies string array
Semester = [''] #Semester string array
Year = [''] #Years string array
for x in range(Company):
for y in range(Semester):
for z in range(Year):
#request the data and receive the desired information
response = s.post(
url='http://www.cnmv.es/ipps/',
data = {
'wDescargas$drpEntidades': Company[x], #search parameters
'wDescargas$drpPeriodos': Semester[y],
'wDescargas$drpEjercicios': Year[z])
},
headers={
'Referer': 'http://www.cnmv.es/ipps/',
}
)
#save the content of the target file in a path
data = response.content
filename = Semester[y] + Company[x] + Year[z]
with open(filename,'w+b') as s:
s.write(data)
Thanks a lot for your help.
I would suggest you use the selenium package in Python3 to automate the whole process because that site is using the .NET framework and there are a lot of POST values other than 'wDescargas$drpEntidades', 'wDescargas$drpPeriodos', 'wDescargas$drpEjercicios'...
Just check the source code of that website and you'll understand why using requests package is not a good choice here...
<form name="form1" method="post" action="./" id="form1">
<div>
<input type="hidden" name="__EVENTTARGET" id="__EVENTTARGET" value="" />
<input type="hidden" name="__EVENTARGUMENT" id="__EVENTARGUMENT" value="" />
<input type="hidden" name="__LASTFOCUS" id="__LASTFOCUS" value="" />
<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="/wEPDwULLTE5MTQ0MzYzMDEPFgQeD01vZG9WaXN0YVBhZ2luYQspYUNOTVYuWEJSTElQUC5Nb2RvVmlzdGFQYWdpbmEsIENOTVYuWEJSTElQUCwgVmVyc2lvbj0xLjAuMC4wLCBDdWx0dXJlPW5ldXRyYWwsIFB1YmxpY0tleVRva2VuPW51bGwBHgxXZWJFbnRpZGFkZXMyrQIAAQAAAP////8BAAAAAAAAAAwCAAAAQ0NOTVYuWEJSTElQUCwgVmVyc2lvbj0xLjAuMC4wLCBDdWx0dXJlPW5ldXRyYWwsIFB1YmxpY0tleVRva2VuPW51bGwFAQAAAChDTk1WLlhCUkxJUFAuRW50aXRpZXMuV2ViRW50aWRhZGVzRUVMaXN0AwAAAA1MaXN0YDErX2l0ZW1zDExpc3RgMStfc2l6ZQ9MaXN0YDErX3ZlcnNpb24EAAAkQ05NVi5YQlJMSVBQLkVudGl0aWVzLldlYkVudGlkYWRFRVtdAgAAAAgIAgAAAAkDAAAAAAAAAAAAAAAHAwAAAAABAAAAAAAAAAQiQ05NVi5YQlJMSVBQLkVudGl0aWVzLldlYkVudGlkYWRFRQIAAAALFgICAw9kFgQCBw9kFgJmD2QWBgIBDw8WBh4IQ3NzQ2xhc3MFKWJ0biBidG4tT3BjaW9uIGNlbnRlci1ibG9jayBBY3Rpdm8gQWN0aXZvHgdFbmFibGVkaB4EXyFTQgICZBYCAgEPDxYCHghJbWFnZVVybAUWfi9pbWFnZXMvT3BjaW9uMU9uLmdpZmRkAgMPDxYEHwIFG2J0biBidG4tT3BjaW9uIGNlbnRlci1ibG9jax8EAgJkFgICAQ8PFgIfBQUXfi9pbWFnZXMvT3BjaW9uMk9mZi5naWZkZAIHD2QWBGYPZBYCZg9kFgwCAw8QZGQWAWZkAgUPZBYCAgMPEGRkFgBkAgcPZBYCAgMPEGRkFgBkAgkPZBYCAgMPEGRkFgBkAgsPZBYCAgMPEGRkFgBkAg0PZBYCAgMPEGRkFgBkAgEPZBYCAgEPZBYGAggPZBYCAgMPEGRkFgBkAgoPZBYCAgMPEGRkFgBkAgwPZBYEAgMPEGRkFgFmZAIHDxBkZBYAZAIJD2QWAmYPZBYEAgEPZBYEAgEPDxYCHwUFFn4vaW1hZ2VzL09wY2lvbjFPbi5naWZkZAIDDxYCHgRUZXh0BR5WaXN1YWxpemFjacOzbiBkZSBpbmZvcm1lcyBJUFBkAgMPZBYGZg9kFgICAw8WAh8GBaYEIGEgbGEgaGVycmFtaWVudGEgZGUgY29uc3VsdGEgeSBkZXNjYXJnYSBkZSBpbmZvcm1lcyBYQlJMIGNvbiBsb3MgZXN0YWRvcyBmaW5hbmNpZXJvcyBlbnZpYWRvcyBhIGxhIENOTVYgZW4gZWwgbWFyY28gZXN0YWJsZWNpZG8gcG9yIGxhcyBDaXJjdWxhcmVzIDxhIGhyZWY9Jy9JUFAvdGF4b25vbWlhLzIwMTYtMDYtMDEvaXBwXzIwMTYtMDYtMDEuemlwJyB0aXRsZT0nSXIgYSBDaXJjdWxhciA1LzIwMTUnPjUvMjAxNTwvYT4sIDxhIGhyZWY9Jy9JUFAvdGF4b25vbWlhLzIwMDgtMDEtMDEvaXBwXzIwMDgtMDEtMDEuemlwJyB0aXRsZT0nSXIgYSBDaXJjdWxhciAxLzIwMDgnPjEvMjAwODwvYT4geSA8YSBocmVmPScvSVBQL3RheG9ub21pYS8yMDA1LTA2LTMwL2lwcF8yMDA1LTA2LTMwX3YxLjIyLnppcCcgdGl0bGU9J0lyIGEgQ2lyY3VsYXIgMS8yMDA1Jz4xLzIwMDU8L2E+IHBhcmEgZWwgcmVwb3J0ZSBkZSBsYSBJbmZvcm1hY2nDs24gUMO6YmxpY2EgUGVyacOzZGljYSBkZSBsYXMgZW50aWRhZGVzIGNvbiB2YWxvcmVzIGFkbWl0aWRvcyBhIGNvdGl6YWNpw7NuLmQCAQ9kFgICDQ9kFgQCAQ9kFgYCAQ9kFgICAw8QZGQWAGQCAw9kFgICAw8QZGQWAGQCBQ9kFgICAw8QZGQWAGQCAg9kFgYCAQ9kFgZmDxBkZBYAZAIBDxBkZBYAZAIDDxBkZBYBZmQCAw9kFgoCAw8QZGQWAGQCBQ8QZGQWAGQCBw8QZGQWAGQCDQ8QZGQWAQIBZAIRDxBkZBYBZmQCBQ9kFgoCAw8QZGQWAGQCBQ8QZGQWAGQCBw8QZGQWAGQCDQ8QZGQWAQIBZAIRDxBkZBYBZmQCAg9kFgQCAQ9kFggCCQ8QZGQWAGQCEw8QZGQWAGQCFw8QZGQWAGQCGw8QZGQWAGQCAw9kFgICAQ88KwARAgEQFgAWABYADBQrAABkGAMFC212Q29udGVuaWRvDw9kZmQFH3dEZXNjYXJnYXNfTGlzdGFkbyRncmlkSW5mb3JtZXMPZ2QFDG12TW9kb1BhZ2luYQ8PZGZkIeVZrmpOfPFqcHtyXhTP+ho+VemJP+fQZiuA1wu5cOc=" />
I am trying to use requests to fill out a form on https://www.doleta.gov/tradeact/taa/taa_search_form.cfm and return the HTML of the new page that this opens and extract information from the new page.
Here is the relevant HTML
<form action="taa_search.cfm" method="post" name="number_search" id="number_search" onsubmit="return validate(this);">
<label for="input">Petition number</label>
:
<input name="input" type="text" size="7" maxlength="7" id="input">
<input type="hidden" name="form_name" value="number_search" />
<input type=submit value="Get TAA information" />
</form>
Here is the python code I am trying to use.
url = 'https://www.doleta.gov/tradeact/taa/taa_search.cfm'
payload = {'number_search':'11111'}
r = requests.get(url, params=payload)
with open("requests_results1.html", "wb") as f:
f.write(r.content)
When you perform the query manually, this page opens https://www.doleta.gov/tradeact/taa/taa_search.cfm.
However, when I use the above Python code, it returns the HTML of https://www.doleta.gov/tradeact/taa/taa_search_form.cfm (the first page) and nothing is different.
I cannot perform similar code on https://www.doleta.gov/tradeact/taa/taa_search.cfm because it redirects to the first URL and thus, running the code returns the HTML of the first URL.
Because of the permissions setup of my computer, I cannot redirect the path of my PC (which means Selenium is off the table) and I cannot install Python 2 (which means mechanize is off the table). I am open to using urllib but do not know the library very well.
I need to perform this action ~10,000 times to scrap the information. I can build the iteration part myself, but I cannot figure out how to get the base function to work properly.
The first observation is that you seem to be using a get request in your example code instead of a post request.
<form action="taa_search.cfm" method="post" ...>
^^^^^^^^^^^^^
After changing to a post request, I was still getting the same results as you though (html from the main search form page). After a bit of experimentation, I seem to be able to get the proper html results by adding a referer to the header.
Here is the code (I only commented out the writing to file part for example purposes):
import requests
BASE_URL = 'https://www.doleta.gov/tradeact/taa'
def get_case_decision(case_number):
headers = {
'referer': '{}/taa_search_form.cfm'.format(BASE_URL)
}
payload = {
'form_name': 'number_search',
'input': case_number
}
r = requests.post(
'{}/taa_search.cfm'.format(BASE_URL),
data=payload,
headers=headers
)
r.raise_for_status()
return r.text
# with open('requests_results_{}.html'.format(case_number), 'wb') as f:
# f.write(r.content)
Testing:
>>> result = get_case_decision(10000)
>>> 'MODINE MFG. COMPANY' in result
True
>>> '9/12/1980' in result
True
>>> result = get_case_decision(10001)
>>> 'MUSKIN CORPORATION' in result
True
>>> '2/27/1981' in result
True
Since you mentioned that you need to perform this ~10,000 times, you will probably want to look into using requests.Session as well.
I'm trying to "develop" an application that stores todo tasks in a table. These tasks are organised like sub tasks for a "head" task.
The problem is that when submitting an input as a sub task via POST I need the name of the head-task as well. I can't figure out how to send the information have stored in my <h1>-label with the information I have in my <input>.
Here is my code example:
<h1>Taskheader</h1>
<form action="/todo/task/" method="post">
<input type="text" name="Taskitem">
<input type="submit" value="Submit" />
</form>
Is it possible to send this information off, without using overly complicated code for a beginner?
give the h1 an id for example id = "header-task-name" then get the h1 element by that id and get the input text both using javascript, finally send the task header and the input text using post and ajax to the server.
another approach is to wrap each task header and it's form in a div then get the parent div of the form using this :
var x = document.getElementById("the form").parentElement.nodeName;
then you can access the header of the form by using the parent div to get the h1 child
I have exactly the same problem as this post
Python submitting webform using requests
but your answers do not solve it. When I execute this HTML file called api.htm in the browser, then for a second or so I see its page.
Then the browser shows the data I want with the URL https://api.someserver.com/api/ as as per the action below. But I want the data written to a file so I try the Python 2.7 script below.
But all I get is the source code of api.htm Please put me on the right track!
<html>
<body>
<form id="ID" method="post" action="https://api.someserver.com/api/ ">
<input type="hidden" name="key" value="passkey">
<input type="text" name="start" value ="2015-05-01">
<input type="text" name="end" value ="2015-05-31">
<input type="submit" value ="Submit">
</form>
<script type="text/javascript">
document.getElementById("ID").submit();
</script>
</body>
</html>
The code:
import urllib
import requests
def main():
try:
values = {'start' : '2015-05-01',
'end' : '2015-05-31'}
req=requests.post("http://my-api-page.com/api.htm",
data=urllib.urlencode(values))
filename = "datafile.csv"
output = open(filename,'wb')
output.write(req.text)
output.close()
return
main()
I can see several problems:
Your post target URL is incorrect. The form action attribute tells you where to post to:
<form id="ID" method="post" action="https://api.someserver.com/api/ ">
You are not including all the fields; type=hidden fields need to be posted too, but you are ignoring this one:
<input type="hidden" name="key" value="passkey">
Do not URL-encode your POST variables yourself; leave this to requests to do for you. By encoding yourself requests won't recognise that you are using an application/x-www-form-urlencoded content type as the body. Just pass in the dictionary as the data parameters and it'll be encoded for you and the header will be set.
You can also stream the response straight to a file object; this is helpful when the response is large. Switch on response streaming, make sure the underlying raw urllib3 file-like object decodes from transfer encoding and use shutil.copyfileobj to write to disk:
import requests
import shutil
def main():
values = {
'start': '2015-05-01',
'end': '2015-05-31',
'key': 'passkey',
}
req = requests.post("http://my-api-page.com/api.htm",
data=values, stream=True)
if req.status_code == 200:
with open("datafile.csv", 'wb') as output:
req.raw.decode_content = True
shutil.copyfileobj(req.raw, output)
There may still be issues with that key value however; perhaps the server sets a new value for each session, coupled with a cookie, for example. In that case you'd have to use a Session() object to preserve cookies, first do a GET request to the api.htm page, parse out the key hidden field value and only then post. If that is the case then using a tool like robobrowser might just be easier.
I have a small .py program, rendering 2 HTML pages. One of those HTML pages has a form in it. A basic form requesting a name, and a comment. I can not figure out how to take the name and the comment from the form and store it into the csv file. I have got the coding so that the very little I already manually input into the csv file is printed/returned on the HTML page, which is one of the goals. But I can't get the data I input into the form into the csv file, then back n the HTML page. I feel like this is a simple fix, but the Flask book makes absolutely no sense to me, I'm dyslexic and I find it impossible to make sense of the examples and the written explanations.
This is the code I have for reading the csv back onto the page;
#app.route('/guestbook')
def guestbook():
with open('nameList.csv','r') as inFile:
reader=csv.reader(inFile)
names=[row for row in reader]
return render_template('guestbook.html',names=names[1:])
And this is my form coding;
<h3 class="tab">Feel free to enter your comments below</h3>
<br />
<br />
<form action="" method="get" enctype="text/plain" name="Comments Form">
<input id="namebox" type="text" maxlength="45" size="32" placeholder="Name"
class="tab"/>
<br />
<textarea id="txt1" class="textbox tab" rows="6" placeholder="Your comment"
class="tab" cols="28"></textarea>
<br />
<button class="menuitem tab" onclick="clearComment()" class="tab">Clear
comment</button>
<button class="menuitem" onclick="saveComment()" class="tab">Save comment</button>
<br>
</div>
By what I understand all you need is to save the data into the file and you don't know how to handle this in Flask, I'll try to explain it with code as clear as possible:
# request is a part of Flask's HTTP requests
from flask import request
import csv
# methods is an array that's used in Flask which requests' methods are
# allowed to be performed in this route.
#app.route('/save-comment', methods=['POST'])
def save_comment():
# This is to make sure the HTTP method is POST and not any other
if request.method == 'POST':
# request.form is a dictionary that contains the form sent through
# the HTTP request. This work by getting the name="xxx" attribute of
# the html form field. So, if you want to get the name, your input
# should be something like this: <input type="text" name="name" />.
name = request.form['name']
comment = request.form['comment']
# This array is the fields your csv file has and in the following code
# you'll see how it will be used. Change it to your actual csv's fields.
fieldnames = ['name', 'comment']
# We repeat the same step as the reading, but with "w" to indicate
# the file is going to be written.
with open('nameList.csv','w') as inFile:
# DictWriter will help you write the file easily by treating the
# csv as a python's class and will allow you to work with
# dictionaries instead of having to add the csv manually.
writer = csv.DictWriter(inFile, fieldnames=fieldnames)
# writerow() will write a row in your csv file
writer.writerow({'name': name, 'comment': comment})
# And you return a text or a template, but if you don't return anything
# this code will never work.
return 'Thanks for your input!'