I am trying to download the CSV files from this page via a Python script.
But when I try to access a CSV file directly through its link in my browser, an agreement form is displayed. I have to agree to this form before I am allowed to download the file.
The exact URLs of the CSV files can't be retrieved; the link only carries a value that is sent to the backend database, which then fetches the file - e.g. PERIOD_ID=2013-0:
https://www.paoilandgasreporting.state.pa.us/publicreports/Modules/DataExports/ExportProductionData.aspx?PERIOD_ID=2013-0
I've tried urllib2.urlopen() and .read(), but that returns the HTML content of the agreement form, not the file content.
How do I write Python code that handles this redirect, then fetches the CSV file and saves it to disk?
You need to set the ASP.NET_SessionId cookie. You can find this by using Chrome's Inspect element option in the context menu, or by using Firefox and the Firebug extension.
With Chrome:
Right-click on the webpage (after you've agreed to the terms) and select Inspect element
Click Resources -> Cookies
Select the only element in the list
Copy the Value of the ASP.NET_SessionId element
With Firebug:
Right-click on the webpage (after you've agreed to the terms), and click Inspect Element with Firebug
Click Cookies
Copy the Value of the ASP.NET_SessionId element
In my case, I got ihbjzynwfcfvq4nzkncbviou. It might work for you; if not, you'll need to perform the above procedure yourself.
Add the cookie to your request, and download the file using the requests module (based on an answer by eladc):
import requests

cookies = {'ASP.NET_SessionId': 'ihbjzynwfcfvq4nzkncbviou'}

r = requests.get(
    url=('https://www.paoilandgasreporting.state.pa.us/publicreports/Modules/'
         'DataExports/ExportProductionData.aspx?PERIOD_ID=2013-0'),
    cookies=cookies
)

with open('2013-0.csv', 'wb') as ofile:
    for chunk in r.iter_content(chunk_size=1024):
        ofile.write(chunk)
        ofile.flush()
Here's my suggestion for automatically applying the server cookies and basically mimicking standard client session behavior.
(Shamelessly inspired by @pope's answer 554580.)
import urllib2
import urllib
from lxml import etree

_TARGET_URL = 'https://www.paoilandgasreporting.state.pa.us/publicreports/Modules/DataExports/ExportProductionData.aspx?PERIOD_ID=2013-0'
_AGREEMENT_URL = 'https://www.paoilandgasreporting.state.pa.us/publicreports/Modules/Welcome/Agreement.aspx'
_CSV_OUTPUT = 'urllib2_ProdExport2013-0.csv'


class _MyHTTPRedirectHandler(urllib2.HTTPRedirectHandler):
    def http_error_302(self, req, fp, code, msg, headers):
        print 'Follow redirect...'  # Any cookie manipulation in-between redirects should be implemented here.
        return urllib2.HTTPRedirectHandler.http_error_302(self, req, fp, code, msg, headers)

    http_error_301 = http_error_303 = http_error_307 = http_error_302


cookie_processor = urllib2.HTTPCookieProcessor()

opener = urllib2.build_opener(_MyHTTPRedirectHandler, cookie_processor)
urllib2.install_opener(opener)

response_html = urllib2.urlopen(_TARGET_URL).read()

print 'Cookies collected:', cookie_processor.cookiejar

page_node, submit_form = etree.HTML(response_html), {}  # ElementTree node + dict for storing hidden input fields.
for input_name in ['ctl00$MainContent$AgreeButton', '__EVENTVALIDATION', '__VIEWSTATE']:  # Form `input` fields used on the ``Agreement.aspx`` page.
    submit_form[input_name] = page_node.xpath('//input[@name="%s"][1]' % input_name)[0].attrib['value']
    print 'Form input \'%s\' found (value: \'%s\')' % (input_name, submit_form[input_name])

# Submits the agreement form back to ``_AGREEMENT_URL``, which redirects to the CSV download at ``_TARGET_URL``.
csv_output = opener.open(_AGREEMENT_URL, data=urllib.urlencode(submit_form)).read()
print csv_output

# Dumps the CSV output to ``_CSV_OUTPUT``.
with open(_CSV_OUTPUT, 'wb') as f:
    f.write(csv_output)
Good luck!
[Edit]
On the why of things, I think @Steinar Lima is correct that a session cookie is required. But unless you've already visited the Agreement.aspx page and submitted a response via the provider's website, the cookie you copy from the browser's web inspector will only result in another redirect to the Welcome to the PA DEP Oil & Gas Reporting Website welcome page, which of course defeats the whole point of having a Python script do the job for you.
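For reference, the same agree-then-download flow can also be written with requests and lxml; this is only a rough, untested sketch that reuses the form field names identified above and assumes the Agreement.aspx page still exposes them:

import requests
from lxml import etree

target_url = ('https://www.paoilandgasreporting.state.pa.us/publicreports/Modules/'
              'DataExports/ExportProductionData.aspx?PERIOD_ID=2013-0')
agreement_url = ('https://www.paoilandgasreporting.state.pa.us/publicreports/Modules/'
                 'Welcome/Agreement.aspx')

session = requests.Session()  # keeps ASP.NET_SessionId and any other server cookies automatically

# The first request gets redirected to the agreement page.
page = etree.HTML(session.get(target_url).content)

# Collect the hidden/submit fields the form expects (same names as in the urllib2 version above).
form = {}
for name in ['ctl00$MainContent$AgreeButton', '__EVENTVALIDATION', '__VIEWSTATE']:
    form[name] = page.xpath('//input[@name="%s"][1]' % name)[0].attrib.get('value', '')

# Submitting the agreement redirects back to the CSV export.
csv_response = session.post(agreement_url, data=form)

with open('2013-0.csv', 'wb') as f:
    f.write(csv_response.content)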
Related
I'm trying to access an authenticated site using a cookies.txt file (generated with a Chrome extension) with Python Requests:
import requests, cookielib
cj = cookielib.MozillaCookieJar('cookies.txt')
cj.load()
r = requests.get(url, cookies=cj)
It doesn't throw any error or exception, but yields the login screen, incorrectly. However, I know that my cookie file is valid, because I can successfully retrieve my content using it with wget. Any idea what I'm doing wrong?
Edit:
I'm tracing cookielib.MozillaCookieJar._really_load and can verify that the cookies are correctly parsed (i.e. they have the correct values for the domain, path, secure, etc. tokens). But as the transaction is still resulting in the login form, it seems that wget must be doing something additional (as the exact same cookies.txt file works for it).
MozillaCookieJar inherits from FileCookieJar which has the following docstring in its constructor:
Cookies are NOT loaded from the named file until either the .load() or
.revert() method is called.
So you need to call the .load() method.
Also, as Jermaine Xu noted, the first line of the file needs to contain either the # Netscape HTTP Cookie File or the # HTTP Cookie File string. Files generated by the plugin you use do not contain such a string, so you have to insert it yourself. I raised the appropriate bug at http://code.google.com/p/cookie-txt-export/issues/detail?id=5
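If you'd rather add that header line from Python than by hand, a minimal sketch (the file name is just an example) could look like this:

HEADER = '# Netscape HTTP Cookie File\n'

with open('cookies.txt', 'r') as f:
    content = f.read()

# Prepend the header only if it is not already there.
if not content.startswith('# Netscape HTTP Cookie File') and \
        not content.startswith('# HTTP Cookie File'):
    with open('cookies.txt', 'w') as f:
        f.write(HEADER + content)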
EDIT
Session cookies are saved with 0 in the 5th column. If you don't pass ignore_expires=True to the load() method, all such cookies are discarded when loading from a file.
File session_cookie.txt:
# Netscape HTTP Cookie File
.domain.com TRUE / FALSE 0 name value
Python script:
import cookielib
cj = cookielib.MozillaCookieJar('session_cookie.txt')
cj.load()
print len(cj)
Output:
0
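For comparison, loading the same file with ignore_expires=True keeps the session cookie in the jar; this is just a sketch continuing the example above:

import cookielib
cj = cookielib.MozillaCookieJar('session_cookie.txt')
cj.load(ignore_expires=True)  # do not drop cookies whose expiry time is 0
print len(cj)  # prints 1 - but see EDIT 2: the cookie is still treated as expired later on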
EDIT 2
Although we managed to get the cookies into the jar above, they are subsequently discarded by cookielib because they still have a value of 0 in the expires attribute. To prevent this, we have to set the expiry time to some point in the future, like so:
import time

for cookie in cj:
    # set cookie expire date to 14 days from now
    cookie.expires = time.time() + 14 * 24 * 3600
EDIT 3
I checked both wget and curl; both use a 0 expiry time to denote session cookies, which means it's the de facto standard. However, Python's implementation uses an empty string for the same purpose, hence the problem raised in the question. I think Python's behavior in this regard should be in line with what wget and curl do, and that's why I raised the bug at http://bugs.python.org/issue17164
I'll note that replacing 0s with empty strings in the 5th column of the input file and passing ignore_discard=True to load() is an alternative way of solving the problem (no need to change expiry times in this case).
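A rough sketch of that alternative (the helper below is hypothetical, not part of cookielib): rewrite the 5th column before loading, then pass ignore_discard=True so the resulting session cookies are kept:

import cookielib

def load_session_cookies(path):
    """Load a Netscape cookies.txt file, rewriting 0-expiry lines so cookielib
    treats them as session cookies."""
    fixed_path = path + '.fixed'
    with open(path) as src, open(fixed_path, 'w') as dst:
        for line in src:
            fields = line.rstrip('\n').split('\t')
            if len(fields) == 7 and fields[4] == '0':
                fields[4] = ''  # empty expiry column == session cookie for cookielib
            dst.write('\t'.join(fields) + '\n')
    cj = cookielib.MozillaCookieJar(fixed_path)
    cj.load(ignore_discard=True)  # keep the session cookies in the jar
    return cj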
I tried taking into account everything that Piotr Dobrogost had valiantly figured out about MozillaCookieJar but to no avail. I got fed up and just parsed the damn cookies.txt myself and now all is well:
import re
import requests
def parseCookieFile(cookiefile):
    """Parse a cookies.txt file and return a dictionary of key value pairs
    compatible with requests."""
    cookies = {}
    with open(cookiefile, 'r') as fp:
        for line in fp:
            if not re.match(r'^\#', line):
                lineFields = line.strip().split('\t')
                cookies[lineFields[5]] = lineFields[6]
    return cookies

cookies = parseCookieFile('cookies.txt')

import pprint
pprint.pprint(cookies)

r = requests.get('https://example.com', cookies=cookies)
This worked for me:
from http.cookiejar import MozillaCookieJar
from pathlib import Path
import requests
cookies = Path('/Users/name/cookies.txt')
jar = MozillaCookieJar(cookies)
jar.load()
requests.get('https://path.to.site.com', cookies=jar)
<Response [200]>
I tried editing Tristan's answer to add some info to it, but it seems the SO edit queue is full, so I am writing this answer instead, since I have struggled badly with using existing cookies with Python requests.
First, get the cookies from Chrome. The easiest way is to use an extension called 'cookies.txt':
https://chrome.google.com/webstore/detail/get-cookiestxt/bgaddhkoddajcdgocldbbfleckgcbcid/related
After downloading those cookies, use the code below to make sure that you are able to parse the file without any issues.
import re, requests, pprint

def parseCookieFile(cookiefile):
    """Parse a cookies.txt file and return a dictionary of key value pairs
    compatible with requests."""
    cookies = {}
    with open(cookiefile, 'r') as fp:
        for line in fp:
            if not re.match(r'^\#', line):
                lineFields = re.findall(r'[^\s]+', line)  # capturing anything but empty space
                try:
                    cookies[lineFields[5]] = lineFields[6]
                except Exception as e:
                    print(e)
    return cookies

cookies = parseCookieFile('cookies.txt')  # replace the filename
pprint.pprint(cookies)
Next, use those cookies with a Python request:
x = requests.get('your__url', verify=False, cookies=cookies)
print (x.content)
This should save you from going through different SO posts and trying cookielib and other methods that never worked for me.
I finally found a way to make it work (I got the idea by looking at curl's verbose output): instead of loading my cookies from a file, I simply created a dict with the required name/value pairs:
cd = {'v1': 'n1', 'v2': 'n2'}
r = requests.get(url, cookies=cd)
and it worked (although it doesn't explain why the previous method didn't). Thanks for all the help, it's really appreciated.
My requirement: read the contents of an input type="file" element with ID "rtfile1" and write them to a textarea with ID "rt1".
Based on the documentation at https://brython.info/ I tried to read a file, but it fails with this error:
Access to XMLHttpRequest at 'file:///C:/fakepath/requirements.txt' from origin 'http://example.com:8000' has been blocked by CORS policy: Cross origin requests are only supported for protocol schemes: http, data, chrome, chrome-extension, https.
I tried the following two Brython snippets; both of them failed with the same error.
Code 1:
def file_read(ev):
    doc['rt1'].value = open(doc['rtfile1'].value).read()

doc["rtfile1"].bind("input", file_read)
Code 2:
def file_read(ev):
    def on_complete(req):
        if req.status == 200 or req.status == 0:
            doc['rt1'].value = req.text
        else:
            doc['rt1'].value = "error " + req.text

    def err_msg():
        doc['rt1'].value = "server didn't reply after %s seconds" % timeout

    timeout = 4

    def go(url):
        req = ajax.ajax()
        req.bind("complete", on_complete)
        req.set_timeout(timeout, err_msg)
        req.open('GET', url, True)
        req.send()
        print('Triggered')

    go(doc['rtfile1'].value)

doc["rtfile1"].bind("input", file_read)
Any help would be greatly appreciated. Thanks!!! :)
It's not related to Brython (you would have the same result with the equivalent Javascript), but to the way you tell the browser which file you want to upload.
If you select the file by an HTML tag such as
<input type="file" id="rtfile1">
the object referenced by doc['rtfile1'] in the Brython code has an attribute value, but it is not the file path or URL; it's a "fakepath" built by the browser (as you can see in the error message), and you can't use it as an argument to the Brython function open(), or as a URL to send an Ajax request to. If you want to use the file URL, you should enter it in a basic input tag (without type="file").
It is better to select the file with type="file", but in this case the object doc['rtfile1'].files is a FileList object, described in the DOM's Web API, whose first element is a File object. Reading its content is unfortunately not as simple as with open(), but here is a working example:
from browser import window, document as doc

def file_read(ev):

    def onload(event):
        """Triggered when file is read. The FileReader instance is
        event.target.
        The file content, as text, is the FileReader instance's "result"
        attribute."""
        doc['rt1'].value = event.target.result

    # Get the selected file as a DOM File object
    file = doc['rtfile1'].files[0]
    # Create a new DOM FileReader instance
    reader = window.FileReader.new()
    # Read the file content as text
    reader.readAsText(file)
    reader.bind("load", onload)

doc["rtfile1"].bind("input", file_read)
Python - 2.7.5
Google Chrome
First off, I am a self-taught coder and will accept any critique and/or suggestions about the code posted below. This issue has been a joy to work through because I love challenging myself, but I am afraid I have hit a brick wall and need some guidance. I will be as detailed as possible to fully explain the overall picture of my script and then show where I am with the actual issue described in the title.
I am putting together a script that will download data automatically, unzip it, and export it to a GDB. We serve a wide region of users and have a very large enterprise SDE setup containing a large amount of public data that we have to search and update for our end users. Most of our data is updated monthly by local government entities, and we have to go out and search for the data manually, download it, unzip it, QAQC it, etc. I want to put a script together that will automate the first part of this process by downloading all my data for me and exporting it to a local GDB; from there I can QAQC everything and upload it to our SDE for our users to access.
The process has been pretty straightforward so far, until I hit the issue I have before me. My script will search a webpage for specific keywords, find the relevant link, and begin the download. For this post I will use two examples: one that works and one that is currently giving me issues. What works is my function for searching and downloading the Metro GIS dataset, and the code below shows my current process for finding it. So far all the HTTP sites I have included use the posted function. As with Metro, I plan on having a defined function for each group of data.
import requests, zipfile, StringIO, time, arcpy, urllib2, urlparse
from BeautifulSoup import BeautifulSoup

arcpy.env.overwriteOutput = True

workPath = --  # The output GDB
timestr = time.strftime("%Y%m%d")
gdbName = "GlobalSDEUpdate_" + timestr
gdbPath = workPath + "\\" + gdbName + ".gdb"

class global_DataFinder(object):
    def __init__(self):
        object.__init__(self)
        self.gdbSetup()
        self.metro()

    def gdbSetup(self):
        arcpy.CreateFileGDB_management(workPath, gdbName)

    def fileDownload(self, key, url, dlPath, dsName):
        page = urllib2.urlopen(url).read()
        urlList = []
        soup = BeautifulSoup(page)
        soup.prettify()
        for link in soup.findAll('a', href=True):
            if not 'http://' in link['href']:
                if urlparse.urljoin(url, link['href']) not in urlList:
                    zipDL = urlparse.urljoin(url, link['href'])
                    if zipDL.endswith(".zip"):
                        if key in zipDL:
                            urlList.append(zipDL)

        for x in urlList:
            print x
            r = requests.get(x, stream=True)
            z = zipfile.ZipFile(StringIO.StringIO(r.content))
            z.extractall(dlPath)

        arcpy.CreateFeatureDataset_management(gdbPath, dsName)
        arcpy.env.workspace = dlPath
        shpList = []
        for shp in arcpy.ListFeatureClasses():
            shpList.append(shp)
        arcpy.FeatureClassToGeodatabase_conversion(shpList, (gdbPath + "\\" + dsName))
        del shpList[:]

    def metro(self):
        key = "METRO_GIS_Data_Layers"
        url = "http://www.ridemetro.org/Pages/NewsDownloads.aspx"
        dlPath = --  # Where my zipfiles output to
        dsName = "Metro"
        self.fileDownload(key, url, dlPath, dsName)

global_DataFinder()
As you can see above, this is the method I started with, using Metro as my first test case, and it is currently working great. I was hoping all my sites going forward would work like this, but when I got to FEMA I ran into an issue.
The National Flood Hazard Layer (NFHL) Status website hosts floodplain data for many counties across the country, available for free to anyone who wishes to use it. When arriving at the website you will see that you can search for the county you want, the table then shows the query results, and you can simply click and download the county you desire. When checking the source, I noticed the table is inside an iframe.
When accessing the iframe source link through Chrome and checking its source URL, this is what you get - https://hazards.fema.gov/femaportal/NFHL/searchResult
Now here is where my problem lies: unlike the HTTP sites, I have quickly learned that accessing a secured HTTPS site and scraping the page is different, especially when it uses JavaScript to show the table. I have spent hours searching through forums and tried different Python packages like selenium, mechanize, requests, urllib, and urllib2, and I always seem to hit a dead end before I can establish a secure connection, parse the webpage, and search for my county's zipfile. The code below shows the closest I have gotten, along with the error I am getting.
(I always test in a separate script and bring it over to my main script once it works, which is why the code snippet below is separate from my original.)
import urllib2, httplib, socket, ssl
from BeautifulSoup import BeautifulSoup

url = "http://www.floodmaps.fema.gov/NFHL/status.shtml"

def test():
    page = urllib2.urlopen(url).read()
    urlList = []
    soup = BeautifulSoup(page)
    soup.prettify()
    for link in soup.findAll("iframe", src=True):
        r = urllib2.urlopen(link['src'])
        iFrame = link['src']
        print iFrame

def connect_patched(self):
    "Connect to a host on a given (SSL) port."
    sock = socket.create_connection((self.host, self.port),
                                    self.timeout, self.source_address)
    if self._tunnel_host:
        self.sock = sock
        self._tunnel()
    self.sock = ssl.wrap_socket(sock, self.key_file, self.cert_file,
                                ssl_version=ssl.PROTOCOL_SSLv2)

httplib.HTTPSConnection.connect = connect_patched

test()
Error I get when running this test
urllib2.URLError: urlopen error [Errno 6] _ssl.c:504: TLS/SSL connection has been closed
I am hoping a more experienced coder can see what I have done and tell me whether my current approach is the way to go, and if so, how to get past this final error and parse the data table properly.
Working Edits with @crmackey
import requests
import os
import zipfile
from pyquery import PyQuery
from requests.packages.urllib3.exceptions import InsecureRequestWarning, InsecurePlatformWarning, SNIMissingWarning
import httplib

httplib.HTTPConnection._http_vsn = 10
httplib.HTTPConnection._http_vsn_str = 'HTTP/1.0'

# disable ssl warnings (we are not verifying SSL certificates at this time...future enhancement?)
for warning in [SNIMissingWarning, InsecurePlatformWarning, InsecureRequestWarning]:
    requests.packages.urllib3.disable_warnings(warning)

def download_zips(out_path):
    url = 'http://www.floodmaps.fema.gov/NFHL/status.shtml'
    download_prefix = 'https://hazards.fema.gov/femaportal/NFHL'
    pq = PyQuery(requests.get(url, verify=False).content)  # verify param important for SSL
    src = pq.find('iframe').attr('src')
    pq = PyQuery(requests.get(src, verify=False).content)
    table = pq.find('table')

    for a in table.find('a'):
        href = a.attrib.get('href')
        print href
        url = '/'.join([download_prefix, href])
        print url
        r = requests.get(url, stream=True, verify=False)
        out_zip = os.path.join(out_path, href.split('=')[-1])
        with open(out_zip, 'wb') as f:
            for chunk in r.iter_content(1024 * 16):  # grab 16 KB at a time
                if chunk:
                    f.write(chunk)
        print 'downloaded zip: "{}"'.format(href.split('=')[-1])

out_path = r"C:\Users\barr\Desktop\Test"
download_zips(out_path)
All I added was the httplib import and the HTTPConnection changes at the top, which allowed me to connect to the site using your script. Now here is the current problem: I am only getting one zip file in my out_path, and that zip file is empty. I checked the printed source in the debug window and it shows the script trying to download the TERRITORY OF THE VIRGIN ISLAND zip file from the table, so it looks like it's trying but not actually downloading anything. After it outputs that one empty zip file, the script finishes with no further error messages. I temporarily removed your lines that unzipped the file because they raised an error since the archive was empty.
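One quick way to narrow this down (a hypothetical diagnostic, not part of the scripts above) is to check the status code and headers the server returns for one of those table links before writing anything to disk:

import requests

href = 'REPLACE_WITH_AN_HREF_PRINTED_BY_THE_SCRIPT'  # placeholder for one of the printed hrefs
r = requests.get('https://hazards.fema.gov/femaportal/NFHL/' + href,
                 stream=True, verify=False)
print r.status_code                    # anything other than 200 means no file came back
print r.headers.get('Content-Type')    # a real zip should not be text/html
print r.headers.get('Content-Length')  # 0 or missing suggests an empty response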
I was able to get the zip files downloaded by using the requests module and also opted for using PyQuery instead of Beautiful Soup. I think the issue you were facing has to do with the SSL cert validation, where the requests module will allow you to skip checking the certificate if you set the verify parameter to False.
The function below will download all the zip files and unzip them, from there, you can import the shapefiles into your geodatabase:
import requests
import os
import zipfile
from pyquery import PyQuery
from requests.packages.urllib3.exceptions import InsecureRequestWarning, InsecurePlatformWarning, SNIMissingWarning

# disable ssl warnings (we are not verifying SSL certificates at this time...future enhancement?)
for warning in [SNIMissingWarning, InsecurePlatformWarning, InsecureRequestWarning]:
    requests.packages.urllib3.disable_warnings(warning)

def download_zips(out_path):
    url = 'http://www.floodmaps.fema.gov/NFHL/status.shtml'
    download_prefix = 'https://hazards.fema.gov/femaportal/NFHL'
    pq = PyQuery(requests.get(url, verify=False).content)  # verify param important for SSL
    src = pq.find('iframe').attr('src')
    pq = PyQuery(requests.get(src, verify=False).content)
    table = pq.find('table')

    for a in table.find('a'):
        href = a.attrib.get('href')
        url = '/'.join([download_prefix, href])
        r = requests.get(url, stream=True, verify=False)
        out_zip = os.path.join(out_path, href.split('=')[-1])
        with open(out_zip, 'wb') as f:
            for chunk in r.iter_content(1024 * 16):  # grab 16 KB at a time
                if chunk:
                    f.write(chunk)
        print 'downloaded zip: "{}"'.format(href.split('=')[-1])

        # do more stuff like unzip?
        unzipped = out_zip.split('.zip')[0]
        with zipfile.ZipFile(out_zip, 'r') as f:
            f.extractall(unzipped)
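As a usage sketch (the folder and dataset name are just examples), the shapefiles extracted under out_path can then be pushed into the file GDB with the same arcpy calls used in the question; gdbPath is assumed to be the file GDB created by gdbSetup() above:

out_path = r"C:\Users\barr\Desktop\Test"  # example folder from the question
download_zips(out_path)

import os
import arcpy

# Each zip is extracted into its own subfolder, so walk those folders and load
# any shapefiles into a feature dataset, reusing the pattern from the question.
arcpy.CreateFeatureDataset_management(gdbPath, "NFHL")
for folder in os.listdir(out_path):
    workspace = os.path.join(out_path, folder)
    if not os.path.isdir(workspace):
        continue
    arcpy.env.workspace = workspace
    shpList = arcpy.ListFeatureClasses()
    if shpList:
        arcpy.FeatureClassToGeodatabase_conversion(shpList, gdbPath + "\\NFHL")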
import requests
MSA_request=""">G1
MGCTLSAEDKAAVERSKMIDRNLREDGEKAAREVKLLLL
>G2
MGCTVSAEDKAAAERSKMIDKNLREDGEKAAREVKLLLL
>G3
MGCTLSAEERAALERSKAIEKNLKEDGISAAKDVKLLLL"""
q={"stype":"protein","sequence":MSA_request,"outfmt":"clustal"}
r=requests.post("http://www.ebi.ac.uk/Tools/msa/clustalo/",data=q)
This is my script. I send this request to the website, but the result looks as if I did nothing; the web service doesn't seem to receive my request. This method used to work fine with other websites. Maybe this page has a pop-up window asking for cookie agreement?
The form on the page you are referring to has a separate URL, namely
http://www.ebi.ac.uk/Tools/services/web_clustalo/toolform.ebi
you can verify this with a DOM inspector in your browser.
So in order to proceed with requests, you need to access the right page
r=requests.post("http://www.ebi.ac.uk/Tools/services/web_clustalo/toolform.ebi",data=q)
This will submit a job with your input data; it doesn't return the result directly. To check the results, it's necessary to extract the job ID from the previous response and then send another request (with no data) to
http://www.ebi.ac.uk/Tools/services/web_clustalo/toolresult.ebi?jobId=...
However, you should definitely check whether this programmatic access is compatible with the TOS of that website...
Here is an example:
from lxml import html
import requests
import sys
import time

MSA_request = """>G1
MGCTLSAEDKAAVERSKMIDRNLREDGEKAAREVKLLLL
>G2
MGCTVSAEDKAAAERSKMIDKNLREDGEKAAREVKLLLL
>G3
MGCTLSAEERAALERSKAIEKNLKEDGISAAKDVKLLLL"""

q = {"stype": "protein", "sequence": MSA_request, "outfmt": "clustal"}

r = requests.post("http://www.ebi.ac.uk/Tools/services/web_clustalo/toolform.ebi", data=q)

tree = html.fromstring(r.text)
title = tree.xpath('//title/text()')[0]

# check the status and get the job id
status, job_id = map(lambda s: s.strip(), title.split(':', 1))
if status != "Job running":
    sys.exit(1)

# it might take some time for the job to finish
time.sleep(10)

# download the results
r = requests.get("http://www.ebi.ac.uk/Tools/services/web_clustalo/toolresult.ebi?jobId=%s" % (job_id))

# prints the full response
# print(r.text)

# isolate the alignment block
tree = html.fromstring(r.text)
alignment = tree.xpath('//pre[@id="alignmentContent"]/text()')[0]
print(alignment)