File reading not working in Brython/Python

My requirement: read the contents of an input type="file" with ID "rtfile1" and write them to a textarea with ID "rt1".
Based on the documentation at https://brython.info/ I tried to read a file, but it fails with this error:
Access to XMLHttpRequest at 'file:///C:/fakepath/requirements.txt' from origin 'http://example.com:8000' has been blocked by CORS policy: Cross origin requests are only supported for protocol schemes: http, data, chrome, chrome-extension, https.
I tried the following two Brython snippets; both failed with the same error.
Code 1:
def file_read(ev):
    doc['rt1'].value = open(doc['rtfile1'].value).read()

doc["rtfile1"].bind("input", file_read)
Code 2:
def file_read(ev):

    def on_complete(req):
        if req.status == 200 or req.status == 0:
            doc['rt1'].value = req.text
        else:
            doc['rt1'].value = "error " + req.text

    def err_msg():
        doc['rt1'].value = "server didn't reply after %s seconds" % timeout

    timeout = 4

    def go(url):
        req = ajax.ajax()
        req.bind("complete", on_complete)
        req.set_timeout(timeout, err_msg)
        req.open('GET', url, True)
        req.send()
        print('Triggered')

    go(doc['rtfile1'].value)

doc["rtfile1"].bind("input", file_read)
Any help would be greatly appreciated. Thanks!!! :)

It's not related to Brython (you would have the same result with the equivalent Javascript), but to the way you tell the browser which file you want to upload.
If you select the file by an HTML tag such as
<input type="file" id="rtfile1">
the object referenced by doc['rtfile1'] in the Brython code has a value attribute, but it is not the file path or URL: it is a "fakepath" built by the browser (as you can see in the error message), and you can't use it as an argument to the Brython function open(), or as a URL to send an Ajax request to. If you want to use the file URL, you would have to enter it in a basic input tag (without type="file").
It is better to select the file with type="file", but in this case the files attribute of doc['rtfile1'] is a FileList object, described in the DOM Web API, whose first element is a File object. Reading its content is unfortunately not as simple as with open(), but here is a working example:
from browser import window, document as doc

def file_read(ev):

    def onload(event):
        """Triggered when the file is read. The FileReader instance is
        event.target.
        The file content, as text, is the FileReader instance's "result"
        attribute."""
        doc['rt1'].value = event.target.result

    # Get the selected file as a DOM File object
    file = doc['rtfile1'].files[0]
    # Create a new DOM FileReader instance
    reader = window.FileReader.new()
    # Read the file content as text
    reader.readAsText(file)
    reader.bind("load", onload)

doc["rtfile1"].bind("input", file_read)

Related

Should I switch from "urllib.request.urlretrieve(..)" to "urllib.request.urlopen(..)"?

1. Deprecation problem
In Python 3.7, I download a big file from a URL using the urllib.request.urlretrieve(..) function. In the documentation (https://docs.python.org/3/library/urllib.request.html) I read the following just above the urllib.request.urlretrieve(..) docs:
Legacy interface
The following functions and classes are ported from the Python 2 module urllib (as opposed to urllib2). They might become deprecated at some point in the future.
2. Searching an alternative
To keep my code future-proof, I'm on the lookout for an alternative. The official Python docs don't mention a specific one, but it looks like urllib.request.urlopen(..) is the most straightforward candidate. It's at the top of the docs page.
Unfortunately, the alternatives - like urlopen(..) - don't provide the reporthook argument. This argument is a callable you pass to the urlretrieve(..) function. In turn, urlretrieve(..) calls it regularly with the following arguments:
block nr.
block size
total file size
I use it to update a progressbar. That's why I miss the reporthook argument in alternatives.
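A reporthook is just a callable that takes those three arguments; for illustration, a minimal progress-printing hook (not from the question, just a sketch) might look like this:

def reporthook(blocknum, blocksize, totalsize):
    # Called by urlretrieve after each block; prints a rough percentage.
    if totalsize > 0:
        percent = min(blocknum * blocksize * 100 / totalsize, 100)
        print('\r%5.1f%%' % percent, end='', flush=True)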
3. urlretrieve(..) vs urlopen(..)
I discovered that urlretrieve(..) simply uses urlopen(..). See the request.py code file in the Python 3.7 installation (Python37/Lib/urllib/request.py):
_url_tempfiles = []
def urlretrieve(url, filename=None, reporthook=None, data=None):
    """
    Retrieve a URL into a temporary location on disk.

    Requires a URL argument. If a filename is passed, it is used as
    the temporary file location. The reporthook argument should be
    a callable that accepts a block number, a read size, and the
    total file size of the URL target. The data argument should be
    valid URL encoded data.

    If a filename is passed and the URL points to a local resource,
    the result is a copy from local file to new file.

    Returns a tuple containing the path to the newly created
    data file as well as the resulting HTTPMessage object.
    """
    url_type, path = splittype(url)

    with contextlib.closing(urlopen(url, data)) as fp:
        headers = fp.info()

        # Just return the local path and the "headers" for file://
        # URLs. No sense in performing a copy unless requested.
        if url_type == "file" and not filename:
            return os.path.normpath(path), headers

        # Handle temporary file setup.
        if filename:
            tfp = open(filename, 'wb')
        else:
            tfp = tempfile.NamedTemporaryFile(delete=False)
            filename = tfp.name
            _url_tempfiles.append(filename)

        with tfp:
            result = filename, headers
            bs = 1024*8
            size = -1
            read = 0
            blocknum = 0
            if "content-length" in headers:
                size = int(headers["Content-Length"])

            if reporthook:
                reporthook(blocknum, bs, size)

            while True:
                block = fp.read(bs)
                if not block:
                    break
                read += len(block)
                tfp.write(block)
                blocknum += 1
                if reporthook:
                    reporthook(blocknum, bs, size)

    if size >= 0 and read < size:
        raise ContentTooShortError(
            "retrieval incomplete: got only %i out of %i bytes"
            % (read, size), result)

    return result
4. Conclusion
From all this, I see three possible decisions:
1. I keep my code unchanged. Let's hope the urlretrieve(..) function won't get deprecated anytime soon.
2. I write myself a replacement function behaving like urlretrieve(..) on the outside and using urlopen(..) on the inside. Actually, such a function would be a copy-paste of the code above. It feels unclean to do that, compared to using the official urlretrieve(..).
3. I write myself a replacement function behaving like urlretrieve(..) on the outside and using something entirely different on the inside. But hey, why would I do that? urlopen(..) is not deprecated, so why not use it?
What decision would you take?
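For what it's worth, option 2 does not have to be a full copy-paste of the code above. A minimal sketch of a urlopen(..)-based replacement that keeps the reporthook behaviour (the function name is illustrative, and it skips the temp-file and file:// handling of the real urlretrieve(..)) could look like this:

import urllib.request

def urlopen_retrieve(url, filename, reporthook=None, blocksize=1024 * 8):
    """Download `url` to `filename`, calling reporthook(blocknum, blocksize, totalsize)
    the way urlretrieve does. Minimal sketch only."""
    with urllib.request.urlopen(url) as response, open(filename, 'wb') as out:
        headers = response.info()
        totalsize = int(headers.get('Content-Length', -1))
        blocknum = 0
        if reporthook:
            reporthook(blocknum, blocksize, totalsize)
        while True:
            block = response.read(blocksize)
            if not block:
                break
            out.write(block)
            blocknum += 1
            if reporthook:
                reporthook(blocknum, blocksize, totalsize)
    return filename, headers

A call such as urlopen_retrieve(url, 'out.zip', reporthook=reporthook) would then drive the same progress bar as the original urlretrieve(..) call.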
The following example uses urllib.request.urlopen to download a zip file containing Oceania's crop production data from the FAO statistical database. In that example, it is necessary to define a minimal header, otherwise FAOSTAT throws an Error 403: Forbidden.
import shutil
import urllib.request
import tempfile

# Create a request object with URL and headers
url = 'http://fenixservices.fao.org/faostat/static/bulkdownloads/Production_Crops_Livestock_E_Oceania.zip'
header = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) '}
req = urllib.request.Request(url=url, headers=header)

# Define the destination file
dest_file = tempfile.gettempdir() + '/' + 'crop.zip'
print(f"File located at: {dest_file}")

# Create an http response object
with urllib.request.urlopen(req) as response:
    # Create a file object
    with open(dest_file, "wb") as f:
        # Copy the binary content of the response to the file
        shutil.copyfileobj(response, f)
Based on https://stackoverflow.com/a/48691447/2641825 for the request part and https://stackoverflow.com/a/66591873/2641825 for the header part; see also urllib's documentation at https://docs.python.org/3/howto/urllib2.html
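To verify the download, the archive can then be inspected with the standard zipfile module; a small sketch reusing the dest_file variable from above:

import zipfile

# List the files contained in the downloaded archive
with zipfile.ZipFile(dest_file) as z:
    print(z.namelist())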

Unexpected query string parameters error when page is not in base directory with Cherrypy

I am currently building a website using CherryPy and have run into an error when attempting to open a page whose URL includes a query string, http://localhost:8080/protected/my_page.html?pk=3e4e285c-ed33-403e-a7a1-6b79b5f8356d. I get this error:
Traceback (most recent call last):
  File "C:\Python27\lib\site-packages\cherrypy\_cprequest.py", line 670, in respond
    response.body = self.handler()
  File "C:\Python27\lib\site-packages\cherrypy\lib\encoding.py", line 221, in __call__
    self.body = self.oldhandler(*args, **kwargs)
  File "C:\Python27\lib\site-packages\cherrypy\_cpdispatch.py", line 66, in __call__
    raise sys.exc_info()[1]
HTTPError: (404, 'Unexpected query string parameters: pk')
I currently host this page in a /protected/ directory which requires authentication. But if I move a copy of the page back to my root dir (/wwwroot) and then load the exact same URL without the /protected/ part, http://localhost:8080/my_page.html?pk=3e4e285c-ed33-403e-a7a1-6b79b5f8356d, the page loads successfully and all the fields are populated with the data I expect matching my query. That, however, leaves the page available to anyone rather than requiring authentication.
The page should load and, if the query string is in the URL, get the value from this line of Python code: pk = context.get("pk", default = ""). There is then a DB lookup using the cx_Oracle Python module and the pk variable to find the matching record to display on the page. I tried commenting out the pk variable assignment from my page to see if that would work, but I am greeted with the same error.
Besides just moving the page to my base directory to make it work, does anyone have any idea how to resolve this error?
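For context, CherryPy raises this particular 404 when the page handler that ends up serving the request does not accept the query-string parameters as keyword arguments. A minimal sketch of an exposed handler that tolerates pk (names are illustrative and this does not reproduce the question's static-file setup):

import cherrypy

class Protected(object):
    @cherrypy.expose
    def my_page(self, pk="", **kwargs):
        # Accepting pk (and **kwargs for any extra parameters) keeps the
        # dispatcher from rejecting the query string with a 404.
        return "record for pk = %s" % pk

# cherrypy.quickstart(Protected(), '/protected')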

Upload a text file to a site for analysis using MechanicalSoup

I'm trying to send a text file to TreeTagger Online to have it analyzed, and to get the link to the resulting file so I can download it.
import mechanicalsoup
browser = mechanicalsoup.Browser()
homePage = browser.get("http://cental.fltr.ucl.ac.be/treetagger/")
formPart = homePage.soup.select("form[name=treetagger_form]")[0]
formPart.select("[name=file_to_tag]")[0]["name"]=open('test.txt', 'rb')
result = browser.post(formPart, homePage.url)
This gives me the following error:
: (, UnicodeEncodeError('ascii', u'No connection adapters were found for \'\n\n\n\n\n Texte \xe0 \xe9tiqueter : \n \n\n\n\n\n\n\n\n\n\n\n\n\n\n\'', 216, 217, 'ordinal not in range(128)'))
How should I proceed to get my file on site (using MechanicalSoup or another module)?
01/04/19 Edit
Even though I did not manage to get @Rolando Urquiza's answer to work on my machine, I was able to get it done based on his suggestions.
import mechanicalsoup
browser = mechanicalsoup.Browser()
homePage = browser.get("http://cental.fltr.ucl.ac.be/treetagger/")
formPart = homePage.soup.select("form[name=treetagger_form]")[0]
form=mechanicalsoup.Form(formPart)
form.set('file_to_tag', 'test.txt')
upload=browser.submit(form,url="http://cental.fltr.ucl.ac.be/treetagger/")
Thanks @Rolando Urquiza
According to the documentation of MechanicalSoup, you can upload a file using the set function on a mechanicalsoup.Form instance, see here. For example, this is how you can use it:
import mechanicalsoup
browser = mechanicalsoup.StatefulBrowser()
browser.get("http://cental.fltr.ucl.ac.be/treetagger/")
form = browser.select_form()
form.set('file_to_tag', 'test.txt')
result = browser.submit_selected()
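If you also need the link to the tagged result, the response returned by submit_selected() is a requests response to which MechanicalSoup attaches a .soup attribute, so a sketch like the following could list candidate links (the selector is a guess, not taken from the TreeTagger page):

# List the hyperlinks on the result page returned by the form submission
for a in result.soup.select('a[href]'):
    print(a['href'])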

Saving Image from URL using Python Requests - URL type error

Using the following code:
with open('newim', 'wb') as f:
    f.write(requests.get(repr(url)))
where the url is:
url = 'data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAFoAAAArCAYAAAD41p9mAAAAzUlEQVR42u3awQ4DIQhFUf7/p9tNt20nHQGl5yUuh4c36BglgoiIiIiIiGiVHq+RGfvdiGG+lxKonGiWd4vvKZNd5V/u2zXRO953c2jx3bGiMrewLt+PgbJA/xJ3RS5dvl9PEdXLduK3baeOrKrc1bcF9MnLP7WqgR4GOjtOl28L6AlHtLSqBhpooIEGGmiggQYaaKCBBhpodx3H3XW4vQN6HugILyztoL0Zhlfw9G4tfR0FfR0VnTw6lQoT0XtXmMxfdJPuALr0x5Pp+wT35KKWb6NaVgAAAABJRU5ErkJggg=='
I get the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python33\lib\site-packages\requests\api.py", line 69, in get
return request('get', url, params=params, **kwargs)
File "C:\Python33\lib\site-packages\requests\api.py", line 50, in request
response = session.request(method=method, url=url, **kwargs)
File "C:\Python33\lib\site-packages\requests\sessions.py", line 465, in request
resp = self.send(prep, **send_kwargs)
File "C:\Python33\lib\site-packages\requests\sessions.py", line 567, in send
adapter = self.get_adapter(url=request.url)
File "C:\Python33\lib\site-packages\requests\sessions.py", line 641, in get_adapter
raise InvalidSchema("No connection adapters were found for '%s'" % url)
I have seen other posts with what, at first glance, appears to be a similar problem, but I haven't had any luck just adding 'https://' or anything like that. I seriously want to avoid having to do this with webdriver+AutoIt or something, because I have to do a similar exercise for thousands of images.
There seems to be a problem with your understanding of the concept of embedded images. The url you have posted is, actually, what your browser returns when you select 'View Image' or 'Copy Image Location' (or something similar, depending on the browser) from the context menu, and formally is called a data URI.
It is not an http url pointing to an image, and you can not use it to retrieve actual images from any server: this is exactly what requests points out in the error message.
So, how do we get these images?
The following script will handle this task:
import requests
from lxml import html
import binascii as ba

i = 0
url = "<Page URL goes here>"  # Ex: http://server/dir/images.html
page = requests.get(url)
struct = html.fromstring(page.text)
images = struct.xpath('//img/@src')
for img in images:
    i += 1
    ext = img.partition('data:image/')[2].split(';')[0]
    with open('newim' + str(i) + '.' + ext, 'wb') as f:
        f.write(ba.a2b_base64(img.partition('base64,')[2]))
print("Done")
To run it you will need to install, along with requests, the lxml library.
Here follows a short description of how the script functions:
First it requests the url from the server and, after it gets the server's response, it stores it in a Response object (page).
Then it utilizes html.fromstring() from lxml to transform the "textified" content of page into a tree structure which can be processed by commands utilizing XPath syntax, like this one: images = struct.xpath('//img/@src').
The result is a list containing the contents of the src attribute of every image in the page. In this case (embedded images) these are the data URIs.
Then, for every image in the list, it first gets the image type (which will be used as the newim's extension), using partition() and split() and stores it in ext. Then it converts the base64 encoded data to binary (using a2b_base64() from binascii module) and writes the output to the file.
As a small demo, save this html code (as, eg, images.html) somewhere in your server
<h1>Images</h1>
<img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAEcAAAAmCAYAAACWGIbgAAACKElEQVR4nO2aPWsUQRiAnznSKYI/IbBJoTkCEhDuuhQnMY2VhdjYbAoLO9N4VfwBlyKFFlYKYqNN9iCFnQeCCHrJFdkFf4IgKcO9FnLJJtzOfs3OGJgHtpi5nZl3H+abUyIieObSch3A/4yXo8HL0eDlaPByNHg5GrwcDV6OhnJyPt0bK6XYGhqMYLiFUir1dNlNDNafpmz8kkskU1gHZPaEUX6pfGKZ9nkOSPe9xJfz6AxmmTWpHn+2nDe8S1doVk5KAqFcqC4Kz9uq05CB+OfL0VRsRM7H3s9MAfFAWnCtVluG4p8/5zyRR/JPHBIPaMH1gqO0AAny/eBt5s/BMqdwd5Z8/XKX0lOQofjtr1bJPgs77BV+f/SB/aYm6BzsyzmMxlM4KV5gxCRuLhwd9uX8PhiXLXL0p/zIMoFlOQnyix/pnO76Ru6Hf/k8DJqLKZursUM+PHbSdSzLiWGHb3bbrM7V6DmOsCxnCfqslS62soyLSceynAC1yGrZUkUm7SawP6xu9trpZJGV6PYNJx3HgZyV++1y2/kOt5aaC0eHfTnBJqd9nhZ+v/OQTSf9xslqFaDu9B6fJS/vYZJjFuDrLBm+eOZmTFFRTu3t/IO99rTPNgCjCReOTvGEs7NXGPFqo1ZLcykcf6W7ESO3dDk3gWauG2vFX+myK/2cf1hF0jd/INCRQV3zhuJXIv5fFln444MGL0eDl6PBy9Hg5WjwcjR4ORr+Aq7+02kTcdF1AAAAAElFTkSuQmCC" />
<br />
<img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAFoAAAArCAYAAAD41p9mAAAAzUlEQVR42u3awQ4DIQhFUf7/p9tNt20nHQGl5yUuh4c36BglgoiIiIiIiGiVHq+RGfvdiGG+lxKonGiWd4vvKZNd5V/u2zXRO953c2jx3bGiMrewLt+PgbJA/xJ3RS5dvl9PEdXLduK3baeOrKrc1bcF9MnLP7WqgR4GOjtOl28L6AlHtLSqBhpooIEGGmiggQYaaKCBBhpodx3H3XW4vQN6HugILyztoL0Zhlfw9G4tfR0FfR0VnTw6lQoT0XtXmMxfdJPuALr0x5Pp+wT35KKWb6NaVgAAAABJRU5ErkJggg=="></img>
<br />
<img src="data:image/jpg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQH/2wBDAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQH/wAARCAAhADYDASIAAhEBAxEB/8QAHAABAQACAgMAAAAAAAAAAAAAAAoICQILBAYH/8QAKxAAAQQDAAICAQIGAwAAAAAABQMEBgcBAggACQoWFRQhERMXNjh2d7W2/8QAGQEAAgMBAAAAAAAAAAAAAAAAAAQBAgMF/8QAIxEAAgMAAgMBAQEAAwAAAAAAAwQBAgUSEwAGERQjIRUiJP/aAAwDAQACEQMRAD8Av48eYVd0Vd2Ba1RphuOuxQHFUyEvnUgktmP+aIn0vIC4IWPWXSi0cBT+dxiFRjUg5xtkwbKRyYEVG2rdENgEumu7dRu/DrvK7+hOs/YbYPQVx2helhf0dokEpOrcncosOV7hxU3sbI8TodlxUyUQEMN3LrYcJSd6jx+HC2jRBLVTbGYzZnR1W8uIle6aDejchJi0mXXzHnxlXoObVkZmESoTByrsCINhiVrgqrZ2X/iOYLRme6pn87PrQccajM9ohStQ9ycb1IEJqOUgAWAmpagbMANJahv38eSO/Kc9k9vcyRrk/irn2yZTU8z68myq9s2BX5JeO2GApUNIY5GtgcQlzTfUhE304PnnWjo0H/llPxcUICP1KQ0wRbPcQfe1LYt6C+lvWH0j67K/ifOsbk2LLg3RlMVIFZV/W/TFc1G5rDcQPuaMxpMeDn0xahLJnbQHYsoFlZuFMGUZC3PbEhzHdCiZhMyiYxKK5+l7I16sm/e0WH/yGejVh9lqs9cL5irjCmbdihGGO+mmSydAoAtoXMEtSsJrUs1qK+vz7KZCvEVvwXIzZYAy3njZ9rPzdbREAlBAkIs2n6pvp/8Akug8eaUfZnzb2d7Oqtq3n3lTpDbjPmC268Xsi+elAbNeWWVNwsgGsvodBwmFxycwM59XlDEgRO3AaezGKCyMe+vRRi5lwsvNourCrUlOdvfHg95vPfLkCujewgV2WdQgwkyiah2MQLpOjLbsQjWoxvP65fviY0VNguzyX6BkHD+Xb15L08FozJizVfLsiwgIrO2rhuDJmsaWkbHy5MO5CMaICHWKNpekQTPDVpZiKSfk20mAmkik0kRMjazRhDxj7KV66IUs8Oq/UX0VQIHqA47rmvE1dLZNgJJqCv5gOmDlutqO1con2rHjx48z8189dl/9pSj/AF03/wBY6869r4R/+Q/sP/40pr/3FgeXb9NPOpk4KsL5VragLCl58fIhBJboK8rFpeNxXDwOq3CG2Olc8736UnOUyi2NysbX2r3G7BDOGkp0cOf4tZZvRr6SPal6eLwuKwTK/Ad7wq+YtFojLwYro/oWv5BF8x2WbGm8rAvHnFMvZyBZiMKSJrrEXv11Aw7djt1pYHTZqbrxg/x9o0WzfxVL64xliOWOAyPHz/YBjDWZ/wBinZoJUszaKqUsaYuev52ug2f6euAVF/VmPYMvSsEf/a9Ekn0CHLaI/wA51GA96L1mzJKi/mG0lXg2APyzo0fjvtk9WlqmlVUq9KRuAx8c4/HqbItTdf8AR2ZBMt8O0ElF3i34abxNXZjphRRvonooglnd5vjb6F85HGdgfrX11xnO2S/VmMYxjOc5zlnz3jGMYx++c5z+2MY/fOfKgPb36oa69rtBxWvTM4IVDcdPTRCy6CuYYI0kf0iYJaIokB8gjKj8T9jiElaNmiZYe2LiSDQmMBG2T/fIlUWTw9v71M9Cez3pzje4fZk351gdO8Tj3Z4PRHNlpWdeDfom1ZK/jxOTF7FlVo0ZQGlaVsi8gMKwlAAsds03JhLiShCk9EaKs3+ySaNTI4uKe8rCzPfd72Bt21azScXeVh/9C44mvc0q+NnLslW9mC3ui5NaqsMFVeI9ZXXd9goOGpb9IWwBIhmKGrq5yj2CARL2jjRZpTVU05cmvWIaWsG/JgaYnNuPIo7FHcK8zC7SkI0FrUnKVOs7ClEjJIig4NODVLHkpSZOFyrhNsNFidBj50QIkXSSDRo2VculUk099tdFvG3Ku/sL9lkv983RAfMN5yraGYrr1swudMHkYMG6ph7U+ht2BPhp39FqBisv+wzmaU21PIIvXcflw2fumolnHYSZkGQvvi4X9ovsXqCPcscS2rzFSPPkgQSK36ZtKxrdi1lWY+YEN1AdbMR9f0fYAgVV4/DVhITzjMl/MTM1swCuhgWPR17pNtC8f+PP8jqTGo9Fb79uQ+waFMmgge4a8e96d5TEdMapckWjadRNaGyun28UkqBuK7ExeI9IVm4ErhxqwKLIMVllNOzV0uv7I/skrTNYY0XoyTtTa6uYbWIyDR2jCHxKe4Vmio514kP5kzabtRGdayWsjjVSHletoY8XK+IWcjTXElA4PoJ5gglXxRdswD48wuBt6liRS5AKZpr/AJb6YD3wMH7AqwZFBb1oSGEmjZ+OIsHKLxg/YPEdHDR6ydt91G7po6bqJrtnKCiiK6KmiqW+2m2u2XnNo0asGjViybpNGTJui0aNW+mqSDZq2T1Rbt0UtMY0TSRS00TT01xjXTTXXXXGMYx48XJw7L9XPq526+yYknD7PDnNYis34/OU1iI+/fkfPGBdnWPu4d3Cnb1TaR9nGOzrm8RaacvvCbRFuPz7H3755Hjx48p5fx48ePDw8ePHjw8PHjx48PDz/9k="/>
and point to it in the script: requests.get("http://yourserver/somedir/images.html").
When you run the script you will get three image files, respectively named newim1.png, newim2.png and newim3.jpg.
As a reminder, do note that this script (in its current form) will only handle embedded images. If you want to process also ordinary linked images, then you have to modify it accordingly (but this is not difficult).
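That modification is essentially a branch on whether the src is a data URI or an ordinary URL; a minimal sketch, reusing the url, struct, requests and ba names from the script above:

from urllib.parse import urljoin

# Handle both embedded (data URI) and ordinary linked images
for i, src in enumerate(struct.xpath('//img/@src'), start=1):
    if src.startswith('data:image/'):
        ext = src.partition('data:image/')[2].split(';')[0]
        data = ba.a2b_base64(src.partition('base64,')[2])
    else:
        # Ordinary linked image: resolve a possibly relative URL and download it
        resp = requests.get(urljoin(url, src))
        ext = src.rsplit('.', 1)[-1]
        data = resp.content
    with open('newim' + str(i) + '.' + ext, 'wb') as f:
        f.write(data)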
This is an image encoded in base64. Quoting the URL below: "base64 equals to text (string) representation of the image itself".
Read this for a detailed explanation:
http://www.stoimen.com/blog/2009/04/23/when-you-should-use-base64-for-images/
In order to use them you'll have to implement a base64 decoder. Luckily SO already provides you with the answer on how to do it:
Python base64 data decode
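Applied to the question's url variable, that decoding step only needs the standard base64 module; a minimal sketch:

import base64

# `url` is the data URI from the question; everything after "base64," is the payload
encoded = url.partition('base64,')[2]
with open('newim.png', 'wb') as f:
    f.write(base64.b64decode(encoded))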

Python - downloading a CSV file from a web page with a link

I am trying to download the CSV files from this page, via a python script.
But when I try to access the CSV file directly by links in my browser, an agreement form is displayed. I have to agree to this form before I am allowed to download the file.
The exact URLs of the CSV files can't be retrieved; a value is sent to the backend DB, which fetches the file, e.g. PERIOD_ID=2013-0:
https://www.paoilandgasreporting.state.pa.us/publicreports/Modules/DataExports/ExportProductionData.aspx?PERIOD_ID=2013-0
I've tried urllib2.urlopen() and .read(), but that returns the HTML content of the agreement form, not the file content.
How do I write Python code which handles this redirect, then fetches the CSV file and lets me save it to disk?
You need to set the ASP.NET_SessionId cookie. You can find this by using Chrome's Inspect element option in the context menu, or by using Firefox and the Firebug extension.
With Chrome:
Right-click on the webpage (after you've agreed to the terms) and select Inspect element
Click Resources -> Cookies
Select the only element in the list
Copy the Value of the ASP.NET_SessionId element
With Firebug:
Right-click on the webpage (after you've agreed to the terms), and click Inspect Element with Firebug
Click Cookies
Copy the Value of the ASP.NET_SessionId element
In my case, I got ihbjzynwfcfvq4nzkncbviou. It might work for you; if not, you need to perform the above procedure.
Add the cookie to your request, and download the file using the requests module (based on an answer by eladc):
import requests

cookies = {'ASP.NET_SessionId': 'ihbjzynwfcfvq4nzkncbviou'}
r = requests.get(
    url=('https://www.paoilandgasreporting.state.pa.us/publicreports/Modules/'
         'DataExports/ExportProductionData.aspx?PERIOD_ID=2013-0'),
    cookies=cookies
)

with open('2013-0.csv', 'wb') as ofile:
    for chunk in r.iter_content(chunk_size=1024):
        ofile.write(chunk)
        ofile.flush()
Here's my suggestion, for automatically applying the server cookies and basically mimicking standard client session behavior.
(Shamelessly inspired by @pope's answer 554580.)
import urllib2
import urllib
from lxml import etree

_TARGET_URL = 'https://www.paoilandgasreporting.state.pa.us/publicreports/Modules/DataExports/ExportProductionData.aspx?PERIOD_ID=2013-0'
_AGREEMENT_URL = 'https://www.paoilandgasreporting.state.pa.us/publicreports/Modules/Welcome/Agreement.aspx'
_CSV_OUTPUT = 'urllib2_ProdExport2013-0.csv'


class _MyHTTPRedirectHandler(urllib2.HTTPRedirectHandler):
    def http_error_302(self, req, fp, code, msg, headers):
        print 'Follow redirect...'  # Any cookie manipulation in-between redirects should be implemented here.
        return urllib2.HTTPRedirectHandler.http_error_302(self, req, fp, code, msg, headers)

    http_error_301 = http_error_303 = http_error_307 = http_error_302


cookie_processor = urllib2.HTTPCookieProcessor()
opener = urllib2.build_opener(_MyHTTPRedirectHandler, cookie_processor)
urllib2.install_opener(opener)

response_html = urllib2.urlopen(_TARGET_URL).read()
print 'Cookies collected:', cookie_processor.cookiejar

page_node, submit_form = etree.HTML(response_html), {}  # ElementTree node + dict for storing hidden input fields.
for input_name in ['ctl00$MainContent$AgreeButton', '__EVENTVALIDATION', '__VIEWSTATE']:  # Form `input` fields used on the ``Agreement.aspx`` page.
    submit_form[input_name] = page_node.xpath('//input[@name="%s"][1]' % input_name)[0].attrib['value']
    print 'Form input \'%s\' found (value: \'%s\')' % (input_name, submit_form[input_name])

# Submits the agreement form back to ``_AGREEMENT_URL``, which redirects to the CSV download at ``_TARGET_URL``.
csv_output = opener.open(_AGREEMENT_URL, data=urllib.urlencode(submit_form)).read()
print csv_output

with file(_CSV_OUTPUT, 'wb') as f:  # Dumps the CSV output to ``_CSV_OUTPUT``.
    f.write(csv_output)
    f.close()
Good luck!
[Edit]
On the why of things, I think @Steinar Lima is correct with respect to requiring a session cookie. But unless you've already visited the Agreement.aspx page and submitted a response via the provider's website, the cookie you copy from the browser's web inspector will only result in another redirect to the Welcome to the PA DEP Oil & Gas Reporting Website welcome page, which of course defeats the whole point of having a Python script do the job for you.
