Python urlfetch returning bad data - python

Note: This question has been edited to reflect new information, including the title which used to be 'How to store a PDF in Amazon S3 with Python Boto library'.
I'm trying to save a PDF file using urlfetch (if the url was put into a browser, it would prompt a 'save as' dialog), but there's some kind of encoding issue.
There are lots of unknown characters showing up in the urlfetch result, as in:
urlfetch.fetch(url).text
The result has chars like this: s�*��E����
Whereas the same content in the actual file looks like this: sÀ*ÿ<81>E®<80>Ùæ
So this is presumably some sort of encoding issue, but I'm not sure how to fix it. I'm using urlfetch version 1.0.
For what it's worth, the PDF I've been testing with is here: http://www.revenue.ie/en/tax/it/forms/med1.pdf

I switched from urlfetch to urllib, e.g.:
import urllib
result = urllib.urlopen(url)
...and everything seems to be fine.
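The replacement characters shown in the question are what one would expect when raw PDF bytes are decoded as UTF-8 text. Below is a minimal sketch of that effect using only the standard library. The byte string is reconstructed from the sample in the question (an assumption), and the helper name is illustrative, not part of urlfetch:

```python
# Bytes reconstructed from the sample shown in the question (an assumption).
raw = b"s\xc0*\xff\x81E\xae\x80\xd9\xe6"


def roundtrip_as_text(data: bytes) -> bytes:
    """Decode bytes as UTF-8 text, then encode again (hypothetical helper)."""
    # Invalid byte sequences are replaced by U+FFFD, so the originals are lost.
    return data.decode("utf-8", errors="replace").encode("utf-8")


as_text = raw.decode("utf-8", errors="replace")
damaged = roundtrip_as_text(raw)

print(repr(as_text))   # contains U+FFFD replacement characters
print(damaged == raw)  # the round trip does not restore the original bytes
```

This is why binary downloads are better written from the raw bytes (for example, the result of calling read() on the urllib response above) into a file opened in 'wb' mode, without going through a text attribute.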

Related

How to encode a video response in python?

I made a request and received a bytes response that I know is a video; the status code was 200. I don't know how to use it. I tried encoding it as UTF-8 and saving it to a file, but the file is not playable: media players are unable to read its contents. Here's the request I made:
import requests
resp = requests.get('https://bcboltsony-a.akamaihd.net/media/v1/hls/v4/aes128/5182475815001/4ded6ac4-6f8b-4da2-8194-db2391d5e331/164fe5c5-15a3-4997-b4c6-7dd4b95f9c57/92410c6d-c565-4341-8650-1d40a795ece2/5x/segment1.ts?akamai_token=exp=1589337578~acl=/media/v1/hls/v4/aes128/5182475815001/4ded6ac4-6f8b-4da2-8194-db2391d5e331/164fe5c5-15a3-4997-b4c6-7dd4b95f9c57/92410c6d-c565-4341-8650-1d40a795ece2/*~hmac=bf9745f2a9b51c04d59eb9955de20dcf1b4c8c7e434ad0bdd639f2d80fa10ecc')
open('E:/video.mp4', 'wb').write(bytes(resp.text, encoding='utf-8'))
How can I convert this response to a watchable format?
Try using wget, which can make downloading files much easier.
Here is a simple example for your situation:
import wget
url = "https://bcboltsony-a.akamaihd.net/media/v1/hls/v4/aes128/5182475815001/4ded6ac4-6f8b-4da2-8194-db2391d5e331/164fe5c5-15a3-4997-b4c6-7dd4b95f9c57/92410c6d-c565-4341-8650-1d40a795ece2/5x/segment1.ts?akamai_token=exp=1589337578~acl=/media/v1/hls/v4/aes128/5182475815001/4ded6ac4-6f8b-4da2-8194-db2391d5e331/164fe5c5-15a3-4997-b4c6-7dd4b95f9c57/92410c6d-c565-4341-8650-1d40a795ece2/*~hmac=bf9745f2a9b51c04d59eb9955de20dcf1b4c8c7e434ad0bdd639f2d80fa10ecc"
wget.download(url, 'c:/users/Yourname/downloads/video.mp4')
If this does not work, the encoding problem may be on the URL's side.
Your request is fine, but note that:
If you open this URL in your browser, you will find it is a .ts file rather than an .mp4 file.
Also, if you download it directly in the browser, you still can't play it. On my PC, the player reports that the file is damaged.
Judging by the URL path (aes128), this .ts segment is encrypted with AES-128, so you will probably need to decrypt it before it can be played.
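One way to tell whether a downloaded segment is plain MPEG-TS or still encrypted is to look for the transport stream packet structure: unencrypted MPEG-TS data consists of 188-byte packets that each start with the sync byte 0x47. The sketch below checks only that property. The function name and the number of packets inspected are illustrative choices, and decryption itself is not shown:

```python
TS_PACKET_SIZE = 188  # size of one MPEG transport stream packet in bytes
TS_SYNC_BYTE = 0x47   # first byte of every transport stream packet


def looks_like_mpeg_ts(data: bytes, packets_to_check: int = 5) -> bool:
    """Return True if the first few packets start with the TS sync byte."""
    if len(data) < TS_PACKET_SIZE:
        return False
    available = len(data) // TS_PACKET_SIZE
    for i in range(min(packets_to_check, available)):
        if data[i * TS_PACKET_SIZE] != TS_SYNC_BYTE:
            return False
    return True
```

If this returns False for resp.content, the data is probably still encrypted (or is not a transport stream at all), and renaming the file will not make it playable.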
Replace the write call in your code with the line below, which writes the raw bytes instead of re-encoding the decoded text. I hope it will work :).
open('E:/video.mp4', 'wb').write(resp.content)
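For larger files, the same idea can be applied chunk by chunk so that the whole body never has to sit in memory. A minimal sketch, assuming the chunks come from any iterable of bytes objects; the helper name is hypothetical:

```python
from typing import Iterable


def write_chunks(chunks: Iterable[bytes], path: str) -> int:
    """Write an iterable of byte chunks to path and return the byte count."""
    written = 0
    with open(path, "wb") as fh:  # binary mode: no text encoding is involved
        for chunk in chunks:
            if chunk:  # skip empty chunks
                written += fh.write(chunk)
    return written
```

With requests, the chunks could come from a streamed response, for example resp = requests.get(url, stream=True) followed by write_chunks(resp.iter_content(chunk_size=8192), 'E:/video.mp4').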

Can't enter txt file contents as query string using python

I can’t get python to open a link that uses the contents of a .txt file as a query string. I’m working on Python 3.7.0 and was able to write code that opens the website and checks a string that I’ve input directly, as well as open my text file and print the contents, but when I try to make the text file’s contents a query it throws an error.
I added lines that print the link that I would need to open to make sure it comes out correctly and that works fine, I can copy and paste it into my browser and get a correct result.
Here's the code I used
And a screenshot of the error I get
I'm a total beginner at this so any suggestions or explanations would be lifesavers!
The error is in the string passed to urlopen(). When it tries to open the link, you get an HTTP 400: Bad Request error, which means something is wrong with the link you provided. The text possibly contains spaces, and you aren't escaping the characters properly. Here is the link which could help you.
Alternatively, you could also use the Python Requests library.
(Please include the code in the question rather than screenshot)
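The escaping mentioned above can be done with the standard library. This is a minimal sketch, assuming the file contents should become the value of a single query parameter; the parameter name "q", the example URL, and the helper name are placeholders, since the original code was only posted as a screenshot:

```python
from urllib.parse import urlencode


def build_query_url(base_url: str, text: str, param: str = "q") -> str:
    """Return base_url with text attached as a properly escaped query string."""
    cleaned = text.strip()  # drop the trailing newline that file reads often include
    return base_url + "?" + urlencode({param: cleaned})


print(build_query_url("https://example.com/search", "hello world\n"))
# https://example.com/search?q=hello+world
```

The resulting string can then be passed to urlopen(); spaces and other reserved characters are escaped, which avoids one common cause of HTTP 400 responses.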
Check that the URL you're requesting actually exists. Also, I'm not sure what your .txt file looks like, but re-examine the code (the .read() part) to make sure the data you want to add as a query is being handled correctly.

Getting file from URL that triggers a download in Python

I have a URL in a web analytics reporting platform that basically triggers a download/export of the report you're looking at. The downloaded file itself is a CSV, and the link that triggers the download uses several attached parameters to define things like the fields in the report. What I am looking to do is download the CSV that the link triggers a download of.
I'm using Python 3.6, and I've been told that the server I'll be deploying on does not support Selenium or any webkits like PhantomJS. Has anyone successfully accomplished this?
If the file is a CSV file, you might want to consider downloading its content directly by using the requests module, something like this:
import requests
session = requests.Session()
information = session.get(url)  # url is the link that triggers the download
Then you can decode the information and read the contents as you wish using the csv module, something like this:
import csv
decoded_information = information.content.decode('utf-8')
data = decoded_information.splitlines()
data = csv.DictReader(data)
You can use a for loop to access each row in the data as you wish using the column headings as dictionary keys like so:
for row in data:
    itemdate = row['Date']
    ...
Or you can save the decoded contents by writing them to a file with something like this:
decoded_information = information.content.decode('utf-8')
# newline="" keeps the line endings of the downloaded data unchanged
with open("filename.csv", "w", newline="") as file:
    file.write(decoded_information)
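The decode-and-parse steps above can be wrapped in a small function that works on any bytes payload. This is a sketch under two assumptions: the export is UTF-8 (the "utf-8-sig" codec also tolerates a leading byte order mark), and the sample column names are made up for illustration:

```python
import csv
import io


def parse_csv_bytes(payload: bytes, encoding: str = "utf-8-sig") -> list:
    """Decode a CSV payload and return its rows as a list of dictionaries."""
    text = payload.decode(encoding)
    # newline="" lets the csv module handle line endings itself
    reader = csv.DictReader(io.StringIO(text, newline=""))
    return list(reader)


# Hypothetical sample data standing in for information.content
sample = b"\xef\xbb\xbfDate,Visits\r\n2020-01-01,10\r\n2020-01-02,12\r\n"
rows = parse_csv_bytes(sample)
for row in rows:
    print(row["Date"], row["Visits"])
```

With the code above, the call would be parse_csv_bytes(information.content).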
A couple of links with documentation on the csv module are provided here in case you haven't used it before:
https://docs.python.org/3/library/csv.html
http://www.pythonforbeginners.com/systems-programming/using-the-csv-module-in-python/
Hope this helps!

Python downloading PDF with urllib2 creates corrupt document

Last week I defined a function to download pdfs from a journal website. I successfully downloaded several pdfs using:
import urllib2
def pdfDownload(url):
    response = urllib2.urlopen(url)
    expdf = response.read()
    egpdf = open('ex.pdf', 'wb')
    egpdf.write(expdf)
    egpdf.close()
I tried this function out with:
pdfDownload('http://pss.sagepub.com/content/26/1/3.full.pdf')
At the time, this was how the URLs on the journal Psychological Science were formatted. The pdf downloaded just fine.
I then went to write some more code to actually generate the URL lists and name the files appropriately so I could download large numbers of appropriately named pdf documents at once.
When I came back to join my two scripts together (sorry for non-technical language; I'm no expert, have just taught myself the basics) the formatting of URLs on the relevant journal had changed. Following the previous URL takes you to a page with URL 'http://journals.sagepub.com/doi/pdf/10.1177/0956797614553009'. And now the pdfDownload function doesn't work anymore (either with the original URL or new URL). It creates a pdf which cannot be opened "because the file is not a supported file type or has been damaged".
I'm confused: it seems to me that all that has changed is the formatting of the URLs, but something else must have changed to cause this. Any help would be hugely appreciated.
The problem is that the new URL points to a webpage--not the original PDF. If you print the value of "expdf", you'll get a bunch of HTML--not the binary data you're expecting.
I was able to get your original function working with a small tweak--I used the requests library to download the file instead of urllib2. requests appears to pull the file with the loader referenced in the html you're getting from your current implementation. Try this:
import requests
def pdfDownload(url):
    response = requests.get(url)
    expdf = response.content
    egpdf = open('ex.pdf', 'wb')
    egpdf.write(expdf)
    egpdf.close()
requests is a third-party package on both Python 2.7 and Python 3, so if it isn't already installed you'll need to pip install requests.
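Because a changed URL can silently return an HTML page instead of a PDF, it may help to check the downloaded bytes before saving them. PDF files begin with the marker %PDF-, so a minimal sketch of such a check could look like this (the helper names are hypothetical, and the check is deliberately strict about the marker being at the very start):

```python
PDF_MAGIC = b"%PDF-"  # marker found at the start of PDF files


def looks_like_pdf(data: bytes) -> bool:
    """Return True if the data starts with the PDF header marker."""
    return data.startswith(PDF_MAGIC)


def save_if_pdf(data: bytes, path: str) -> None:
    """Write data to path in binary mode, refusing anything that is not a PDF."""
    if not looks_like_pdf(data):
        raise ValueError("downloaded data is not a PDF (possibly an HTML page)")
    with open(path, "wb") as fh:
        fh.write(data)
```

In the function above, save_if_pdf(expdf, 'ex.pdf') would replace the open/write/close lines and fail loudly when the server returns a web page.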

Send PDF file path to client to download after conversion in WeasyPrint

In my Django app, I'm using WeasyPrint to convert html report to pdf. I need to send the converted file back to client so they can download it. But I don't see any code on WeasyPrint site where we can get the path of saved file or know in any way where the file was saved.
If I hard code the path, like, D:/Python/Workspace/report.pdf and try to open it via javascript, it simply says that the address was not understood.
What is a better way to approach this issue?
My code:
HTML(string=htmlContent).write_pdf(fileName,
    stylesheets=[CSS(filename='css/bootstrap.min.css')])
This is all the code related to WeasyPrint that generated PDF file.
You didn't even bother to post the relevant code, but anyway:
If you're using the Python API, you either specify the output file path when calling weasyprint.HTML().write_pdf() or get the PDF back as bytestring, as documented here - and then you can either manually save it to a file somewhere you can redirect your user to or just pass the bytestring to django's HttpResponse.
If you're using the commandline (which would be quite surprising from a Django app...), you have to specify the output path too...
IOW : I don't really understand your problem. FWIW, the whole documentation is here : http://weasyprint.readthedocs.io/en/latest/ - and there's a quite obvious link on the project's homepage (which is how I found it FWIW).
EDIT : now you posted your actual code: the answer is written in plain in the FineManual(tm):
Parameters: target – A filename, file-like object, or None
Returns:
The PDF as byte string if target is not provided or None, otherwise None
(the PDF is written to target.)
IOW, either you pass the filename for the PDF to be generated and then serve this file to the user, or you can just pass your Django HttpResponse as target, cf this example in Django's doc.
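Whichever route is taken, the browser decides whether to offer a download based on the response headers rather than on a server-side file path. The sketch below is framework-agnostic and only builds those headers; the helper name is hypothetical, and the filename is assumed to be plain ASCII without quotes:

```python
def pdf_download_headers(filename: str, size: int) -> dict:
    """Return HTTP headers that ask the browser to download a PDF."""
    return {
        "Content-Type": "application/pdf",
        # "attachment" triggers a save dialog instead of inline display
        "Content-Disposition": 'attachment; filename="{}"'.format(filename),
        "Content-Length": str(size),
    }


headers = pdf_download_headers("report.pdf", 1024)
print(headers["Content-Disposition"])
```

In Django, the same values would typically be set by creating HttpResponse(pdf_bytes, content_type="application/pdf") and assigning response["Content-Disposition"], where pdf_bytes is the byte string that write_pdf() returns when no target is given, as the documentation quoted above describes.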
