I am retrieving data from an API that returns a JSON object with the following structure:
{
    "status":"OK",
    "text":{
        "doc_id":647508,
        "bill_id":502329,
        "date":"2012-05-23",
        "type":"Enrolled",
        "mime":"application/rtf",
        "doc":"MIME 64 Encoded Document"
    }
}
where the encoded document is a PDF file. Here is an example of the PDFs I am working with: https://legiscan.com/WA/text/HB1531/id/1473804/Washington-2017-HB1531-Introduced.pdf. I am trying to read the encoded file into a string object. So far I have been able to do so by decoding the response into bytes, writing them to disk, and then reading the PDF:
import PyPDF2
import base64
with open("sample.pdf", "wb") as f:
    inp_str = response.json()['text']['doc'].encode('utf-8')
    f.write(base64.b64decode(inp_str))
with open('sample.pdf', "rb") as f:
    pdf = PyPDF2.PdfFileReader(f)
This does not feel like a very efficient way to process multiple documents. I have tried following a related question (Is it possible to input pdf bytes straight into PyPDF2 instead of making a PDF file first):
read_pdf = PyPDF2.PdfFileReader(io.BytesIO(response.json()['text']['doc'].encode()))
but I always get the error PdfReadError: Could not read malformed PDF file
Is there any way to do this?
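A possible explanation, offered as a sketch rather than a confirmed diagnosis: the one-liner above wraps the Base64 text itself in BytesIO, so the reader never sees real PDF bytes. Decoding first and then wrapping the result avoids the temporary file. The helper name decode_document and the placeholder payload are assumptions made for illustration; they are not part of the API or of PyPDF2.

```python
import base64
import io


def decode_document(doc_b64: str) -> io.BytesIO:
    """Decode a Base64 string into an in-memory, file-like binary stream."""
    raw = base64.b64decode(doc_b64)
    return io.BytesIO(raw)


# Hypothetical stand-in for response.json()['text']['doc'].
# A real payload would be a complete PDF, not these placeholder bytes.
sample_doc = base64.b64encode(b"%PDF-1.4\n% placeholder bytes\n").decode("ascii")

stream = decode_document(sample_doc)
header = stream.read(5)  # PDF files begin with the bytes %PDF-
stream.seek(0)           # rewind before handing the stream to a PDF reader
```

With a real document, the rewound stream can be passed to PyPDF2.PdfFileReader(stream) in place of the open file used in the original code.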
For a personal project I would like to convert files of any type (pdf, png, mp3...) to bytes and then convert the bytes back to the original type.
I have done the first part, but I need help with the second part.
In the following example, I read a .jpg file as bytes and save its content in the "content" object. Now I would like to convert "content" (bytes type) back to the original .jpg type.
test_file = open("cadenas.jpg", "rb")
content = test_file.read()
content
b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x0 ...
Could you help me?
Regards
You can round-trip the bytes through Base64 encoding.
This should do the job:
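import base64
# Read the original file as bytes
with open('cadenas.jpg', 'rb') as test_file:
    content = test_file.read()
# encodestring() was removed in Python 3.9; encodebytes() is its replacement
content_encode = base64.encodebytes(content)
content_decode = base64.decodebytes(content_encode)
# Write the decoded bytes back out as a .jpg file
with open('cadenas2.jpg', 'wb') as result_file:
    result_file.write(content_decode)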
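Note that the Base64 step is optional here: bytes read in "rb" mode are already the file's exact content, so writing them back in "wb" mode restores the file. The following is a minimal sketch of that idea. The helper name copy_bytes, the temporary file names, and the short JPEG-like header are assumptions chosen so the example runs without a real image.

```python
import os
import tempfile


def copy_bytes(src_path, dst_path):
    """Read a file as bytes and write the same bytes to a new file."""
    with open(src_path, "rb") as src:
        content = src.read()
    with open(dst_path, "wb") as dst:
        dst.write(content)
    return content


# Create a small placeholder file so the sketch is self-contained.
workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, "original.bin")
dst_path = os.path.join(workdir, "restored.bin")
original = b"\xff\xd8\xff\xe0\x00\x10JFIF\x00"
with open(src_path, "wb") as f:
    f.write(original)

written = copy_bytes(src_path, dst_path)

# Read the restored file back to compare it with the original bytes.
with open(dst_path, "rb") as f:
    restored = f.read()
```

The file type is determined by the bytes themselves, so the same approach applies to pdf, png or mp3 files.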
My Jupyter notebook saves a styled dataframe to an Excel file. I then created a link to download this Excel file:
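from IPython.display import HTML, display
# to_excel() returns None, so do not assign its result back to df
df.to_excel('ABC.xlsx', index=True)
filename = 'ABC.xlsx'
file_link = "<a href='{href}' download='ABC.xlsx'> Download ABC.xlsx</a>"
html = HTML(file_link.format(href=filename))
display(html)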
But when I click on the link Download ABC.xlsx, I get: Failed: Network error.
In contrast, it works fine when I download a CSV file the same way.
Here is the CSV code. It includes some base64 encoding, without which the CSV code does not work either:
def func(df,title="Download csv file",filename="ABC.csv"):
    csv = df.to_csv(index=True)
    b64 = base64.b64encode(csv.encode())
    payload = b64.decode()
    html = '<a download="{filename}" href="data:text/csv;base64,{payload}">{title}</a>'
    html = html.format(payload=payload,title=title,filename=filename)
    return HTML(html)
I tried editing this function for the Excel file:
def func(df,title="Download excel file",filename="ABC.xlsx"):
    xls=df.to_excel("xyz.xlsx",index=True)
    b64 =base64.b64encode(xls.encode())
    payload=b64.decode()
    html = "{title}"
    html = html.format(payload=payload,title=title,filename=filename)
    return HTML(html)
For the Excel code it gives the error: 'NoneType' object has no attribute 'encode'
In your CSV code, you use csv=df.to_csv(index=True). According to the docs:
If path_or_buf is None, returns the resulting csv format as a string.
Otherwise returns None.
Here you didn't specify path_or_buf, so the return value is the CSV content. This is why you can download the CSV.
The to_excel docs don't mention any return value (it returns None), so your payload doesn't contain anything at all.
To solve this, you can open the written file again and read it as a base64-encoded string:
def file_to_base64(file):
    # file should be the actual file name you wrote
    with open(file, "rb") as image_file:
        encoded_string = base64.b64encode(image_file.read())
    return encoded_string.decode()
Then replace the two lines
b64 =base64.b64encode(xls.encode())
payload=b64.decode()
with:
payload = file_to_base64(file)
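To show how such a payload ends up in a download link, here is a minimal sketch that builds a data-URI anchor from raw bytes. The function name make_download_link, the XLSX_MIME constant and the placeholder bytes are assumptions for illustration. In a notebook the bytes would come from the Excel file written by to_excel, and the returned string would be wrapped in IPython's HTML().

```python
import base64
import html


def make_download_link(data: bytes, filename: str, mime: str, title: str) -> str:
    """Return an <a> tag that embeds the bytes as a Base64 data URI."""
    payload = base64.b64encode(data).decode("ascii")
    template = '<a download="{filename}" href="data:{mime};base64,{payload}">{title}</a>'
    return template.format(
        filename=html.escape(filename, quote=True),  # keep attribute value well-formed
        mime=mime,
        payload=payload,
        title=html.escape(title),
    )


# MIME type registered for .xlsx workbooks.
XLSX_MIME = "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"

# Placeholder bytes; a real call would pass the content of the written .xlsx file.
sample_bytes = b"PK\x03\x04 placeholder"
expected_payload = base64.b64encode(sample_bytes).decode("ascii")
link = make_download_link(sample_bytes, "ABC.xlsx", XLSX_MIME, "Download ABC.xlsx")
```

Because the file content travels inside the link itself, the browser does not need to fetch anything from the notebook server when the link is clicked.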
I am using PyPDF2 to generate a PDF, and I would like to upload this PDF to Cloudinary, which accepts images as IO objects.
The example from their docs: cloudinary.uploader.upload(open('/tmp/image1.jpg', 'rb'))
In my application, I instantiate a PdfFileWriter and add pages:
output = PyPDF2.PdfFileWriter()
output.addPage(page)
Then I can save the generated PDF locally:
with open(destination_file_name, "wb") as outputStream:
    output.write(outputStream)
But obviously I'm trying to avoid this. Instead I'm trying to send an IO object to cloudinary:
image_StringIO_object = StringIO.StringIO()
output.write(image_StringIO_object)
cloudinary.uploader.upload(image_StringIO_object,
                           api_key=CLOUDINARY_API_KEY,
                           api_secret=CLOUDINARY_API_SECRET,
                           cloud_name=CLOUDINARY_CLOUD_NAME,
                           format="PDF")
This returns the error:
Empty file
If instead I try to pass the value of the StringIO object:
cloudinary.uploader.upload(image_StringIO_object.getvalue(),
...)
I get the error:
file() argument 1 must be encoded string without null bytes, not str
Got the answer from Cloudinary support:
The result from getvalue() on the StringIO object needs to be base64 encoded and prepended with a tag:
out = StringIO.StringIO()
output.write(out)
cloudinary.uploader.upload("data:image/pdf;base64," +
                           base64.b64encode(out.getvalue()))
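As background on why getvalue() works where passing the stream did not, the sketch below shows how the position of an in-memory stream behaves after a write. It uses Python 3's io.BytesIO and placeholder bytes instead of a real PdfFileWriter. Whether the stream position is what triggered the "Empty file" error is an assumption, not something confirmed by Cloudinary.

```python
import io

buffer = io.BytesIO()
buffer.write(b"%PDF-1.4 placeholder bytes")  # stands in for output.write(buffer)

at_end = buffer.read()      # the position sits at the end after writing, so nothing is read
buffer.seek(0)              # rewind to the beginning of the stream
from_start = buffer.read()  # the full contents are now returned
whole = buffer.getvalue()   # getvalue() returns everything regardless of position
```

A consumer that calls read() on a stream that was just written to therefore sees no data unless the stream is rewound first.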
I am fetching the HTML source code of many pages from one website. I need to convert it into a JSON object and combine it with other elements in a JSON doc. I have seen many questions on the same topic, but none of them were helpful.
My code:
url = "https://totalhash.cymru.com/analysis/?1ce201cf28c6dd738fd4e65da55242822111bd9f"
htmlContent = requests.get(url, verify=False)
data = htmlContent.text
print("data",data)
jsonD = json.dumps(htmlContent.text)
jsonL = json.loads(jsonD)
ContentUrl='{ \"url\" : \"'+str(urls)+'\" ,'+"\n"+' \"uid\" : \"'+str(uniqueID)+'\" ,\n\"page_content\" : \"'+jsonL+'\" , \n\"date\" : \"'+finalDate+'\"}'
The above code gives me a unicode type; however, when I put that output into JSONLint, it gives me an invalid JSON error. Can somebody help me understand how I can convert the complete HTML into a JSON object?
jsonD = json.dumps(htmlContent.text) converts the raw HTML content into a JSON string representation.
jsonL = json.loads(jsonD) parses the JSON string back into a regular string/unicode object. This results in a no-op, as any escaping done by dumps() is reverted by loads(). jsonL contains the same data as htmlContent.text.
Try to use json.dumps to generate your final JSON instead of building the JSON by hand:
ContentUrl = json.dumps({
    'url': str(urls),
    'uid': str(uniqueID),
    'page_content': htmlContent.text,
    'date': finalDate
})
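To illustrate the difference, the following sketch compares hand-built JSON with json.dumps on a small HTML snippet that contains quotes, newlines and a backslash. The snippet and the example.com URL are made up for the demonstration.

```python
import json

# Hypothetical HTML fragment with characters that must be escaped in JSON.
html_text = '<div class="a">\n\t<p>He said "hi" \\ bye</p>\n</div>'

# Building JSON by string concatenation leaves the quotes and newlines unescaped.
hand_built = '{"page_content": "' + html_text + '"}'
try:
    json.loads(hand_built)
    hand_built_is_valid = True
except ValueError:
    hand_built_is_valid = False

# json.dumps escapes everything that needs escaping.
document = json.dumps({"url": "https://example.com/page", "page_content": html_text})
round_tripped = json.loads(document)["page_content"]
```

The hand-built string fails to parse, while the json.dumps output parses back to exactly the original HTML.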
The correct way to convert HTML source code to a JSON file on the local system is as follows:
import json
import codecs
# Load the JSON file by specifying the location and filename
with codecs.open(filename="json_file.json", mode="r", encoding="utf-8") as jsonf:
    json_file = json.loads(jsonf.read())
# Load the HTML file by specifying the location and filename
with codecs.open(filename="html_file.html", mode='r', encoding="utf-8") as htmlf:
    html_file = htmlf.read()
# Choose the key name where the HTML source code will live as a string
json_file['Key1']['Key2'] = html_file
# Dump the dictionary to a JSON string and save it in a specific location
json_object = json.dumps(json_file, indent=4)
with codecs.open(filename="final_json_file.json", mode="w", encoding="utf-8") as ojsonf:
    ojsonf.write(json_object)
Next, open the JSON file in your editor.
Press CTRL + H, and replace \n or \t characters by '' (nothing!).
Now you can parse your HTML file with codecs.open() function and do the operations.
I have a valid JSON (checked using Lint) in a text file.
I am loading the json as follows
test_data_file = open('jsonfile', "r")
json_str = test_data_file.read()
the_json = json.loads(json_str)
I have verified the json data in file on Lint and it shows it as valid. However the json.loads throws
ValueError: No JSON object could be decoded
I am a newbie to Python, so I am not sure how to do it the right way. Please help.
(I assume it has something to do with encoding the string from utf-8 to unicode, as the data in the file is retrieved as a string.)
I tried with open('jsonfile', 'r') and it works now.
Also I did the following on the file
json_newfile = open('json_newfile', 'w')
json_oldfile = open('json_oldfile', 'r')
old_data = json_oldfile.read()
json.dump(old_data, json_newfile)
and now I am reading the new file.
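One frequent cause of this error on a file that validates elsewhere is a UTF-8 byte order mark (BOM) at the start of the file. That the file above contained one is an assumption. The sketch below, written for Python 3, creates a small file with a BOM and reads it with the utf-8-sig codec, which strips the mark before the JSON parser sees the text. The temporary path and the sample content are placeholders.

```python
import json
import os
import tempfile

# Write a small JSON file that starts with a UTF-8 BOM (placeholder content).
path = os.path.join(tempfile.mkdtemp(), "jsonfile")
with open(path, "wb") as f:
    f.write(b"\xef\xbb\xbf" + b'{"status": "OK"}')

# utf-8-sig decodes UTF-8 and drops a leading BOM if one is present.
with open(path, "r", encoding="utf-8-sig") as f:
    the_json = json.load(f)
```

The same codec also reads files without a BOM, so it is a reasonable default when the origin of the file is unknown.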