Python: using xhtml2pdf to print a webpage to PDF

I am trying to use xhtml2pdf to print a webpage to a PDF file on the local disk. I found the example below.
It runs without returning an error, but it doesn't convert the webpage; it only writes a single sentence. In this case, only the text 'http://www.yahoo.com/' is written into the PDF file.
How can I actually convert the web page into a PDF?
from xhtml2pdf import pisa

sourceHtml = 'http://www.yahoo.com/'
outputFilename = "test.pdf"

def convertHtmlToPdf(sourceHtml, outputFilename):
    resultFile = open(outputFilename, "w+b")
    pisaStatus = pisa.CreatePDF(sourceHtml, resultFile)
    resultFile.close()
    return pisaStatus.err

if __name__ == "__main__":
    pisa.showLogging()
    convertHtmlToPdf(sourceHtml, outputFilename)

xhtml2pdf is not going to work with every website; for one, it does not work for yahoo.com. But the reason it is not working here is that you are passing the URL string to pisa rather than the actual HTML. You need to fetch the HTML first, for example with urllib2:
import urllib2

url = urllib2.urlopen('http://sheldonbrown.com/web_sample1.html')
srchtml = url.read()
pisa.showLogging()
convertHtmlToPdf(srchtml, outputFilename)
And it will work; that is a very simple sample HTML page.

Thanks to CT Zhu's help. Just putting down the working version here, for reference:
from xhtml2pdf import pisa
import urllib2

url = urllib2.urlopen('http://sheldonbrown.com/web_sample1.html')
sourceHtml = url.read()
pisa.showLogging()
outputFilename = "test555.pdf"

def convertHtmlToPdf(sourceHtml, outputFilename):
    resultFile = open(outputFilename, "w+b")
    pisaStatus = pisa.CreatePDF(sourceHtml, resultFile)
    resultFile.close()
    return pisaStatus.err

if __name__ == "__main__":
    pisa.showLogging()
    convertHtmlToPdf(sourceHtml, outputFilename)

Related

Exception: No parsed pages. Please parse page first

I am trying to read a whole PDF file that is more than 250 pages. For that, I first convert the PDF to .docx through the pdf2docx library.
Here is the code:
import requests
from docx import Document
from pdf2docx import Converter
from tika import parser  # assuming Apache Tika's Python client for parser.from_file

document = Document()
document.save('file.docx')

url = file_path  # (google drive url where file was uploaded)
response = requests.get(url)
my_raw_data = response.content

with open("my_pdf.pdf", 'wb') as my_data:
    my_data.write(my_raw_data)

open_pdf_file = open("my_pdf.pdf", 'rb')
cv = Converter(open_pdf_file)
cv.convert("roshni.docx")
cv.close()  # close the converter once the .docx has been written

Parse = parser.from_file("file.docx")
data = []
for i in Parse['content'].strip().split('\n'):
    if len(i.split()) < 5:
        pass
    else:
        data.append(i)
Text = data[1:-1]
But I am not able to read the file; I get an error like "Exception: No parsed pages. Please parse page first."
How can I solve this issue? How can I read a whole PDF using Python?

Python - Scraping a PDF file from a URL

I want to scrape PDF files from this site:
https://www.sigmaths.net/Reader.php?var=manuels/ph/physique_pilote_7b.pdf
I tried this code, but it doesn't work. Can anybody tell me why, please?
import requests

res = requests.get('https://www.sigmaths.net/Reader.php?var=manuels/ph/physique_7b.pdf')
with open('C:\\Users\\sioud\\Desktop\\Manuels scolaires TN\\1\\test.pdf', 'wb') as f:
    f.write(res.content)
import requests

res = requests.get('https://www.sigmaths.net/manuels/ph/physique_7b.pdf', stream=True)
with open('test.pdf', 'wb') as f:
    f.write(res.content)
Your URL is pointing to a reader, https://www.sigmaths.net/Reader.php?var=manuels/ph/physique_7b.pdf; remove the 'Reader.php?var=' part to get the actual PDF.
You can also use urlretrieve.
Check out my solution code.
from urllib.request import urlretrieve

pdfurl = "https://www.sigmaths.net/manuels/ph/physique_7b.pdf"
urlretrieve(pdfurl, "test.pdf")
And you will find the required PDF downloaded under the name test.pdf.

Convert PDF to .docx with Python

I'm trying very hard to find a way to convert a PDF file to a .docx file with Python.
I have seen other posts related to this, but none of them seem to work correctly in my case.
Specifically, I'm using:
import os
import subprocess

for top, dirs, files in os.walk('/my/pdf/folder'):
    for filename in files:
        if filename.endswith('.pdf'):
            abspath = os.path.join(top, filename)
            subprocess.call('lowriter --invisible --convert-to doc "{}"'
                            .format(abspath), shell=True)
This gives me Output[1], but then, I can't find any .docx document in my folder.
I have LibreOffice 5.3 installed.
Any clues about it?
Thank you in advance!
I am not aware of a way to convert a pdf file into a Word file using libreoffice.
However, you can convert the PDF to HTML and then convert the HTML to a .docx.
Firstly, get the commands running on the command line. (The following is on Linux, so on your OS you may have to fill in the path to the soffice binary and use a full path for the input file.)
soffice --convert-to html ./my_pdf_file.pdf
then
soffice --convert-to docx:'MS Word 2007 XML' ./my_pdf_file.html
You should end up with:
my_pdf_file.pdf
my_pdf_file.html
my_pdf_file.docx
Now wrap the commands in your subprocess code, for example as sketched below.
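A minimal sketch of that wrapping (untested; it assumes soffice is on the PATH and that the example file names from above sit in the working directory):
import subprocess

pdf_path = './my_pdf_file.pdf'
html_path = './my_pdf_file.html'

# Step 1: PDF -> HTML
subprocess.call(['soffice', '--convert-to', 'html', pdf_path])
# Step 2: HTML -> DOCX, using the same filter name as on the command line
subprocess.call(['soffice', '--convert-to', 'docx:MS Word 2007 XML', html_path])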
I use this for multiple files:
####
from pdf2docx import Converter
import os

# # # dir_path for input reading and output files & a for loop # # #
path_input = '/pdftodocx/input/'
path_output = '/pdftodocx/output/'

for file in os.listdir(path_input):
    cv = Converter(path_input + file)
    cv.convert(path_output + file + '.docx', start=0, end=None)
    cv.close()
    print(file)
The code below worked for me.
import win32com.client
word = win32com.client.Dispatch("Word.Application")
word.visible = 1
pdfdoc = 'NewDoc.pdf'
todocx = 'NewDoc.docx'
wb1 = word.Documents.Open(pdfdoc)
wb1.SaveAs(todocx, FileFormat=16) # file format for docx
wb1.Close()
word.Quit()
My approach does not follow the same methodology of calling external programs through subprocesses. However, it does the job of reading through all the pages of a PDF document and moving their text into a .docx file. Note: it only works with text; images and other objects are usually ignored.
# Description: this python script will allow you to fetch text information from a pdf file

# import libraries
import PyPDF2
import os
import docx

mydoc = docx.Document()  # target Word document
pdfFileObj = open('pdf/filename.pdf', 'rb')  # pdf file location
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)  # define pdf reader object

# Loop through all the pages (PyPDF2 pages are 0-indexed, so start at 0 to include the first page)
for pageNum in range(pdfReader.numPages):
    pageObj = pdfReader.getPage(pageNum)
    pdfContent = pageObj.extractText()  # extracts the text content from the page
    print(pdfContent)  # optional: print to the terminal to check the output
    mydoc.add_paragraph(pdfContent)  # add the content to the word document

mydoc.save("pdf/filename.docx")  # give a name to your output file
I have successfully done this with pdf2docx:
from pdf2docx import parse
pdf_file = "test.pdf"
word_file = "test.docx"
parse(pdf_file, word_file, start=0, end=None)

Python: downloading xml files in batch returns a damaged zip file

Drawing inspiration from this post, I am trying to download a bunch of xml files in batch from a website:
import urllib2

url = 'http://ratings.food.gov.uk/open-data/'
f = urllib2.urlopen(url)
data = f.read()
with open("C:\Users\MyName\Desktop\data.zip", "wb") as code:
    code.write(data)
The zip file is created within seconds, but as I attempt to access it, an error window comes up:
Windows cannot open the folder.
The Compressed (zipped) Folder "C:\Users\MyName\Desktop\data.zip" is invalid.
What am I doing wrong here?
You are not actually building a zip archive or writing any files into it; that URL just returns an HTML page listing the XML files. Scrape the individual links and write each file into a zip yourself:
import urllib2
from bs4 import BeautifulSoup
import zipfile

url = 'http://ratings.food.gov.uk/open-data/'
fileurls = []
f = urllib2.urlopen(url)
mainpage = f.read()

soup = BeautifulSoup(mainpage, 'html.parser')
tablewrapper = soup.find(id='openDataStatic')
for table in tablewrapper.find_all('table'):
    for link in table.find_all('a'):
        fileurls.append(link['href'])

with zipfile.ZipFile("data.zip", "w") as code:
    for url in fileurls:
        print('Downloading: %s' % url)
        f = urllib2.urlopen(url)
        data = f.read()
        xmlfilename = url.rsplit('/', 1)[-1]
        code.writestr(xmlfilename, data)
You are doing nothing to encode this as a zip file; you have simply saved the page's content with a .zip extension. If you open it in a plain text editor such as Notepad, you will see the raw markup.
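As a quick check (a minimal sketch), zipfile.is_zipfile confirms that the downloaded bytes are not actually a zip archive:
import zipfile

# The bytes written above are the page's markup, not an archive, so this prints False:
print(zipfile.is_zipfile("data.zip"))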

Unable to open file object created in Django

I'm developing an application that runs on an Apache server with the Django framework. My current script works fine when it runs on the local desktop (without Django). The script downloads all the images from a website to a folder on the desktop. However, when I run the script on the server, Django creates a file that apparently has something in it (it should be Google's logo), but I can't open the file. I also create an HTML file with the image links updated to their new locations, and that file gets created fine, I'm assuming because it's all text. I believe I may have to use a file wrapper somewhere, but I'm not sure. Any help is appreciated; my code is below. Thanks!
from django.http import HttpResponse
from bs4 import BeautifulSoup as bsoup
import urlparse
from urllib2 import urlopen
from urllib import urlretrieve
import os
import sys
import zipfile
from django.core.servers.basehttp import FileWrapper

def getdata(request):
    out = 'C:\Users\user\Desktop\images'
    if request.GET.get('q'):
        #url = str(request.GET['q'])
        url = "http://google.com"
        soup = bsoup(urlopen(url))
        parsedURL = list(urlparse.urlparse(url))
        for image in soup.findAll("img"):
            print "Old Image Path: %(src)s" % image
            #Get file name
            filename = image["src"].split("/")[-1]
            #Get full path name if url has to be parsed
            parsedURL[2] = image["src"]
            image["src"] = '%s\%s' % (out, filename)
            print 'New Path: %s' % image["src"]
            # print image
            outpath = os.path.join(out, filename)
            #retrieve images
            if image["src"].lower().startswith("http"):
                urlretrieve(image["src"], outpath)
            else:
                urlretrieve(urlparse.urlunparse(parsedURL), out)  #Constructs URL from tuple (parsedURL)
        #Create HTML File and writes to it to check output (stored in same directory).
        html = soup.prettify("utf-8")
        with open("output.html", "wb") as file:
            file.write(html)
    else:
        url = 'You submitted nothing!'
    return HttpResponse(url)
My problem had to do with storing the files on the desktop. I stored the files in the Django workspace folder instead, changed the paths, and it worked for me.
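For reference, a minimal sketch of building the output folder inside the project rather than on the desktop (this assumes a BASE_DIR setting, as in the default Django settings.py template):
import os
from django.conf import settings

# Keep downloads inside the project so the Apache/Django process can write to them.
out = os.path.join(settings.BASE_DIR, 'images')
if not os.path.isdir(out):
    os.makedirs(out)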
