I am trying to convert thousands of PDF files to HTML. I was able to convert this PDF file to this HTML file using the following code:
import os

def convertPDFToHtml():
    command = 'pdf2txt.py -o output.html -t html test.pdf'
    os.system(command)
I want to be able to parse the HTML file so that I can extract different texts from it. The problem now is that the output HTML file is missing a lot of text from the original file.
Is there a better way to convert the PDF files and parse the HTML text?
This is possibly a similar problem to the one discussed here, unless you specifically want to generate HTML files. But even so, you could first extract the text from the PDFs as simple unformatted text, parse it, and then generate the HTML files.
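To illustrate the suggestion above, here is a minimal sketch of extracting plain text with pdfminer.six (the library that ships `pdf2txt.py`). The function name and the `test.pdf` filename are just placeholders matching the question; the import is done lazily so the function can be defined even when the package is not installed.

```python
def pdf_to_text(pdf_path):
    """Extract plain, unformatted text from a PDF.

    Uses pdfminer.six's high-level API (pip install pdfminer.six);
    imported lazily inside the function.
    """
    from pdfminer.high_level import extract_text
    return extract_text(pdf_path)

# Example usage (assumes a local test.pdf, as in the question):
# text = pdf_to_text('test.pdf')
# for line in text.splitlines():
#     ...  # parse each line, then emit whatever HTML you need
```

Parsing the plain text is usually more reliable than parsing pdf2txt's HTML output, since the HTML conversion is where the text loss happens.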
Related
I'm trying to convert a large HTML file to PDF. So far I have tried two approaches, with the same undesired result.
I execute this on the terminal:
wkhtmltopdf my_html_file.html my_pdf_file.pdf
On the other hand, I tried the conversion inside a Python script:
import pdfkit

with open('my_html_file.html') as f:
    pdfkit.from_file(f, 'my_pdf_file.pdf')
In both cases the output file generated (my_pdf_file.pdf) only has 1 page and does not contain all the content from the HTML file.
I'm trying to extract elements from a number of different HTML files using findall and put them into a new HTML file. So far I have:
news = ['16-10-2017.html', '17-10-2017.html', '18-10-2017.html', '19-10-2017.html', '21-10-2017.html', '22-10-2017.html']
def extracted():
    raw_news = open(news, 'r', encoding='UTF-8')
I'm creating a function that will be able to read these files and extract specific parts so I can put them into a new HTML file, but I'm not sure if this code for reading the files is correct. How would I be able to extract elements from these files?
You need to loop over the list and open the files one at a time (as written, Python will complain that it expected a 'string' and got a 'list' instead). Once you are inside the loop, you can operate on each file, extract the text you want, and store it in some other data structure.
Change your working directory to the directory where you have these files and then:
def extracted(news):
    for page in news:
        raw_news = open(page, 'r', encoding='UTF-8')
        # Now you have raw_news from one page and you can operate on it.
        # On the next pass of the loop, the same code runs on the next html file.
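Putting the pieces together, here is a self-contained sketch of the whole task using `re.findall` (which the question mentions). The `<h2>` pattern and the `extracted.html` output name are only examples; swap in whatever elements you actually want from your pages.

```python
import re

def extract_elements(paths, pattern=r'<h2>(.*?)</h2>'):
    """Loop over HTML files and collect every match of `pattern`.

    The <h2> pattern is just an illustration; adjust it to the
    elements you need.
    """
    found = []
    for path in paths:
        with open(path, 'r', encoding='UTF-8') as fp:
            found.extend(re.findall(pattern, fp.read(), re.DOTALL))
    return found

def write_news_page(items, out_path='extracted.html'):
    """Write the collected items into a new HTML file."""
    with open(out_path, 'w', encoding='UTF-8') as fp:
        fp.write('<html><body>\n')
        for item in items:
            fp.write('<p>{}</p>\n'.format(item))
        fp.write('</body></html>\n')
```

Note that regexes only go so far with HTML; for anything beyond simple, predictable markup, a real parser such as BeautifulSoup is the safer choice.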
I have a problem extracting text from .docx files after removing the tables.
The docx files I'm dealing with contain a lot of tables that I would like to get rid of before extracting the text.
I first use docx2html to convert a docx file to html, and then use BeautifulSoup to remove the table tag and extract the text.
from docx2html import convert
from bs4 import BeautifulSoup
...
temp = convert(FileToConvert)
soup = BeautifulSoup(temp, 'html.parser')
for table in soup.find_all('table'):
    table.decompose()
Text = soup.get_text()
While this process works and produces what I need, there is an efficiency issue with docx2html.convert(). Since .docx files are in fact .xml files, would it be possible to skip the procedure of converting docx into HTML and just extract text from the XML after removing the tables?
docx files are not just xml files but rather a zipped XML-based format, so you won't be able to pass a docx file directly to BeautifulSoup. The format seems pretty simple though, as the zipped docx contains a file called word/document.xml, which is probably the XML file you want to parse. You can use Python's zipfile module to extract this file and pass its contents directly to BeautifulSoup:
import sys
import zipfile
from bs4 import BeautifulSoup

with zipfile.ZipFile(sys.argv[1], 'r') as zfp:
    with zfp.open('word/document.xml') as fp:
        soup = BeautifulSoup(fp.read(), 'xml')
print(soup)
However, you might also want to look at https://github.com/mikemaccana/python-docx, which might do a lot of what you want already. I haven't tried it so I can't vouch for its suitability for your specific use-case.
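Since the question is specifically about skipping tables, here is a standard-library-only sketch that combines the zipfile approach with `xml.etree.ElementTree`, ignoring anything inside a `w:tbl` element. It assumes the standard WordprocessingML namespace; the function name is made up for illustration.

```python
import zipfile
import xml.etree.ElementTree as ET

# Standard WordprocessingML namespace used by word/document.xml
W_NS = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'

def docx_text_without_tables(docx_path):
    """Extract the text of a .docx, skipping anything inside a table.

    Reads word/document.xml straight out of the zip and walks the
    XML tree, ignoring w:tbl subtrees entirely.
    """
    with zipfile.ZipFile(docx_path) as zfp:
        root = ET.fromstring(zfp.read('word/document.xml'))

    chunks = []

    def walk(elem):
        if elem.tag == W_NS + 'tbl':
            return  # skip the whole table subtree
        if elem.tag == W_NS + 't' and elem.text:
            chunks.append(elem.text)  # w:t elements hold the actual text
        for child in elem:
            walk(child)

    walk(root)
    return ''.join(chunks)
```

This avoids both the docx2html conversion and the BeautifulSoup dependency, which should address the efficiency concern in the question.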
I want to add text or annotations to an existing PDF file to explain some key words.
At first I tried pyPdf & reportlab to merge the original PDF file with a newly generated interpretation PDF, but it doesn't work: the original file covers up all the words of the interpretation PDF and makes the new content invisible. I don't know why. If I merge two newly generated interpretation PDFs into one, it works fine.
So I am thinking of trying another way: inserting just an annotation into the existing PDF file with Python. Does anybody with related experience have a suggestion? Thanks!
Adding a watermark to an existing PDF using pyPdf certainly works for me:
template = PdfFileReader(open("template.pdf", "rb"))  # the original pdf
new = PdfFileReader(open("new.pdf", "rb"))  # the overlay pdf ("new.pdf" is a placeholder name)
output = PdfFileWriter()  # writer for the merged pdf

for i in range(new.getNumPages()):
    page = template.getPage(i)
    page.mergePage(new.getPage(i))
    output.addPage(page)

with open("merged.pdf", "wb") as fp:
    output.write(fp)
Read my other SO answer for reference.
Read my complete article to know more about pdf generation and merging in python.
Sorry, I'm a noob in Python.
I need to create a PDF file without using any existing PDF file (purely create a new one).
I have been googling, but most results are about merging two PDFs or creating a new file copied from a particular page of another file. What I want to achieve is a report page (with charts), but as a first step, something simple: how do I insert a string into my PDF file? ("Hello world", maybe.)
This is my code to make a new PDF file with a single blank page:
from pyPdf import PdfFileReader, PdfFileWriter

op = PdfFileWriter()

# add a blank page
op.addBlankPage(200, 200)

# how to add a string here and insert it into my blank page?

ops = open("document-output.pdf", "wb")
op.write(ops)
ops.close()
You want "pisa" or "reportlab" for generating arbitrary PDF documents, not "pypdf".
http://www.xhtml2pdf.com/doc/pisa-en.html
http://www.reportlab.org
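For the "hello world" first step the question asks about, a minimal sketch with reportlab's canvas API looks like this. The output filename and coordinates are arbitrary placeholders; the import is done lazily so the function can be defined even when reportlab is not installed.

```python
def hello_world_pdf(out_path='hello.pdf'):
    """Create a brand-new one-page PDF containing a string.

    Uses reportlab's pdfgen canvas (pip install reportlab),
    imported lazily inside the function.
    """
    from reportlab.pdfgen import canvas
    c = canvas.Canvas(out_path)
    # Coordinates are in points, measured from the bottom-left corner.
    c.drawString(100, 750, 'Hello World')
    c.save()

# hello_world_pdf()  # writes hello.pdf in the current directory
```

From there, reportlab's higher-level Platypus layer handles flowable text, tables, and charts for report pages.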
Also check out the pyfpdf library. I've used the php port of this library for a few years and it's quite flexible, allowing you to work with flowable text, lines, rectangles, and images.
http://code.google.com/p/pyfpdf