For pages with tabular data in landscape format, the words in the HTML output overlap. For pages in portrait format, the conversion is successful. Any ideas how to fix that?
[Here is an example with the converted pdf to html in landscape format]
[1]: https://i.stack.imgur.com/twbzw.png
[2]: https://i.stack.imgur.com/Ln56P.png
import ntpath
from pathlib import Path
import fitz

doc = fitz.open(in_path)  # open document
out = open(in_path + ".html", "wb")  # open text output
for page in doc:  # iterate the document pages
    page.read_contents()
    text = page.get_text('html', clip=None).encode("utf8")
    out.write(text)  # write text of page
    out.write(bytes((12,)))  # write page delimiter (form feed 0x0C)
out.close()
Related
The code below prints the PDF text perfectly in the console using print(page.extract_text()). I can copy the text from the console and paste it into Word, and the formatting is preserved. However, if I use docx to save the text to a Word document with document.save("word.docx"), the formatting changes and I have to correct it manually. Does anyone know how to save the PDF text while preserving the formatting?
import glob
import os

import pdfplumber  # missing from the original snippet
from docx import Document

# raw string so the backslashes are not treated as escape sequences
document = Document(r"C:\path\Test.docx")
path = "C:\\path\\"
os.chdir(path)
for pdf_file in glob.glob("*.pdf"):
    print(pdf_file)
    with pdfplumber.open(pdf_file) as pdf:
        for page in pdf.pages:
            print(page.extract_text())
            text = page.extract_text()
            content = document.add_paragraph(text)
document.save("word.docx")
I want to first extract all the text from 1 pdf file and store it into one text file.
Here is my code:
import PyPDF2
from pathlib import Path

with Path('C:/Users/Lui/Desktop/Test/file1.pdf').open(mode='rb') as pdf_file, open('Extracted/extractPDF.txt', 'w') as text_file:
    read_pdf = PyPDF2.PdfFileReader(pdf_file)
    number_of_pages = read_pdf.getNumPages()
    print(number_of_pages)
    for page_number in range(number_of_pages):  # use xrange in Py2
        page = read_pdf.getPage(page_number)
        page_content = page.extractText()
        print(page_content)
        text_file.write(page_content)
The pdf looks like this:
However, the text file created looks different in comparison with missing words and spacing:
What am I doing wrong? My goal is to then loop through 1,000 PDFs, so I'm trying to get one example working first.
Try using pdftotext:

import pdftotext

# Load your PDF
with open(filename, "rb") as f:
    pdf = pdftotext.PDF(f)

# If it's password-protected
#with open("secure.pdf", "rb") as f:
#    pdf = pdftotext.PDF(f, "secret")

# How many pages?
#print(len(pdf))

# Iterate over all the pages
#for page in pdf:
#    print(page)

# Read all the text into one string
data = "\n\n".join(pdf)
print(data)
This package works far better and should help you out.
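Since the stated goal is to loop through about 1,000 PDFs, here is a minimal sketch of a batch driver. The names extract_all and extract_with_pdftotext are hypothetical, and the extractor is passed in as a function so the same loop works with whichever library gives the best text. It assumes all PDFs sit directly in one folder.

```python
from pathlib import Path
from typing import Callable


def extract_all(src_dir: str, dst_dir: str, extract: Callable[[Path], str]) -> int:
    """Write one .txt file per PDF found in src_dir; return how many were converted."""
    out_dir = Path(dst_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    count = 0
    for pdf_path in sorted(Path(src_dir).glob("*.pdf")):
        text = extract(pdf_path)
        # file1.pdf -> file1.txt in the output folder
        (out_dir / (pdf_path.stem + ".txt")).write_text(text, encoding="utf-8")
        count += 1
    return count


def extract_with_pdftotext(pdf_path: Path) -> str:
    """Extractor based on the pdftotext snippet above (imported only when called)."""
    import pdftotext  # third-party package, must be installed separately

    with pdf_path.open("rb") as f:
        return "\n\n".join(pdftotext.PDF(f))
```

A call would look like extract_all("C:/Users/Lui/Desktop/Test", "Extracted", extract_with_pdftotext). Wrapping the extract call in a try/except may be worthwhile so that one damaged file does not stop the whole run.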
I'm trying to get the table of contents from a PDF, and I'm using PyMuPDF for that. But it only extracts the ToC if the PDF contains bookmarks. Otherwise it just returns an empty list.
def get_Table_Of_Contents(doc):
    toc = doc.getToC()
    return toc

toc = get_Table_Of_Contents(file)
toc
Usually the TOC is just regular text on a page.
Try pdfreader to extract the text and/or the PDF "markdown".
Here is sample code that extracts both from a page:
from pdfreader import SimplePDFViewer, PageDoesNotExist
fd = open(your_pdf_file_name, "rb")
viewer = SimplePDFViewer(fd)
# navigate to TOC
viewer.navigate(toc_page_number)
viewer.render()
pdf_markdown = viewer.canvas.text_content
plain_text = "".join(viewer.canvas.strings)
Then you can parse plain_text or pdf_markdown as regular strings.
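As a rough illustration of that parsing step, the sketch below pulls (title, page) pairs out of TOC text. It assumes the text has already been split into one entry per line, each ending in a page number, optionally preceded by dot leaders. Real TOC pages vary a lot, so the pattern and the name parse_toc_lines are only a starting point.

```python
import re

# Matches lines such as "1.2 Background ........ 7", "Appendix A....12" or "Introduction 3".
TOC_LINE = re.compile(r"^\s*(?P<title>.+?)(?:\s*\.{2,}\s*|\s+)(?P<page>\d+)\s*$")


def parse_toc_lines(text):
    """Return a list of (title, page_number) tuples for lines that look like TOC entries."""
    entries = []
    for line in text.splitlines():
        match = TOC_LINE.match(line)
        if match:
            entries.append((match.group("title").strip(), int(match.group("page"))))
    return entries
```

Lines without a trailing number (for example a "Contents" heading) are skipped. Any ordinary line that happens to end in a number is treated as an entry, so the result may need manual filtering.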
Convert the PDF to HTML using a PDF-to-HTML converter. You can then parse the HTML to extract whatever data you want using a parser like BeautifulSoup.
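A minimal sketch of the HTML-parsing idea follows. It uses the standard library's html.parser instead of BeautifulSoup to avoid an extra dependency, and it assumes the converter wraps each line or block of text in a <p> element, which may not hold for every converter. ParagraphText and html_paragraphs are hypothetical names.

```python
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collect the visible text of each <p> element in document order."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_p = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self._parts = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            # join the pieces and collapse runs of whitespace
            text = " ".join("".join(self._parts).split())
            if text:
                self.paragraphs.append(text)
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self._parts.append(data)


def html_paragraphs(html):
    """Return the text of every <p> element found in the HTML string."""
    parser = ParagraphText()
    parser.feed(html)
    parser.close()
    return parser.paragraphs
```

The resulting list of strings can then be searched for TOC-like lines in the same way as plain text.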
I am splitting a PDF file by page ranges. My code works fine, but the problem is that the new PDF file does not preserve the format of the original PDF.
How can I split a PDF file without losing the original format?
from PyPDF2 import PdfFileReader, PdfFileWriter

# split range
pgi = 30  # start
pgf = 37  # end

pdf_document = "test.pdf"
pdf = PdfFileReader(pdf_document)
pdf_writer = PdfFileWriter()
for page in range(pgi - 1, pgf):
    current_page = pdf.getPage(page)
    pdf_writer.addPage(current_page)
with open(f'test-{pgi}-{pgf}.pdf', "wb") as out:
    pdf_writer.write(out)
PS: I don't have this problem when I use Adobe or MasterPdfEditor software.
The image below shows the original and new pdf format.
I want to extract text content from PPT and PDF files in Python.
Extracting the text with PPTX works fine, but with PyPDF2, extract_text() also returns the text inside charts and tables in the PDF, which I don't want.
I have tried different things but can't figure out a way to achieve this. Is there any way this can be done? Please find the code below.
import ntpath
import os
import glob
import PyPDF2
import pandas as pd
from pptx import Presentation

df_header = pd.DataFrame(columns=['Document_Name', 'Document_Type', 'Page_No', 'Text', 'Report Name'])
df_header.to_csv('Downloads\\FinalSample.csv', mode='a', header=True)

for eachfile in glob.glob("D:\\CP US People-Centric Hub (19-SCP-3063)\\Reports\\*\\*"):
    file1 = eachfile.split("\\")
    report_name = file1[3]  # folder directly under Reports
    if eachfile.endswith(".pptx"):
        data = []
        prs = Presentation(eachfile)
        for slide in prs.slides:
            text_runs = ''
            slide_num = prs.slides.index(slide) + 1
            for shape in slide.shapes:
                if not shape.has_text_frame:
                    continue
                for paragraph in shape.text_frame.paragraphs:
                    text_runs = text_runs + ' ' + paragraph.text
            data.append([ntpath.basename(eachfile), 'PPT', slide_num, text_runs, report_name])
        df_ppt = pd.DataFrame(data)
        df_ppt.to_csv('Downloads\\FinalSample.csv', mode='a', header=False)
    elif eachfile.endswith(".pdf"):
        data1 = []
        pdfFileObj = open(eachfile, 'rb')
        pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
        outlines = pdfReader.getOutlines()
        for pageNum in range(pdfReader.numPages):
            data1.append([ntpath.basename(eachfile), 'PDF', pageNum + 1, pdfReader.getPage(pageNum).extractText(), report_name])
        df_pdf = pd.DataFrame(data1)
        df_pdf.to_csv('Downloads\\FinalSample.csv', mode='a', header=False)
        pdfFileObj.close()
No, sorry: extracting just the body text from a PDF and omitting figure titles, footnotes, headers, footers, page numbers etc. isn't possible in general. This is because "body text" isn't really a defined concept in the PDF format.
You could however dig into the library and add some heuristics targeting figure captions e.g. to discard blocks of text that follow a large gap without text, or are too short (but what about titles?), or perhaps where the font size is much smaller than the mean.
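To make the heuristics concrete, here is a small sketch that filters already-extracted text blocks. It assumes each block is a dict with hypothetical keys 'text' and 'font_size'; PyPDF2 does not hand you blocks in this shape, so you would have to build them from whatever layout information your library exposes. The thresholds are arbitrary starting values, and the name keep_body_blocks is made up for illustration.

```python
from statistics import mean


def keep_body_blocks(blocks, min_chars=40, size_ratio=0.9):
    """Drop blocks that are very short or set in a font much smaller than the mean size."""
    if not blocks:
        return []
    avg_size = mean(block["font_size"] for block in blocks)
    kept = []
    for block in blocks:
        too_short = len(block["text"].strip()) < min_chars
        too_small = block["font_size"] < size_ratio * avg_size
        if not (too_short or too_small):
            kept.append(block)
    return kept
```

As noted above, the length test also throws away titles and headings, so it may be worth exempting blocks whose font is larger than the mean.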