issues in extracting text from PyPDF2

I am trying to read a PDF using PyPDF2 in Python.
I have been using the code below to read the PDF.
import PyPDF2
filename = r'./abc.pdf'
pdf = open(filename,'rb')
pdfReader= PyPDF2.PdfFileReader(pdf)
page = pdfReader.getPage(0)
text = page.extractText()
print(text)
It displays the text if I execute it on the Python command line.
It does not display any text if I save it as a script (with a .py file extension) and run the script.

Related

How to read pdf text and write to a document while conserving formatting using python?

The code below prints the PDF text perfectly in the console using print(page.extract_text()). I can copy the text from the console and paste it into Word, and the formatting is conserved. However, if I use python-docx to save the text to a Word document with document.save("word.docx"), the formatting changes and I have to correct it manually. Does anyone know how to save the PDF text while conserving the formatting?
import glob
import os

import pdfplumber
from docx import Document

document = Document(r"C:\path\Test.docx")  # raw string keeps the backslashes literal
path = "C:\\path\\"
os.chdir(path)
for pdf_file in glob.glob("*.pdf"):
    print(pdf_file)
    with pdfplumber.open(pdf_file) as pdf:
        for page in pdf.pages:
            text = page.extract_text()
            print(text)
            content = document.add_paragraph(text)
document.save("word.docx")

How to attach a file to an existing pdf?

I have a PDF created with LaTeX and I have to attach a file to it.
I've tried this code, but the output is a blank PDF, although the file was indeed attached. I have also tried to merge the output file with the original file, but the merged output loses the attachment.
import PyPDF2 as pdf2
from PyPDF2 import PdfFileReader, PdfFileWriter

output = pdf2.PdfFileWriter()
with open("test.pdf", "rb") as f:
    input_pdf = pdf2.PdfFileReader(f)
    output.appendPagesFromReader(input_pdf)
    with open("test.xml", "rb") as xmlFile:
        output.addAttachment("test.xml", xmlFile.read())
    # Write while test.pdf is still open, since the page data is read from it.
    # "wb" instead of "ab": appending to an existing file produces an invalid PDF.
    with open("test2.pdf", "wb") as out:
        output.write(out)
Solution (partial):
I wrote this in the LaTeX file that generates the PDF, and the file was embedded in (attached to) the PDF. But I couldn't find a way to do it in Python.
\documentclass{article}
\usepackage{embedfile}
\begin{document}
\embedfile{test-ansi.tex}
text
\end{document}

PyPDF2 does not always extract text from pdf file

I am trying to extract text from a PDF file using the Python package PyPDF2. The code below works fine for most PDF files. For this file, however, I get the following warning and no text is extracted at all:
PdfReadWarning: Xref table not zero-indexed. ID numbers for objects will be corrected. [pdf.py:1736]
import PyPDF2
# Import and read pdf file
file_name = 'path/to/file.pdf'
pdfFileObj = open(file_name, 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj, strict=True)
# Extract text from pdf file
pageObj = pdfReader.getPage(1)
text = pageObj.extractText()
print(text)

Convert the page data extracted from pdf file into csv file using PyPDF2

I have extracted data from a pdf file, but I am unable to convert it into a csv file
import PyPDF2 as PDF
PDFfile = open("path", "rb")
pdfread = PDF.PdfFileReader(PDFfile)
page = pdfread.getPage(0)
Page_content = page.extractText()
After extracting the text from a particular page, I want to export it to a csv file. How to do this?
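One possible approach, as a minimal sketch using only the standard csv module. It assumes that each line of the extracted text is one record and that fields are separated by whitespace, which may not hold for your PDF. The helper names text_to_rows and write_csv are made up for this example, and sample_text stands in for Page_content.

```python
import csv
import io


def text_to_rows(text):
    """Split extracted page text into rows: one row per non-empty line,
    one field per whitespace-separated token (an assumption about the layout)."""
    return [line.split() for line in text.splitlines() if line.strip()]


def write_csv(rows, stream):
    """Write the rows to any text stream (an open file or io.StringIO)."""
    csv.writer(stream, lineterminator="\n").writerows(rows)


# Stand-in for Page_content = page.extractText()
sample_text = "name qty price\napple 3 1.20\n\npear 10 0.85\n"

rows = text_to_rows(sample_text)
buffer = io.StringIO()
write_csv(rows, buffer)
csv_output = buffer.getvalue()
print(csv_output)
```

To write to disk instead of a buffer, pass a file opened with open("page.csv", "w", newline="") to write_csv. If the fields themselves contain spaces, the splitting rule in text_to_rows has to be adapted to the actual layout.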

Convert PDF to .docx with Python

I'm trying very hard to find the way to convert a PDF file to a .docx file with Python.
I have seen other posts related with this, but none of them seem to work correctly in my case.
I'm using specifically
import os
import subprocess
for top, dirs, files in os.walk('/my/pdf/folder'):
    for filename in files:
        if filename.endswith('.pdf'):
            abspath = os.path.join(top, filename)
            subprocess.call('lowriter --invisible --convert-to doc "{}"'
                            .format(abspath), shell=True)
This gives me Output[1], but then, I can't find any .docx document in my folder.
I have LibreOffice 5.3 installed.
Any clues about it?
Thank you in advance!

I am not aware of a way to convert a PDF file into a Word file using LibreOffice.
However, you can convert the PDF to HTML and then convert the HTML to docx.
First, get the commands running on the command line. (The following is on Linux, so on your OS you may have to fill in the path to the soffice binary and use a full path for the input file.)
soffice --convert-to html ./my_pdf_file.pdf
then
soffice --convert-to docx:'MS Word 2007 XML' ./my_pdf_file.html
You should end up with:
my_pdf_file.pdf
my_pdf_file.html
my_pdf_file.docx
Now wrap the commands in your subprocess code.
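A minimal sketch of that wrapping, assuming soffice is on the PATH and that the two commands above work for your file. The function names build_commands and convert are made up for this example; --outdir is added so that the output lands next to the input file regardless of the current directory.

```python
import subprocess
from pathlib import Path


def build_commands(pdf_path, soffice="soffice"):
    """Return the two soffice command lines: PDF -> HTML, then HTML -> DOCX."""
    pdf_path = Path(pdf_path)
    out_dir = str(pdf_path.parent)
    html_path = pdf_path.with_suffix(".html")
    return [
        [soffice, "--convert-to", "html", "--outdir", out_dir, str(pdf_path)],
        # No shell is involved here, so the filter name needs no extra quotes.
        [soffice, "--convert-to", "docx:MS Word 2007 XML",
         "--outdir", out_dir, str(html_path)],
    ]


def convert(pdf_path, soffice="soffice"):
    """Run both conversions in order; raises CalledProcessError if one fails."""
    for command in build_commands(pdf_path, soffice):
        subprocess.run(command, check=True)


# Only build and show the commands here; call convert(...) to actually run them.
commands = build_commands("docs/my_pdf_file.pdf")
for command in commands:
    print(command)
```

Passing the arguments as a list avoids shell=True and the quoting problems that come with file names containing spaces. Inside the os.walk loop from the question, convert(abspath) would replace the subprocess.call line.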

I use this for multiple files:
from pdf2docx import Converter
import os

# Directories for the input PDFs and the output .docx files
path_input = '/pdftodocx/input/'
path_output = '/pdftodocx/output/'
for file in os.listdir(path_input):
    if not file.lower().endswith('.pdf'):
        continue  # skip anything that is not a PDF
    name = os.path.splitext(file)[0]  # file name without the .pdf extension
    cv = Converter(os.path.join(path_input, file))
    cv.convert(os.path.join(path_output, name + '.docx'), start=0, end=None)
    cv.close()
    print(file)

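The code below worked for me.
import os
import win32com.client

word = win32com.client.Dispatch("Word.Application")
word.visible = 1
# Word resolves relative paths against its own working directory, so pass absolute paths
pdfdoc = os.path.abspath('NewDoc.pdf')
todocx = os.path.abspath('NewDoc.docx')
wb1 = word.Documents.Open(pdfdoc)
wb1.SaveAs(todocx, FileFormat=16)  # 16 = wdFormatDocumentDefault (.docx)
wb1.Close()
word.Quit()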

My approach does not call an external program like the ones above. Instead, it reads through all the pages of a PDF document and writes their text to a docx file. Note: it only works with text; images and other objects are usually ignored.
# Description: this Python script extracts the text from a PDF file and writes it to a .docx file
import PyPDF2
import docx

mydoc = docx.Document()  # new, empty Word document
pdfFileObj = open('pdf/filename.pdf', 'rb')  # PDF file location
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)  # PDF reader object
# Loop through all the pages; page numbers start at 0
for pageNum in range(pdfReader.numPages):
    pageObj = pdfReader.getPage(pageNum)
    pdfContent = pageObj.extractText()  # extract the text from the page
    print(pdfContent)  # optional: check the output in the terminal
    mydoc.add_paragraph(pdfContent)  # add the text to the Word document
pdfFileObj.close()
mydoc.save("pdf/filename.docx")  # name of the output file

I have successfully done this with pdf2docx:
from pdf2docx import parse
pdf_file = "test.pdf"
word_file = "test.docx"
parse(pdf_file, word_file, start=0, end=None)
