Extracting PDF text into a text file using Python - Extraction error

I want to extract all the text from one PDF file and store it in a single text file.
Here is my code:
import PyPDF2
from pathlib import Path
with Path('C:/Users/Lui/Desktop/Test/file1.pdf').open(mode='rb') as pdf_file, open('Extracted/extractPDF.txt', 'w') as text_file:
    read_pdf = PyPDF2.PdfFileReader(pdf_file)
    number_of_pages = read_pdf.getNumPages()
    print(number_of_pages)
    for page_number in range(number_of_pages): # use xrange in Py2
        page = read_pdf.getPage(page_number)
        page_content = page.extractText()
        print(page_content)
        text_file.write(page_content)
The pdf looks like this:
However, the text file that gets created looks different: words and spacing are missing:
What am I doing wrong? My goal is to loop through 1,000 PDFs afterwards, so I'm trying to get one example working first.

Try using pdftotext:
import pdftotext
# Load your PDF (filename is the path to it)
with open(filename, "rb") as f:
    pdf = pdftotext.PDF(f)
# If it's password-protected
#with open("secure.pdf", "rb") as f:
#    pdf = pdftotext.PDF(f, "secret")
# How many pages?
#print(len(pdf))
# Iterate over all the pages
#for page in pdf:
#    print(page)
# Read all the text into one string
data = "\n\n".join(pdf)
print(data)
This package usually extracts text far more reliably than PyPDF2's extractText() and should help you out.
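Since the goal is to process about 1,000 PDFs, here is a minimal sketch of how the loop could look once a single file works. The names extract_folder and pdftotext_extract are made up for illustration, and it assumes all PDFs sit directly in one folder. The extractor is passed in as a function, so you can swap pdftotext for another library without touching the loop.

```python
from pathlib import Path
from typing import Callable, List, Tuple


def extract_folder(pdf_dir: str, out_dir: str,
                   extract_text: Callable[[Path], str]) -> Tuple[List[Path], List[Path]]:
    """Write one .txt file per PDF in pdf_dir; return (written, failed) paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written, failed = [], []
    for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
        try:
            text = extract_text(pdf_path)
        except Exception:
            # One unreadable PDF should not stop the remaining files.
            failed.append(pdf_path)
            continue
        txt_path = out / (pdf_path.stem + ".txt")
        # An explicit encoding avoids platform-dependent defaults.
        txt_path.write_text(text, encoding="utf-8")
        written.append(txt_path)
    return written, failed


def pdftotext_extract(pdf_path: Path) -> str:
    """Extractor built on the pdftotext package used in the answer above."""
    import pdftotext  # imported lazily so the loop has no hard dependency
    with pdf_path.open("rb") as f:
        return "\n\n".join(pdftotext.PDF(f))
```

You would call it as extract_folder('C:/Users/Lui/Desktop/Test', 'Extracted', pdftotext_extract) and then inspect the returned list of failed files.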

Related

Exception: No parsed pages. Please parse page first

I am trying to read a whole PDF file that is more than 250 pages long. To do that, I first convert the PDF to docx with the pdf2docx library.
Here is the code:
import requests
from docx import Document
from pdf2docx import Converter
from tika import parser  # assumed: parser.from_file() is the Apache Tika Python API
document = Document()
document.save('file.docx')
url = file_path #(google drive url where file was uploaded)
response = requests.get(url)
my_raw_data = response.content
with open("my_pdf.pdf", 'wb') as my_data:
    my_data.write(my_raw_data)
open_pdf_file = open("my_pdf.pdf", 'rb')
cv = Converter(open_pdf_file)
cv.convert("roshni.docx")
Parse=parser.from_file("file.docx")
data=[]
for i in (Parse['content'].strip().split('\n')):
    if len(i.split())<5:
        pass
    else:
        data.append(i)
Text=data[1:-1]
But I am not able to read the file. I get the error "Exception: No parsed pages. Please parse page first."
How can I solve this issue? How can I read a whole PDF using Python?

Convert the page data extracted from pdf file into csv file using PyPDF2

I have extracted data from a PDF file, but I am unable to convert it into a CSV file.
import PyPDF2 as PDF
PDFfile = open("path", "rb")
pdfread = PDF.PdfFileReader(PDFfile)
page = pdfread.getPage(0)  # the method is getPage, not getpage
Page_content = page.extractText()
After extracting the text from a particular page, I want to export it to a CSV file. How can I do this?
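One possible approach, sketched with the standard csv module. The helper name text_to_csv is made up, and the column rule is an assumption: each non-empty line of the extracted text becomes a row, and the line is split on whitespace. Real PDF text rarely has such clean columns, so the splitting step is the part you would adapt.

```python
import csv


def text_to_csv(page_content: str, csv_path: str) -> int:
    """Write each non-empty line of page_content as one CSV row; return the row count."""
    # Assumption: columns are separated by whitespace within a line.
    rows = [line.split() for line in page_content.splitlines() if line.strip()]
    # newline="" lets the csv module control line endings itself.
    with open(csv_path, "w", newline="", encoding="utf-8") as csv_file:
        csv.writer(csv_file).writerows(rows)
    return len(rows)
```

With the code from the question you would call text_to_csv(Page_content, "page0.csv").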

Python does not print PDF with pyPDF2

I tried to print pages of a pdf document:
import PyPDF2
FILE_PATH = 'my.pdf'
with open(FILE_PATH, mode='rb') as f:
    reader = PyPDF2.PdfFileReader(f)
    page = reader.getPage(0) # I tried also other pages e.g 1,2,..
    print(page.extractText())
But I only get a lot of blank space and no error message. Could it be that this pdf version (my.pdf) is not supported by PyPDF2?
This solved it (prints all pages of the document). Thanks
from pdfreader import SimplePDFViewer
fd = open("my.pdf", "rb")
viewer = SimplePDFViewer(fd)
for i in range(1,16): # need range from 1 - max number of pages +1
    viewer.navigate(i)
    viewer.render()
    page_1_content=viewer.canvas.text_content
    page_1_text = "".join(viewer.canvas.strings)
    print (page_1_text)

Try pdfreader
from pdfreader import SimplePDFViewer
fd = open("my.pdf", "rb")
viewer = SimplePDFViewer(fd)
viewer.render()
page_0_content=viewer.canvas.text_content
page_0_text = "".join(viewer.canvas.strings)

If the output is blank, the PDF is probably being opened, but its format can't be interpreted by PyPDF2, so extractText() just returns blank text. You could also try an absolute file path instead of a relative one. If all else fails, try different PDFs; if some of them work and yours doesn't, you may need to convert yours to the type that works.

Python: downloading xml files in batch returns a damaged zip file

Drawing inspiration from this post, I am trying to download a bunch of xml files in batch from a website:
import urllib2
url='http://ratings.food.gov.uk/open-data/'
f = urllib2.urlopen(url)
data = f.read()
# raw string so the backslashes in the Windows path are never treated as escapes
with open(r"C:\Users\MyName\Desktop\data.zip", "wb") as code:
    code.write(data)
The zip file is created within seconds, but as I attempt to access it, an error window comes up:
Windows cannot open the folder.
The Compressed (zipped) Folder "C:\Users\MyName\Desktop\data.zip" is invalid.
What am I doing wrong here?

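You are not writing the downloaded files into an actual zip archive. Collect the links to the XML files from the page, then write each download into a ZipFile: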
import urllib2
from bs4 import BeautifulSoup
import zipfile
url='http://ratings.food.gov.uk/open-data/'
fileurls = []
f = urllib2.urlopen(url)
mainpage = f.read()
soup = BeautifulSoup(mainpage, 'html.parser')
tablewrapper = soup.find(id='openDataStatic')
for table in tablewrapper.find_all('table'):
    for link in table.find_all('a'):
        fileurls.append(link['href'])
with zipfile.ZipFile("data.zip", "w") as code:
    for url in fileurls:
        print('Downloading: %s' % url)
        f = urllib2.urlopen(url)
        data = f.read()
        xmlfilename = url.rsplit('/', 1)[-1]
        code.writestr(xmlfilename, data)

Nothing in your code encodes the data as a zip file; it only saves the downloaded bytes under a .zip name. If you open the file in a plain text editor such as Notepad instead, it should show you the raw markup that was downloaded.
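To make the difference concrete, here is a small Python 3 sketch using only the standard zipfile module. The helper name save_as_zip and the file names in the usage note are made up for illustration, and no network access is involved.

```python
import zipfile


def save_as_zip(zip_path, files):
    """Write a mapping of {member_name: bytes} into a real zip archive."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, data in files.items():
            # writestr stores the bytes as a named member of the archive
            archive.writestr(name, data)
```

zipfile.is_zipfile(path) is a quick check: it returns False for a file that merely has a .zip name, like the one the question's code produces, and True for an archive written with save_as_zip("data.zip", {"a.xml": b"<a/>"}).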

PyPDF2 - merging pages from two different PDF files is not working

I'm trying to merge pages from two PDF files into a single PDF with a single page. So I tried the code below that uses PyPDF2:
from PyPDF2 import PdfFileReader,PdfFileWriter
import sys
f = sys.argv[1]
k = sys.argv[2]
print f,k
file1 = PdfFileReader(file(f, "rb"))
file2 = PdfFileReader(file(k, "rb"))
output = PdfFileWriter()
page = file1.getPage(0)
page.mergePage(file2.getPage(0))
output.addPage(page)
outputStream = file("join.pdf", "wb")
output.write(outputStream)
outputStream.close()
It produces a single file and single page with the contents of page 1 from file 1, but I don't find any data from page 1 of file2. Seems like it didn't get merged.

Using your exact code, I get the two PDFs merged onto one page, with the second one overlapping the first. I referred to this link for detailed information.
Also, the Python documentation recommends open() over file(), so I changed that.
I made a few other slight changes to your code, but it behaves the same and works correctly on my machine. I am using Ubuntu 16.04 with Python 2.7.
Here is the code:
from PyPDF2 import PdfFileReader,PdfFileWriter
import sys
f = sys.argv[1]
k = sys.argv[2]
print f, k
file1 = PdfFileReader(open(f, "rb"))
file2 = PdfFileReader(open(k, "rb"))
output = PdfFileWriter()
page = file1.getPage(0)
page.mergePage(file2.getPage(0))
output.addPage(page)
with open("join.pdf", "wb") as outputStream:
    output.write(outputStream)
I hope this helps.
UPDATE:
Here is code that works for me and merges the pages of the two PDFs into a single page.
from pyPdf import PdfFileWriter, PdfFileReader
from pdfnup import generateNup
initial_output = PdfFileWriter()
input1 = PdfFileReader(open("landscape1.pdf", "rb"))
input2 = PdfFileReader(open("landscape2.pdf", "rb"))
initial_output.addPage(input1.getPage(0))
initial_output.addPage(input2.getPage(0))
# creates a new pdf file with required pages as separate pages.
initial_output.write(file("final.pdf", "wb"))
# merges newly created pdf file pages as one.
generateNup("final.pdf", 2, "intermediate.pdf")
# overwrite and rotates the final.pdf
final_output = PdfFileWriter()
final_output.addPage(PdfFileReader(open("intermediate.pdf", "rb")).getPage(0).rotateClockwise(90))
final_output.write(open("final.pdf", "wb"))
I have added new code that also rotates the final PDF. The output PDF you need is final.pdf.
Here is the Google Drive link to the PDF files. I also made slight changes to pdfnup.py (for Immutableset) so that it is compatible with my system; if you want to use the same file, you can find it at the drive link above as well.

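import PyPDF2
# method of a class (hence self); merges the input PDFs one after another
def merge_page(self, output_pdf, *input_pdfs):
    a = len(input_pdfs)
    print(a)
    # check before opening the output so a failed call leaves no empty file behind
    if a < 2:
        raise Exception("Need at least two PDFs for merging")
    merge = PyPDF2.PdfFileMerger()
    for x in input_pdfs:
        merge.append(open(x, "rb"))
    with open(output_pdf, "wb") as outputStream:
        merge.write(outputStream)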
This code works for me in PyCharm. It can merge any number of PDF files into a single PDF file, but you must pass at least two; passing fewer raises an error.
