Translate a PDF file while keeping the formatting in Python - python

I am trying to translate PDF files using a translation API and output the result as a PDF while keeping the format the same. My approach is to convert the PDF to a Word document, translate that file, and then convert it back to PDF. The problem is that there is no efficient way to convert a PDF to Word. I am trying to write my own program, but the PDFs come in many different layouts, so handling all of them will take considerable effort. So my question is: is there an efficient way to translate these PDFs without losing the formatting, or an efficient way to convert them to docx? I am using Python as the programming language.

Probably not.
PDFs aren't meant to be machine readable or editable, really; they describe formatted, laid-out, printable pages.

You can use pdfminer instead of an API. Here is an example:
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO
def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = open(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos = set()
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password, caching=caching, check_extractable=True):
        interpreter.process_page(page)
    text = retstr.getvalue()
    fp.close()
    device.close()
    retstr.close()
    return text
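To connect this back to the translation question, here is a minimal sketch of how extracted text might be handed to a translation step page by page. It assumes the extracted string separates pages with a form feed character (which pdfminer's TextConverter appears to emit after each page) and that translate is whatever callable wraps your translation API. The names split_pages, translate_pages and fake_translate are hypothetical and not part of pdfminer. Note that this only carries the text across; the original layout is not preserved.

```python
def split_pages(text):
    """Split extracted text on form feeds and drop empty trailing pieces.

    Assumption: the extractor writes "\f" at the end of every page.
    """
    return [page for page in text.split("\f") if page.strip()]


def translate_pages(text, translate):
    """Apply a translation callable to each page and return the results in order."""
    return [translate(page) for page in split_pages(text)]


def fake_translate(page_text):
    # Hypothetical stand-in for a real translation API call.
    return page_text.upper()


# Sample input shaped like two extracted pages.
sample = "first page\n\fsecond page\n\f"
translated = translate_pages(sample, fake_translate)
print(translated)
```

Working per page keeps each request small, which may matter if the translation service limits the request size.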

Related

PDFMiner does not parse more than 1 page

I'm using PDFMiner6 with Python 3.5. It's far better than PyPDF2 (slower, but more accurate and doesn't spit out a bunch of letters that are not separated by spaces). I tried to parse this document:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2963791/
(You can download the PDF free from the NIH website).
I used this code (it's part of a larger spider, but the rest of the code is not relevant to this question):
import io
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
class PDFMiner6(object):
    def __init__(self):
        pass
    def PdfFileReader(self, fp):
        text = []
        rsrcmgr = PDFResourceManager()
        retstr = io.StringIO()
        codec = 'utf-8'
        laparams = LAParams()
        device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        password = ""
        maxpages = 0
        caching = True
        pagenos = set()
        for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password, caching=caching, check_extractable=True):
            interpreter.process_page(page)
            output = retstr.getvalue()
            text.append(output)
        fp.close()
        device.close()
        retstr.close()
        return text
It parses the first page perfectly, then stops. The rest of the document is not parsed.
I tested the same document using PyPDF2, it parses the entire document but outputs garbage without any spaces (hence I switched over to PDFMiner6). So I'm sure that it's not that the entire document is not being read but something wrong with the code that's parsing it. What is wrong?
EDIT: I went ahead and tested it on different PDF files with varying results - it parses some completely, whereas others, it stops at the first page. This is frustrating, as PDFMiner6 is a better parser compared to PyPDF2.
Could anybody help?
Make sure that the PDF is being opened by a PDF viewer and not a web browser. I had the same issue and this is how I fixed it.
It looks like pdfminer treats a PDF that is being opened by a web browser as a single page, so you need to make sure it is opened by a PDF viewer for pdfminer to recognize that it has more than one page.

Web scraping pdf using URL Python 3 TypeError

I am trying to write code that downloads a PDF from a URL. I found a method of doing this, but it was not written in Python 3 and used the file() function.
I tried replacing this with open() in the line fp = open(path, 'rb').
However, I get this error:
TypeError: expected str, bytes or os.PathLike object, not HTTPResponse.
I can't find a solution online. Any help would be appreciated. Here is the code:
import bs4 as bs
import urllib.request
from urllib.request import urlopen
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.pdfpage import PDFPage
from pdfminer.layout import LAParams
from io import StringIO
from io import open
def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = open(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos = set()
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password, caching=caching, check_extractable=True):
        interpreter.process_page(page)
    fp.close()
    device.close()
    stri = retstr.getvalue()
    retstr.close()
    return stri
pdfFile = urlopen("http://pythonscraping.com/pages/warandpeace/chapter1.pdf")
outputString = convert_pdf_to_txt(pdfFile)
print(outputString)
pdfFile.close()
Resources used
http://zempirians.com/ebooks/Ryan%20Mitchell-Web%20Scraping%20with%20Python_%20Collecting%20Data%20from%20the%20Modern%20Web-O'Reilly%20Media%20(2015).pdf
(page 101)
Extracting text from a PDF file using PDFMiner in python?
(the top answer)
Rather than struggle with an obsolescent version of pdfminer, I'd advise using pdfminer.six which is a more recent fork of the pdfminer library that's compatible with Python 3.
pip install pdfminer.six
You'll have to edit some of the import statements, but for the most part, the newer fork is a drop-in replacement.
So, now, after reading the body of the HTTP response (as per Adrian Tam's advice), you've got a PDF object. You can then call your conversion method with that object as a parameter:
from io import BytesIO, StringIO
def convert_pdf_to_txt(pdf_obj):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    fp = BytesIO(pdf_obj) # get a file-like binary object
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos = set()
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password, caching=caching, check_extractable=True):
        interpreter.process_page(page)
    fp.close()
    device.close()
    stri = retstr.getvalue()
    retstr.close()
    print(stri)
Do this (you need to get bytes from an HTTP response object):
pdfResponse = urlopen("http://pythonscraping.com/pages/warandpeace/chapter1.pdf")
outputString = convert_pdf_to_txt(pdfResponse.read())
See https://docs.python.org/3/library/http.client.html#httpresponse-objects
But then you have to modify your convert_pdf_to_txt function to take raw bytes as input instead of a file object, i.e., instead of
def convert_pdf_to_txt(path):
    fp = open(path, 'rb')
    ...
    for page in PDFPage.get_pages(fp, ...)
you have to do:
def convert_pdf_to_txt(rawbytes):
    import io
    fp = io.BytesIO(rawbytes)
    ...
    for page in PDFPage.get_pages(fp, ...)
io.BytesIO wraps bytes in a file-like binary stream (https://docs.python.org/3/library/io.html#binary-i-o), so you can afterwards treat the data as if it were a file.
I haven't worked with this PDF library before, but this should be a starting point.
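To see what io.BytesIO gives you, here is a small standard-library sketch. The raw_bytes value is a made-up stand-in for what HTTPResponse.read() would return (it is not a valid PDF), and the header check is only an illustration, not something pdfminer requires.

```python
import io

# Made-up stand-in for the bytes returned by HTTPResponse.read(); not a real PDF.
raw_bytes = b"%PDF-1.4\nexample body\n"

# Wrap the bytes so they can be read and seeked like a file opened in 'rb' mode.
fp = io.BytesIO(raw_bytes)

header = fp.read(5)                  # read the first five bytes
looks_like_pdf = header == b"%PDF-"  # PDF files start with this marker

fp.seek(0)                           # rewind so a parser would start at byte 0
print(looks_like_pdf, fp.tell())
```

Because the object supports read() and seek(), it can be passed anywhere a binary file object is expected, which is why it works as the fp argument in the function above.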

Write html tags to text file in python

I've used pdfminer to convert complex (tables, figures) and very long PDFs to HTML. I want to parse the results further (e.g. extract tables, paragraphs etc.) and then use the sentence tokenizer from nltk to do further analysis. For this purpose I want to save the HTML to a text file to figure out how to do the parsing. Unfortunately my code does not write the HTML to the txt file:
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import HTMLConverter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from cStringIO import StringIO
def convert_pdf_to_html(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = HTMLConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = file(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0 #is for all
    caching = True
    pagenos = set()
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password, caching=caching, check_extractable=True):
        interpreter.process_page(page)
    fp.close()
    device.close()
    str1 = retstr.getvalue()
    retstr.close()
    return str1
    with open("D:/my_new_file.txt", "wb") as fh:
        fh.write(str1)
Besides, the code prints the whole html string in the shell: how can I avoid it?
First, unless there's a trivial error in the post,
the write to the .txt file comes after the return statement, so it is never executed!
Then, to suppress output to the console, do this before running your routine:
import sys, os
oldstdout = sys.stdout # save to be able to restore it later
sys.stdout = open(os.devnull, 'w') # os.devnull is only a path string, so it has to be opened
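On Python 3.4 or later, contextlib.redirect_stdout may be a less invasive way to silence the output, since it restores sys.stdout automatically when the block ends. A minimal sketch under stated assumptions: convert_to_html_stub is a hypothetical stand-in for the real converter, and the output path is a temporary directory rather than the D:/ path from the question.

```python
import contextlib
import os
import tempfile


def convert_to_html_stub():
    # Hypothetical stand-in for a converter that prints as a side effect.
    print("<html>noisy console output</html>")
    return "<html><body>table</body></html>"


# Anything printed inside this block goes to the null device.
with open(os.devnull, "w") as devnull, contextlib.redirect_stdout(devnull):
    html = convert_to_html_stub()

# Write the result after the function has returned, not inside it after 'return'.
out_path = os.path.join(tempfile.mkdtemp(), "my_new_file.txt")
with open(out_path, "w", encoding="utf-8") as fh:
    fh.write(html)
```

The file write happens at the call site, so it is no longer unreachable code behind the return statement.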

Python, scrapy: pdf to text conversion: no error while running code, but don't seem to be generating any output

I am quite a newbie to python, scrapy and web scraping, so my question might seem naive. Apologies for that upfront.
I want to extract data from a PDF file using scrapy. There are several questions on Stack Overflow on this subject, which I looked up, and I copied the following code from one of the answers given. However, I am unable to see any output. I used the print function in the code directly to see the output and tried writing the return value to an Excel file, but that is not showing any output either. I am not getting any error either.
The code I am using is as follows:
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from cStringIO import StringIO
def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = file(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 2
    caching = True
    pagenos = set()
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password, caching=caching, check_extractable=True):
        interpreter.process_page(page)
    strval = retstr.getvalue()
    print strval
    fp.close()
    device.close()
    retstr.close()
    return strval
Can anyone guide me where am I going wrong?
Thanks!
Tuhina

Extracting text from a PDF file using PDFMiner in python?

I am looking for documentation or examples on how to extract text from a PDF file using PDFMiner with Python.
It looks like PDFMiner updated their API and all the relevant examples I have found contain outdated code (classes and methods have changed). The libraries I have found that make the task of extracting text from a PDF file easier are using the old PDFMiner syntax, so I'm not sure how to do this.
As it is, I'm just looking at the source code to see if I can figure it out.
Here is a working example of extracting text from a PDF file using the current version of PDFMiner (September 2016):
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO
def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = open(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos = set()
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password, caching=caching, check_extractable=True):
        interpreter.process_page(page)
    text = retstr.getvalue()
    fp.close()
    device.close()
    retstr.close()
    return text
PDFMiner's structure changed recently, so this should work for extracting text from PDF files.
Edit: Still working as of June 7th, 2018. Verified in Python 3.x.
Edit: The solution works with Python 3.7 as of October 3, 2019. I used the Python library pdfminer.six, released in November 2018.
This works in May 2020 using pdfminer.six in Python 3.
Installing the package
$ pip install pdfminer.six
Importing the package
from pdfminer.high_level import extract_text
Using a PDF saved on disk
text = extract_text('report.pdf')
Or alternatively:
with open('report.pdf', 'rb') as f:
    text = extract_text(f)
Using a PDF already in memory
If the PDF is already in memory, for example if retrieved from the web with the requests library, it can be converted to a stream using the io library:
import io
import requests
response = requests.get(url) # url is the address of the PDF
text = extract_text(io.BytesIO(response.content))
Performance and Reliability compared with PyPDF2
PDFminer.six works more reliably than PyPDF2 (which fails with certain types of PDFs), in particular PDF version 1.7.
However, text extraction with PDFminer.six is slower than PyPDF2 by a factor of about 6.
I timed text extraction with timeit on a 15" MBP (2018), timing only the extraction function (no file opening etc.) with a 10 page PDF and got the following results:
PDFminer.six: 2.88 sec
PyPDF2: 0.45 sec
pdfminer.six also has a huge footprint: it requires pycryptodome, which needs GCC and other things installed, pushing a minimal Docker image on Alpine Linux from 80 MB to 350 MB. PyPDF2 has no noticeable storage impact.
Update (2022-08-04): According to Martin Thoma, PyPDF2 has improved a lot in the past 2 years, so do give it a try as well. Here's his benchmark
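For anyone who wants to repeat the comparison on their own files, the timing can be sketched with timeit roughly as follows. This is not the exact script behind the numbers above; extract_stub is a hypothetical placeholder that you would replace with a call to the extraction function you want to measure, with the file already opened outside the timed code.

```python
import timeit


def extract_stub():
    # Hypothetical placeholder workload; swap in the real extraction call.
    return "".join(str(i) for i in range(1000))


runs = 5
# timeit calls the function 'runs' times and returns the total elapsed seconds.
total = timeit.timeit(extract_stub, number=runs)
per_call = total / runs
print(f"{per_call:.6f} s per call")
```

Averaging over several runs smooths out one-off effects such as disk caching, although results will still vary with the machine and the PDF.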
terrific answer from DuckPuncher, for Python3 make sure you install pdfminer2 and do:
import io
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = io.StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = open(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos = set()
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages,
                                  password=password,
                                  caching=caching,
                                  check_extractable=True):
        interpreter.process_page(page)
    fp.close()
    device.close()
    text = retstr.getvalue()
    retstr.close()
    return text
Full disclosure, I am one of the maintainers of pdfminer.six. It is a community-maintained version of pdfminer for python 3.
Nowadays, it has multiple APIs to extract text from a PDF, depending on your needs. Behind the scenes, all of these APIs use the same logic for parsing and analyzing the layout.
(All the examples assume your PDF file is called example.pdf)
Commandline
If you want to extract text just once you can use the commandline tool pdf2txt.py:
$ pdf2txt.py example.pdf
High-level api
If you want to extract text (properties) with Python, you can use the high-level api. This approach is the go-to solution if you want to programmatically extract information from a PDF.
from pdfminer.high_level import extract_text, extract_pages
# Extract text from a pdf.
text = extract_text('example.pdf')
# Extract iterable of LTPage objects.
pages = extract_pages('example.pdf')
Composable api
There is also a composable api that gives a lot of flexibility in handling the resulting objects. For example, it allows you to create your own layout algorithm. This method is suggested in the other answers, but I would only recommend this when you need to customize some component.
from io import StringIO
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser
output_string = StringIO()
with open('example.pdf', 'rb') as in_file:
    parser = PDFParser(in_file)
    doc = PDFDocument(parser)
    rsrcmgr = PDFResourceManager()
    device = TextConverter(rsrcmgr, output_string, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in PDFPage.create_pages(doc):
        interpreter.process_page(page)
print(output_string.getvalue())
Similar question and answer here. I'll try to keep them in sync.
This code is tested with pdfminer for Python 3 (pdfminer-20191125):
from pdfminer.layout import LAParams
from pdfminer.converter import PDFPageAggregator
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.layout import LTTextBoxHorizontal
def parsedocument(document):
    # convert all horizontal text into a lines list (one entry per line)
    # document is a file stream
    lines = []
    rsrcmgr = PDFResourceManager()
    laparams = LAParams()
    device = PDFPageAggregator(rsrcmgr, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in PDFPage.get_pages(document):
        interpreter.process_page(page)
        layout = device.get_result()
        for element in layout:
            if isinstance(element, LTTextBoxHorizontal):
                lines.extend(element.get_text().splitlines())
    return lines
I realize that this is an old question.
For anyone trying to use pdfminer, you should switch to pdfminer.six which is the currently maintained version.
