Python - How to convert many separate PDFs to text?

Question: How can I read in many PDFs in the same path using Python package "slate"?
I have a folder with over 600 PDFs.
I know how to use the slate package to convert single PDFs to text, using this code:
migFiles = [filename for filename in os.listdir(path)
            if re.search(r'(.*\.pdf$)', filename) is not None]
with open(migFiles[0]) as f:
    doc = slate.PDF(f)
len(doc)
However, this limits you to one PDF at a time, specified by "migFiles[0]" - 0 being the index of the first PDF found in the path.
How can I read in many PDFs to text at once, retaining them as separate strings or txt files? Should I use another package? How could I create a "for loop" to read in all the PDFs in the path?

Try this version:
import glob
import os
import slate

for pdf_file in glob.glob(os.path.join(path, "*.pdf")):
    with open(pdf_file, 'rb') as pdf:
        txt_file = "{}.txt".format(os.path.splitext(pdf_file)[0])
        with open(txt_file, 'w') as txt:
            # slate.PDF returns a list of per-page strings, so join them
            txt.write(" ".join(slate.PDF(pdf)))
This will create a text file with the same name as the PDF, in the same directory as the PDF file, containing the converted contents.
Or, if you want to save the contents - try this version; but keep in mind if the translated content is large you may exhaust your available memory:
import glob
import os
import slate

pdf_as_text = {}
for pdf_file in glob.glob(os.path.join(path, "*.pdf")):
    with open(pdf_file, 'rb') as pdf:
        file_without_extension = os.path.splitext(pdf_file)[0]
        pdf_as_text[file_without_extension] = slate.PDF(pdf)
Now you can use pdf_as_text['somefile'] to get the text contents.
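The same dictionary-building pattern can be sketched with the slate call factored out into a parameter, so the idea runs without the slate package installed; the helper name and the `*.txt` pattern used for demonstration are mine, not part of the answer above:

```python
import glob
import os

def texts_by_stem(folder, pattern, extract):
    """Map each matching file's path-minus-extension to extract(filehandle)."""
    result = {}
    for path in glob.glob(os.path.join(folder, pattern)):
        with open(path) as fh:
            result[os.path.splitext(path)[0]] = extract(fh)
    return result

# With slate installed, the equivalent of the answer above would be roughly:
# pdf_as_text = texts_by_stem(path, "*.pdf", lambda fh: slate.PDF(fh))
```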

What you can do is use a simple loop:
docs = []
for filename in migFiles:
    with open(filename, 'rb') as f:
        docs.append(slate.PDF(f))
        # or, instead of keeping the text in memory, just process it now
Then, docs[i] will hold the text of the (i+1)-th PDF file, and you can process it whenever you want. Alternatively, you can process each file inside the for loop.
If you want to convert to text, you can do:
docs = []
separator = ' '  # The character used to separate the contents of consecutive
                 # pages; to separate each page's contents with a newline,
                 # use separator = '\n'
for filename in migFiles:
    with open(filename, 'rb') as f:
        docs.append(separator.join(slate.PDF(f)))  # turn the pages into plain text
or
separator = ' '
for filename in migFiles:
    with open(filename, 'rb') as f:
        # if filename = "abc.pdf", then filename[:-4] = "abc"
        with open(filename[:-4] + ".txt", 'w') as txtfile:
            txtfile.write(separator.join(slate.PDF(f)))

Related

storing text into string buffer from multiple files using python

I want to extract text from multiple text files; the idea is that I have a folder and all the text files are in that folder.
I have tried and successfully got the text, but when I use that string buffer somewhere else, only the first text file's text is visible to me.
I want to store all of these texts in a single string buffer.
What I have done:
import glob
import io

Raw_txt = " "
files = [file for file in glob.glob(r'C:\\Users\\Hp\\Desktop\\RAW\\*.txt')]
for file_name in files:
    with io.open(file_name, 'r') as image_file:
        content1 = image_file.read()
        Raw_txt = content1
print(Raw_txt)
This Raw_txt buffer only works inside the loop, but I want to use it somewhere else.
Thanks!
I think the issue is related to where you load the content of your text files.
Raw_txt is overwritten with each file.
I would recommend doing something like this, where the text is appended:
import glob

Raw_txt = ""
files = [file for file in glob.glob(r'C:\\Users\\Hp\\Desktop\\RAW\\*.txt')]
for file_name in files:
    with open(file_name, "r+") as file:
        # A newline is added in case you want to separate the content of each file
        Raw_txt += file.read() + "\n"
print(Raw_txt)
Also, you don't need the io module to read a text file; the built-in open is enough.
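As a stdlib alternative, the accumulate-and-join can be done in one expression with pathlib; this is a sketch, and the folder path below is just a placeholder:

```python
from pathlib import Path

def combine_texts(folder, pattern="*.txt", sep="\n"):
    # Concatenate the contents of every matching file, in sorted order.
    return sep.join(p.read_text() for p in sorted(Path(folder).glob(pattern)))

# e.g. Raw_txt = combine_texts(r'C:\Users\Hp\Desktop\RAW')
```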

Is there a way of using Python to change an image in multiple documents?

My company is undergoing a re-brand and as such the logo has changed. We have hundreds, if not thousands of word and excel docs that have the old logo. Can I use python to find and replace the image?
I found code (below) that would do this for text (replacing "Adber" with "Abder"); can it be modified to change images?
import os, re

directory = os.listdir('/')
os.chdir('/Users/')
for file in directory:
    open_file = open(file, 'r')
    read_file = open_file.read()
    regex = re.compile('Adber')
    read_file = regex.sub('Abder', read_file)
    write_file = open(file, 'w')
    write_file.write(read_file)
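This one went unanswered here, but since .docx and .xlsx files are ZIP archives, one possible sketch with the standard zipfile module rewrites each document with a single image member replaced. The member name word/media/image1.png below is an assumption; open a copy of one of your documents as a ZIP first to find the real path of the logo:

```python
import shutil
import zipfile

def replace_zip_member(doc_path, member, new_bytes):
    """Rewrite the archive at doc_path with `member` replaced by new_bytes."""
    tmp_path = doc_path + ".tmp"
    with zipfile.ZipFile(doc_path) as zin, \
         zipfile.ZipFile(tmp_path, "w", zipfile.ZIP_DEFLATED) as zout:
        for item in zin.infolist():
            if item.filename == member:
                zout.writestr(item.filename, new_bytes)  # swap in the new logo
            else:
                zout.writestr(item, zin.read(item.filename))
    shutil.move(tmp_path, doc_path)

# Hypothetical usage over a folder of .docx files:
# for name in os.listdir(folder):
#     if name.endswith('.docx'):
#         with open('new_logo.png', 'rb') as img:
#             replace_zip_member(os.path.join(folder, name),
#                                'word/media/image1.png', img.read())
```

Work on copies first: this swaps raw bytes and does not update image dimensions stored in the document XML.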

Creating empty files when I rename them

I'm learning Python and also English. I have a problem that might be easy, but I can't solve it.
I have a folder of .txt files. I was able to extract, with a regular expression, a sequence of 17 numbers from each one. I need to rename each file with the sequence extracted from its .txt.
import os
import re

path_txt = (r'C:\Users\usuario\Desktop\files')
name_files = os.listdir(path_txt)
for TXT in name_files:
    with open(path_txt + '\\' + TXT, "r") as content:
        search = re.search(r'(\d{5}\.?\d{4}\.?\d{3}\.?\d{2}\.?\d{2}\-?\d)', content.read())
        if search is not None:
            print(search.group(0))
            f = open(os.path.join("Processes", search.group(0) + ".txt"), "w")
            for line in content:
                print(line)
                f.write(line)
            f.close()
It's creating empty .txt files in the "Processes" folder, but named the way I need them.
PS: using Python 3
You are not renaming the files; instead, you are opening a new file in write mode, which creates it if it does not already exist. It also stays empty because content.read() has already consumed the input file, so the for line in content loop has nothing left to yield.
Instead you want to rename the file:
# search the file for the desired text
with open(os.path.join(path_txt, TXT), "r") as content:
    search = re.search(r'(\d{5}\.?\d{4}\.?\d{3}\.?\d{2}\.?\d{2}\-?\d)', content.read())

# we do this *outside* the `with` so the file is closed before the rename
if search is not None:
    os.rename(os.path.join(path_txt, TXT),
              os.path.join("Processes", search.group(0) + ".txt"))
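The matching step can be pulled out into a small helper so it is testable on its own; the helper name is mine, not from the answer:

```python
import re

# Same 17-digit pattern as in the question: groups of 5.4.3.2.2-1 digits,
# with the punctuation optional.
PATTERN = re.compile(r'\d{5}\.?\d{4}\.?\d{3}\.?\d{2}\.?\d{2}-?\d')

def target_name(text):
    """Return '<matched sequence>.txt', or None when no sequence is found."""
    m = PATTERN.search(text)
    return m.group(0) + ".txt" if m else None
```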

How to extract text from a directory of PDF files efficiently with OCR?

I have a large directory with PDF files (images); how can I efficiently extract the text from all the files inside the directory? So far I tried:
import multiprocessing
import textract

def extract_txt(file_path):
    text = textract.process(file_path, method='tesseract')

p = multiprocessing.Pool(2)
file_path = ['/Users/user/Desktop/sample.pdf']
list(p.map(extract_txt, file_path))
However, it is not working... it takes a lot of time (some documents have 600 pages). Additionally: a) I do not know how to handle the directory-transformation part efficiently. b) I would like to add a page separator, say <start/page = 1> ... page content ... <end/page = 1>, but I have no idea how to do this.
Thus, how can I apply the extract_txt function to all the elements of a directory that end with .pdf, save the same files in another directory in .txt format, and add a page separator to the OCR text extraction?
Also, I was curious about using Google Docs for this task: is it possible to programmatically use Google Docs to solve the aforementioned text-extraction problem?
UPDATE
Regarding the "adding a page separator" issue (<start/page = 1> ... page content ... <end/page = 1>), after reading Roland Smith's answer I tried:
import os
from PyPDF2 import PdfFileWriter, PdfFileReader
import textract

def extract_text(pdf_file):
    inputpdf = PdfFileReader(open(pdf_file, "rb"))
    for i in range(inputpdf.numPages):
        w = PdfFileWriter()
        w.addPage(inputpdf.getPage(i))
        outfname = 'page{:03d}.pdf'.format(i)
        with open(outfname, 'wb') as outfile:  # I presume you need `wb`.
            w.write(outfile)
        print('\n<begin page pos =', i, '>\n')
        text = textract.process(outfname, method='tesseract')
        os.remove(outfname)  # clean up.
        print(str(text, 'utf8'))
        print('\n<end page pos =', i, '>\n')

extract_text('/Users/user/Downloads/ImageOnly.pdf')
However, I still have issues with the print() part; instead of printing, it would be more useful to save all the output into a file. Thus, I tried to redirect the output to a file:
sys.stdout=open("test.txt","w")
print('\n<begin page pos =' , i, '>\n')
sys.stdout.close()
text = textract.process(str(outfname), method='tesseract')
os.remove(outfname) # clean up.
sys.stdout=open("test.txt","w")
print(str(text, 'utf8'))
sys.stdout.close()
sys.stdout=open("test.txt","w")
print('\n<end page pos =' , i, '>\n')
sys.stdout.close()
Any idea how to do the page extraction/separator trick and save everything into a file?
In your code, you are extracting the text, but you don't do anything with it.
Try something like this:
def extract_txt(file_path):
    text = textract.process(file_path, method='tesseract')
    outfn = file_path[:-4] + '.txt'  # assuming filenames end with '.pdf'
    with open(outfn, 'wb') as output_file:
        output_file.write(text)
    return file_path
This writes the text to a file that has the same name but a .txt extension.
It also returns the path of the original file to let the parent know that this file is done.
So I would change the mapping code to:
p = multiprocessing.Pool()
file_path = ['/Users/user/Desktop/sample.pdf']
for fn in p.imap_unordered(extract_txt, file_path):
    print('completed file:', fn)
You don't need to give an argument when creating a Pool. By default it will create as many workers as there are cpu-cores.
Using imap_unordered creates an iterator that starts yielding values as soon as they are available.
Because the worker function returned the filename, you can print it to let the user know that this file is done.
Edit 1:
The additional question is if it is possible to mark page boundaries. I think it is.
A method that would surely work is to split the PDF file into pages before the OCR. You could use e.g. pdfinfo from the poppler-utils package to find out the number of pages in a document. And then you could use e.g. pdfseparate from the same poppler-utils package to convert that one pdf file of N pages into N pdf files of one page. You could then OCR the single page PDF files separately. That would give you the text on each page separately.
Alternatively you could OCR the whole document and then search for page breaks. This will only work if the document has a constant or predictable header or footer on every page. It is probably not as reliable as the abovementioned method.
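A sketch of the pdfinfo route described above, assuming poppler-utils is installed: run pdfinfo via subprocess and parse the "Pages:" line it prints (the parsing is split into its own function so it works without poppler):

```python
import re
import subprocess

def parse_page_count(pdfinfo_output):
    """Extract N from the 'Pages: N' line that pdfinfo prints."""
    m = re.search(r'^Pages:\s+(\d+)', pdfinfo_output, re.MULTILINE)
    return int(m.group(1)) if m else None

def page_count(pdf_path):
    # Requires poppler-utils; check=True raises if pdfinfo fails.
    out = subprocess.run(['pdfinfo', pdf_path], check=True,
                         capture_output=True, text=True).stdout
    return parse_page_count(out)

# pdfseparate then yields one single-page file per page, e.g.:
# subprocess.run(['pdfseparate', 'doc.pdf', 'page-%d.pdf'], check=True)
```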
Edit 2:
If you need a file, write a file:
import os
from PyPDF2 import PdfFileWriter, PdfFileReader
import textract

def extract_text(pdf_file):
    inputpdf = PdfFileReader(open(pdf_file, "rb"))
    outfname = pdf_file[:-4] + '.txt'  # Assuming the PDF file name ends with ".pdf"
    with open(outfname, 'w') as textfile:
        for i in range(inputpdf.numPages):
            w = PdfFileWriter()
            w.addPage(inputpdf.getPage(i))
            pagefname = 'page{:03d}.pdf'.format(i)
            with open(pagefname, 'wb') as outfile:  # I presume you need `wb`.
                w.write(outfile)
            print('page', i)
            # textract returns bytes, so decode before concatenating with str.
            text = textract.process(pagefname, method='tesseract').decode('utf8')
            # Add header and footer.
            text = '\n<begin page pos = {}>\n'.format(i) + text + '\n<end page pos = {}>\n'.format(i)
            # Write the OCR-ed text to the output file.
            textfile.write(text)
            os.remove(pagefname)  # clean up.
            print(text)
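The header/footer step can be isolated as a tiny helper (the name is mine) if you want to check the marker format on its own:

```python
def wrap_page(text, page_no):
    # Reproduce the per-page markers used in the answer above.
    return '\n<begin page pos = {}>\n{}\n<end page pos = {}>\n'.format(
        page_no, text, page_no)
```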

Python blank txt file creation

I am trying to create bulk text files based on a list. A text file has a number of lines/titles, and the aim is to create one text file per title. Following is how my titles.txt looks, along with my non-working code and the expected output.
titles = open("C:\\Dropbox\\Python\\titles.txt", 'r')
for lines in titles.readlines():
    d_path = 'C:\\titles'
    output = open((d_path.lines.strip()) + '.txt', 'a')
    output.close()
titles.close()
titles.txt
Title-A
Title-B
Title-C
New blank files to be created under the directory C:\\titles\\:
Title-A.txt
Title-B.txt
Title-C.txt
It's a little difficult to tell what you're attempting here, but hopefully this will be helpful:
import os.path

with open('titles.txt') as f:
    for line in f:
        newfile = os.path.join('C:\\titles', line.strip()) + '.txt'
        ff = open(newfile, 'a')
        ff.close()
If you want to replace existing files with blank files, you can open your files with mode 'w' instead of 'a'.
The following should work.
import os

titles = 'C:/Dropbox/Python/titles.txt'
d_path = 'c:/titles'
with open(titles, 'r') as f:
    for l in f:
        with open(os.path.join(d_path, l.strip() + '.txt'), 'w') as _:
            pass
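A pathlib variant of the same idea, as a sketch: Path.touch also creates empty files, and mkdir avoids failing when the target directory doesn't exist yet.

```python
from pathlib import Path

def create_blank_files(titles_file, dest_dir):
    """Create one empty .txt per non-blank line of titles_file; return the names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    created = []
    for line in Path(titles_file).read_text().splitlines():
        name = line.strip()
        if name:
            (dest / (name + '.txt')).touch()  # creates the file empty
            created.append(name + '.txt')
    return created
```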
