I am trying to rename a list of PDF files by extracting the name from each file using pyPdf. I tried to use a for loop to rename the files, but I always get an error with code 32 saying that the file is being used by another process. I am using Python 2.7.
Here's my code:
import os, glob
from pyPdf import PdfFileWriter, PdfFileReader
# this function extracts the name of the file
def getName(filepath):
    output = PdfFileWriter()
    input = PdfFileReader(file(filepath, "rb"))
    output.addPage(input.getPage(0))
    outputStream = file(filepath + '.txt', 'w')
    output.write(outputStream)
    outputStream.close()
    outText = open(filepath + '.txt', 'rb')
    textString = outText.read()
    outText.close()
    nameStart = textString.find('default">')
    nameEnd = textString.find('_SATB', nameStart)
    nameEnd2 = textString.find('</rdf:li>', nameStart)
    if nameStart:
        testName = textString[nameStart+9:nameEnd]
        if len(testName) <= 100:
            name = testName + '.pdf'
        else:
            name = textString[nameStart+9:nameEnd2] + '.pdf'
    return name
pdfFiles = glob.glob('*.pdf')
m = len(pdfFiles)
for each in pdfFiles:
    newName = getName(each)
    os.rename(each, newName)
Consider using Python's with statement. With it, you do not need to close the file yourself:
def getName(filepath):
    output = PdfFileWriter()
    with file(filepath, "rb") as pdfFile:
        input = PdfFileReader(pdfFile)
        ...
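A fuller sketch of the same idea, written with only the standard library so that it runs on Python 3 without pyPdf. The helper name rename_after_reading and the choose_name callback are hypothetical stand-ins for the name-extraction logic in the question; the only point is that the handle is closed before os.rename runs.

```python
import os


def rename_after_reading(filepath, choose_name):
    """Read a file, close it, then rename it.

    choose_name is a hypothetical callback that maps the file's raw
    bytes to the new file name (for example by parsing PDF metadata).
    """
    # The with block closes the handle on exit, even if reading fails.
    with open(filepath, "rb") as pdf_file:
        data = pdf_file.read()

    # The handle is closed at this point, so the rename should no longer
    # collide with our own open file.
    new_name = choose_name(data)
    new_path = os.path.join(os.path.dirname(filepath), new_name)
    os.rename(filepath, new_path)
    return new_path
```

Everything that needs the open file happens inside the with block; the rename happens after it.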
You're not closing the input stream (the file) used by the pdf reader.
Thus, when you try to rename the file, it's still open.
So, instead of this:
input = PdfFileReader(file(filepath, "rb"))
Try this:
inputStream = file(filepath, "rb")
input = PdfFileReader(inputStream)
(... when done with this file...)
inputStream.close()
It does not look like you close the file object associated with the PDF reader object. It may be closed automatically at the end of the function, but to be sure you might want to create a separate file object, pass it to PdfFileReader, and close the file handle when you are done. Then rename.
The below was from SO: How to close pyPDF "PdfFileReader" Class file handle
import os.path
from pyPdf import PdfFileReader
fname = 'my.pdf'
fh = file(fname, "rb")
input = PdfFileReader(fh)
fh.close()
os.rename(fname, 'my_renamed.pdf')
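Separately from the open handle, the `if nameStart:` test in the question may not do what was intended: str.find returns -1 when the marker is missing, and -1 is truthy, while a match at index 0 is falsy. A minimal sketch of a stricter extraction follows. The function name extract_name is hypothetical, the markers are the ones from the question, and the length limit is applied to both end markers, which is a small departure from the original logic.

```python
def extract_name(text, start_marker='default">',
                 end_markers=('_SATB', '</rdf:li>'), max_len=100):
    """Return the text between start_marker and the first usable end marker, or None."""
    start = text.find(start_marker)
    if start == -1:  # find() signals "not found" with -1, not with 0
        return None
    start += len(start_marker)

    for marker in end_markers:
        end = text.find(marker, start)
        # Accept the first end marker that exists and yields a short enough name.
        if end != -1 and end - start <= max_len:
            return text[start:end] + '.pdf'
    return None
```

Returning None for the "no name found" case lets the caller skip the rename instead of failing on an undefined name.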
I am using the code below to get any free journal PDFs from PubMed. It does download something, but when I look at the file it just consists of the number 1. Any ideas on where I am going wrong? Thank you.
import metapub
from urllib.request import urlretrieve
import textract
from pathlib import Path
another_path='/content/Articles/'
pmid_list=['35566889','33538053', '30848212']
for i in range(len(pmid_list)):
    query=pmid_list[i]
    #for ind in pmid_df.index:
    #    query= pmid_df['PMID'][ind]
    url = metapub.FindIt(query).url
    try:
        urlretrieve(url)
        file_name = query
        out_file = another_path + file_name
        with open(out_file, "w") as textfile:
            textfile.write(textract.process(out_file, extension='pdf', method='pdftotext', encoding="utf_8"))
    except:
        continue
I see two mistakes.
First: urlretrieve(url) saves the data in a temporary file with a random filename, and the code discards the return value that holds that filename, so it cannot find the file afterwards. Pass a second argument to save the download under a name of your own.
urlretrieve(url, file_name)
Second: you use the same out_file both as the input to process() and as the file you write the result to. open(out_file, 'w') runs first and deletes all content in the file, so textract then processes an empty file. Process the file first and only then open it for writing.
data = textract.process(out_file, extension='pdf', method='pdftotext', encoding="utf_8")
with open(out_file, "wb") as textfile: # save bytes
    textfile.write(data)
Alternatively, write the result under a different name (e.g. with the extension .txt).
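The truncation described in the second point can be reproduced with the standard library alone. This is a minimal sketch; the file name article.pdf and the helper names are made up for the demonstration, and no real PDF is involved.

```python
import os
import tempfile


def size_after_open_for_write(path):
    """Open an existing file with mode "w" and report its size while it is open."""
    with open(path, "w"):
        # Mode "w" truncates the file as soon as it is opened,
        # before anything has been written.
        return os.path.getsize(path)


def demo():
    """Return the file size before and after opening the file with mode "w"."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "article.pdf")
        with open(path, "wb") as f:
            f.write(b"%PDF-1.4 pretend content")
        before = os.path.getsize(path)
        during = size_after_open_for_write(path)
        return before, during
```

Calling demo() returns a non-zero size followed by zero, which is why the input has to be processed before the output file is opened.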
Full working example with other small changes
import os
from urllib.request import urlretrieve
import metapub
import textract
#another_path = '/content/Articles/'
another_path = './'
pmid_list = ['35566889','33538053', '30848212']
for query in pmid_list:
    print('query:', query)
    url = metapub.FindIt(query).url
    print('url:', url)
    if url:
        try:
            out_file = os.path.join(another_path, query)
            print('out_file:', out_file)
            print('... downloading')
            urlretrieve(url, out_file + '.pdf')
            print('... processing')
            data = textract.process(out_file + '.pdf', extension='pdf', method='pdftotext', encoding="utf_8")
            print('... saving')
            with open(out_file + '.txt', "wb") as textfile: # save bytes
                textfile.write(data)
            print('... OK')
        except Exception as ex:
            print('Exception:', ex)
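As a side note on the first point, urlretrieve returns a (filename, headers) tuple, so the saved path can also be taken from the return value. A small sketch, assuming nothing beyond the standard library; the wrapper name download is hypothetical.

```python
from urllib.request import urlretrieve


def download(url, dest):
    """Save url to dest and return the path that urlretrieve reports."""
    # urlretrieve returns a (filename, headers) tuple; with an explicit
    # destination the filename is the one that was passed in.
    saved_path, _headers = urlretrieve(url, dest)
    return saved_path
```

The same call works with a file:// URL, which makes it easy to try without network access.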
When I pass the file name directly as below, data is being written to the output file.
Rpt_file_wfl = open('output.csv','a')
Rpt_file_wfl.write(output)
But when I pass the filename as a variable, the file is getting created but there is no data.
OUT_PATH = E:\MYDRIVE
outDir = py_script
outFiles = output.csv
Rpt_file_wfl = open(OUT_PATH+outDir+outFiles[0],'a')
Rpt_file_wfl.write(output)
I do close the file in the end.
Why would the data not be written with the above code?
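Try building the path with os.path.join:
import os
output_text = 'some text'
# keep the trailing separator: os.path.join('E:', 'Mydrive') would give
# the drive-relative path 'E:Mydrive' on Windows
drive_path = 'E:\\'
drive_dir = 'Mydrive'
out_dir = 'py_script'
out_file = 'output.csv'
full_path = os.path.join(drive_path, drive_dir, out_dir, out_file)
with open(full_path, 'a', encoding='utf-8') as file:
    file.write(output_text)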
If that doesn't work, try replacing the path separators with .replace(), like:
full_path = full_path.replace('/', '\\')
Or else:
full_path = full_path.replace('\\', '/')
Here's an example of working code:
OUT_PATH='D:\\output\\'
outDir='scripts\\'
outFiles=['1.csv', '2.csv']
path = OUT_PATH+outDir+outFiles[0]
output='Example output'
with open(path, 'a') as file:
    file.write(output)
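One possible cause, assuming outFiles in the question is a plain string rather than a list: indexing a string with [0] yields only its first character, so the data may be going to a differently named file. Concatenating path parts without separators has a similar effect. The sketch below uses ntpath (the Windows flavor of os.path) so that the result is the same on any platform; the directory names come from the question and build_path is a hypothetical helper.

```python
import ntpath  # Windows implementation of os.path, importable on any OS


def build_path(base, sub_dir, file_name):
    """Join path parts with Windows separators."""
    return ntpath.join(base, sub_dir, file_name)


out_files = 'output.csv'   # a string, not a list
first_item = out_files[0]  # first character only

concatenated = 'E:\\MYDRIVE' + 'py_script' + out_files  # separators missing
joined = build_path('E:\\MYDRIVE', 'py_script', out_files)

# Printing repr() of the path before open() shows exactly which file is used.
print(repr(first_item))
print(repr(concatenated))
print(repr(joined))
```

Printing the final path right before the open() call is usually the quickest way to see whether the data landed in an unexpected file.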
I was attempting to write code using the csv, os, and PyPDF2 packages to extract text from numerous pdf files within a directory and then place the data in a csv. The following code illustrates my efforts (it runs but provides no output):
import PyPDF2
import csv
import os
for filename in os.listdir(os.getcwd()):
    if filename.endswith('.pdf'):
        pdfFileobject = open(filename, 'rb')
        pdfUnderstander = PyPDF2.PdfFileReader(pdfFileObject)
        numberpages = pdfUnderstander.getNumPages()
        increment = 0
        text = ""
        while increment < numberpages:
            pdfPage = pdfUnderstander.getPage(increment)
            increment += 1
            text += pdfPage.extractText()
        print(text)
I have not gotten to the CSV part yet because the part above does not work, but I would also like some advice on how the text could be stored in a CSV.
I think the mistake is in how the opened file is referenced: the variable name is spelled inconsistently (pdfFileobject vs. pdfFileObject).
**pdfFileobject** = open(filename, 'rb')
pdfUnderstander = PyPDF2.PdfFileReader(**pdfFileObject**)
Try this code:
import os
import PyPDF2
path = r'Dir contains PDFs'  # directory that contains the PDF files
for filename in os.listdir(path):
    if filename.split(".")[-1] == 'pdf':
        print(filename)
        pdfFileObject = open(os.path.join(path, filename), 'rb')
        pdfUnderstander = PyPDF2.PdfFileReader(pdfFileObject)
        numberpages = pdfUnderstander.getNumPages()
        increment = 0
        text = ""
        while increment < numberpages:
            pdfPage = pdfUnderstander.getPage(increment)
            increment += 1
            text += pdfPage.extractText()
        pdfFileObject.close()
        print(text)
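For the CSV part that the question asks about, one option is to collect the extracted text per file and write one row per PDF with the csv module. This is a sketch under assumptions: the function name write_texts_to_csv and the two-column layout (filename, text) are made up here, and the dictionary is expected to be filled by the extraction loop above.

```python
import csv


def write_texts_to_csv(texts_by_file, csv_path):
    """Write one row per PDF: file name and extracted text."""
    # newline='' lets the csv module control line endings itself.
    with open(csv_path, 'w', newline='', encoding='utf-8') as csv_file:
        writer = csv.writer(csv_file)
        writer.writerow(['filename', 'text'])
        for filename, text in sorted(texts_by_file.items()):
            writer.writerow([filename, text])
```

The csv module quotes commas and line breaks inside the text, so multi-page text can go into a single cell.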
The code I am working with takes in a .pdf file and outputs a .txt file. My question is, how do I create a loop (probably a for loop) that runs the code over and over again on all files in a folder that end in ".pdf"? Furthermore, how do I change the output each time the loop runs so that I write a new file each time, with the same name as the input file (i.e. 1_pet.pdf > 1_pet.txt, 2_pet.pdf > 2_pet.txt, etc.)?
Here is the code so far:
path="2_pet.pdf"
content = getPDFContent(path)
encoded = content.encode("utf-8")
text_file = open("Output.txt", "w")
text_file.write(encoded)
text_file.close()
The following script solves your problem:
import os
sourcedir = 'pdfdir'
for f in os.listdir(sourcedir):
    # splitext copes with file names that contain several dots or none
    root, ext = os.path.splitext(f)
    if ext == ".pdf":
        path_in = os.path.join(sourcedir, f)
        content = getPDFContent(path_in)
        encoded = content.encode("utf-8")
        path_out = os.path.join(sourcedir, root + ".txt")
        text_file = open(path_out, 'w')
        text_file.write(encoded)
        text_file.close()
Create a function that encapsulates what you want to do to each file.
import os.path
def parse_pdf(filename):
    "Parse a pdf into text"
    content = getPDFContent(filename)
    encoded = content.encode("utf-8")
    ## split off the pdf extension to add .txt instead.
    (root, _) = os.path.splitext(filename)
    text_file = open(root + ".txt", "w")
    text_file.write(encoded)
    text_file.close()
Then apply this function to a list of filenames, like so:
for f in files:
    parse_pdf(f)
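The list of filenames is not defined in the snippet above. One way to build it and apply a per-file function is sketched below. convert_all is a hypothetical helper, and fake_convert is only a stand-in for parse_pdf so that the sketch runs without a PDF library.

```python
import glob
import os


def convert_all(directory, convert):
    """Call convert(path) for every .pdf file directly inside directory."""
    pattern = os.path.join(directory, '*.pdf')
    converted = []
    for path in sorted(glob.glob(pattern)):
        convert(path)
        converted.append(path)
    return converted


def fake_convert(path):
    """Hypothetical stand-in for parse_pdf: writes a .txt next to the PDF."""
    root, _ = os.path.splitext(path)
    with open(root + '.txt', 'w', encoding='utf-8') as text_file:
        text_file.write('text extracted from ' + os.path.basename(path))
```

With the real function this would be called as convert_all('pdfdir', parse_pdf), which turns 1_pet.pdf into 1_pet.txt and so on.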
One way to operate on all PDF files in a directory is to invoke glob.glob() and iterate over the results:
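import glob
import os
for path in glob.glob('*.pdf'):
    content = getPDFContent(path)
    encoded = content.encode("utf-8")
    # name the output after the input, e.g. 1_pet.pdf -> 1_pet.txt
    root, _ = os.path.splitext(path)
    text_file = open(root + ".txt", "w")
    text_file.write(encoded)
    text_file.close()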
Another way is to allow the user to specify the files:
import sys
for path in sys.argv[1:]:
    ...
Then the user runs your script like python foo.py *.pdf.
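That command relies on the shell expanding *.pdf. On Windows, cmd.exe passes the pattern through unchanged, so a script that should work there can expand the arguments itself. A minimal sketch; the helper name expand_args is hypothetical.

```python
import glob
import sys


def expand_args(args):
    """Expand wildcard patterns that the shell left untouched."""
    paths = []
    for arg in args:
        matches = sorted(glob.glob(arg))
        # Keep the argument as-is when it matches nothing, so a later
        # open() reports the missing file instead of silently skipping it.
        paths.extend(matches if matches else [arg])
    return paths


if __name__ == '__main__':
    for path in expand_args(sys.argv[1:]):
        print(path)
```

Arguments that the shell has already expanded are plain file names and pass through unchanged.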
You could use a recursive function to search the folder and all of its subfolders for files that end with .pdf, then create a text file for each of them.
It could be something like:
import os
def convert_PDF(path, func):
    d = os.path.basename(path)
    if os.path.isdir(path):
        # recurse into every entry of the directory
        for x in os.listdir(path):
            convert_PDF(os.path.join(path, x), func)
    elif d[-4:] == '.pdf':
        func(path)
# based entirely on your example code
def convert_to_txt(path):
    content = getPDFContent(path)
    encoded = content.encode("utf-8")
    file_path = os.path.dirname(path)
    # replace pdf with txt extension
    file_name = os.path.basename(path)[:-4]+'.txt'
    text_file = open(os.path.join(file_path, file_name), "w")
    text_file.write(encoded)
    text_file.close()
convert_PDF('path/to/files', convert_to_txt)
Because the actual operation is changeable, you can replace the function with whatever operation you need to perform (like using a different library, converting to a different type, etc.)
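The standard library's os.walk performs the same traversal without explicit recursion. A sketch of that alternative, not the approach used above; the generator name find_pdfs is hypothetical.

```python
import os


def find_pdfs(top):
    """Yield the path of every .pdf file under top, including subfolders."""
    for dirpath, _dirnames, filenames in os.walk(top):
        for name in filenames:
            # splitext handles names with several dots; lower() also accepts .PDF
            if os.path.splitext(name)[1].lower() == '.pdf':
                yield os.path.join(dirpath, name)
```

The conversion function can then be applied in a plain loop, for example `for path in find_pdfs('path/to/files'): convert_to_txt(path)`.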
I'm trying to create an HTML file and then convert it to a PDF file using wkhtmltopdf (http://wkhtmltopdf.org/):
inputfilename = "/tmp/inputfile.html"
outputfilename = "/tmp/outputfile.pdf"
f = open(inputfilename, 'w')
f.write(html)
f.close()
f1 = open(outputfilename, 'w')
ret = convert2pdf(f,outputfilename)
f1.close()
In convert2pdf I'm doing:
def convert2pdf(htmlfilename,outputpdf):
    import subprocess
    commands_to_run = ['/wkhtmltopdf-amd64','htmlfilename', 'outputpdf']
    subprocess.call(commands_to_run)
Both input and output files are created on the fly. The input file is perfect, but the output PDF created by wkhtmltopdf is empty. Can you suggest what I am doing wrong?
I think you just have to change
commands_to_run = ['/wkhtmltopdf-amd64','htmlfilename', 'outputpdf']
to
commands_to_run = ['/wkhtmltopdf-amd64', htmlfilename, outputpdf]
and instead of
ret = convert2pdf(f,outputfilename)
do
ret = convert2pdf(inputfilename, outputfilename)
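The difference between the quoted and unquoted list entries can be checked without wkhtmltopdf by launching a child Python process that prints the arguments it receives. This is only a demonstration sketch; echo_args is a hypothetical helper and the paths are the ones from the question.

```python
import subprocess
import sys


def echo_args(*args):
    """Run a child Python process that prints the arguments it received."""
    command = [sys.executable, '-c', 'import sys; print("|".join(sys.argv[1:]))']
    command.extend(args)
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout.strip()


htmlfilename = '/tmp/inputfile.html'
outputpdf = '/tmp/outputfile.pdf'

# Quoted: the child sees the literal words, not the paths.
quoted = echo_args('htmlfilename', 'outputpdf')
# Unquoted: the child sees the values stored in the variables.
unquoted = echo_args(htmlfilename, outputpdf)
```

With the quoted form, wkhtmltopdf is asked to convert a file literally named htmlfilename, which does not exist.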