I'm trying to create an HTML file and then convert it to a PDF file using wkhtmltopdf (http://wkhtmltopdf.org/):
inputfilename = "/tmp/inputfile.html"
outputfilename = "/tmp/outputfile.pdf"
f = open(inputfilename, 'w')
f.write(html)
f.close()
f1 = open(outputfilename, 'w')
ret = convert2pdf(f,outputfilename)
f1.close()
In convert2pdf I'm doing:
def convert2pdf(htmlfilename,outputpdf):
    import subprocess
    commands_to_run = ['/wkhtmltopdf-amd64','htmlfilename', 'outputpdf']
    subprocess.call(commands_to_run)
Both the input and output files are created on the fly. The input file is perfect, but the output PDF created by wkhtmltopdf is empty. Can you suggest what I am doing wrong?
I think you just have to change
commands_to_run = ['/wkhtmltopdf-amd64','htmlfilename', 'outputpdf']
to
commands_to_run = ['/wkhtmltopdf-amd64', htmlfilename, outputpdf]
and instead of
ret = convert2pdf(f,outputfilename)
do
ret = convert2pdf(inputfilename, outputfilename)
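Putting both changes together, a minimal sketch might look like the following. It assumes the wkhtmltopdf binary lives at /wkhtmltopdf-amd64 as in the question. The helper names build_command and write_and_convert are hypothetical, introduced only for illustration. There should be no need to open the output file beforehand, since wkhtmltopdf writes it itself.

```python
import subprocess

WKHTMLTOPDF = "/wkhtmltopdf-amd64"  # binary path taken from the question


def build_command(html_path, pdf_path, binary=WKHTMLTOPDF):
    """Return the argument list; the variables are passed, not quoted names."""
    return [binary, html_path, pdf_path]


def convert2pdf(html_path, pdf_path):
    """Run wkhtmltopdf on an HTML file that is already written and closed."""
    # subprocess.call returns the exit status of the child process.
    return subprocess.call(build_command(html_path, pdf_path))


def write_and_convert(html, html_path, pdf_path):
    """Write the HTML to disk, close it, then convert it (hypothetical helper)."""
    # Leaving the with block closes the file, so the content is flushed
    # to disk before wkhtmltopdf reads it.
    with open(html_path, "w") as f:
        f.write(html)
    return convert2pdf(html_path, pdf_path)
```

Calling write_and_convert(html, "/tmp/inputfile.html", "/tmp/outputfile.pdf") would then mirror the flow in the question, with file paths passed instead of a file object.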
How to save image files after generating them in a Python script?
def mols_to_pngs(mols, basename = "test"):
    filenames = []
    for i, mol in enumerate(mols):
        filename = "%s%d.png" % (basename, i)
        Draw.MolToFile(mol,filename)
        filenames.append(filename)
    return filenames
I want to run this process automatically using a CSV file and a Python script.
To automatically save image files generated from a python script using a csv file, you need to read the data from the csv file and process it in your script. Here's a basic example:
import csv
import sys
from rdkit.Chem import Draw  # assumption: Draw is RDKit's drawing module
def mols_to_pngs(mols, basename = "test"):
    filenames = []
    for i, mol in enumerate(mols):
        filename = "%s%d.png" % (basename, i)
        Draw.MolToFile(mol,filename)
        filenames.append(filename)
    return filenames
# Read the csv file
csv_path = 'data.csv'
mols = []
try:
    with open(csv_path, 'r', newline='') as file:
        reader = csv.reader(file)
        # Skip the header row
        next(reader)
        for row in reader:
            # Process the data from the csv file
            mol = process_data(row)
            mols.append(mol)
except IOError:
    # Report the path; the file object does not exist if open() failed
    print("Could not read file:", csv_path)
    sys.exit(1)
# Save the image files
filenames = mols_to_pngs(mols)
print("Saved the following files:", filenames)
Note that you need to replace the process_data function with your own implementation to process the data from the csv file and create the mol objects. Additionally, you may need to modify the code to match your specific use case.
When I pass the file name directly as below, data is being written to the output file.
Rpt_file_wfl = open('output.csv','a')
Rpt_file_wfl.write(output)
But when I pass the filename as a variable, the file is getting created but there is no data.
OUT_PATH = E:\MYDRIVE
outDir = py_script
outFiles = output.csv
Rpt_file_wfl = open(OUT_PATH+outDir+outFiles[0],'a')
Rpt_file_wfl.write(output)
I do close the file in the end.
Why would the data not be written with the above code?
Try using os.path:
import os
output_text = 'some text'
drive_path = 'E:\\'  # keep the separator; a bare 'E:' joins to the drive-relative path 'E:Mydrive\...'
drive_dir = 'Mydrive'
out_dir = 'py_script'
out_file = 'output.csv'
full_path = os.path.join(drive_path, drive_dir, out_dir, out_file)
with open(full_path, 'a', encoding='utf-8') as file:
    file.write(output_text)
If that doesn't work, try replacing the path delimiters with .replace(), like:
full_path = full_path.replace('/', '\\')
Or else:
full_path = full_path.replace('\\', '/')
Here's an example of working code:
OUT_PATH='D:\\output\\'
outDir='scripts\\'
outFiles=['1.csv', '2.csv']
path = OUT_PATH+outDir+outFiles[0]
output='Example output'
with open(path, 'a') as file:
    file.write(output)
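To see why the concatenation in the question can end up pointing at an unexpected file, here is a small sketch. It uses ntpath (the Windows flavour of os.path) so that the result is the same on any platform. The values are quoted versions of the ones in the question, which is an assumption, since the original assignments were written without quotes.

```python
import ntpath  # Windows implementation of os.path, importable on any OS

OUT_PATH = "E:\\MYDRIVE"
outDir = "py_script"
outFiles = "output.csv"  # a plain string, as it appears in the question

# Plain concatenation adds no separators, and indexing a string
# yields a single character rather than a file name.
concatenated = OUT_PATH + outDir + outFiles[0]
print(concatenated)

# Joining the parts inserts the separators and uses the whole file name.
joined = ntpath.join(OUT_PATH, outDir, outFiles)
print(joined)
```

If the question's variables really were plain strings like these, the data would have gone to a file whose name is the concatenated string rather than to output.csv, which may explain why the expected file looked empty.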
I have been trying to convert a number of DOCX files into TXT.
It works for a single file using the code below:
import docx
def getText(filename):
    doc = docx.Document(filename)
    fullText = []
    for para in doc.paragraphs:
        fullText.append(para.text)
    return '\n'.join(fullText)
if __name__ == '__main__':
    filename='/content/drive/My Drive/path/file.DOCX' #file name
    fullText=getText(filename)
    print (fullText)
    file = open("copy.txt", "w")
    file.write(fullText)
    file.close()
I tried different options (e.g. glob) but did not manage to get the above operation to run on all files in a folder.
Ideally the output should be 1 large text file and not separate ones.
I will need to do some formatting and assigning of IDs in that file in a next step.
Thank you for your help!
With file = open("copy.txt", "w") you open the file and replace its content with write().
With file = open("copy.txt", "a") you append to the file with write(); the file is created if it doesn't exist yet.
With file = open("copy.txt", "a+") you get the same appending behaviour and can also read from the file.
To go through all files in a folder you can loop over them:
import os
import docx
def getText(filename):
    doc = docx.Document(filename)
    fullText = []
    for para in doc.paragraphs:
        fullText.append(para.text)
    return '\n'.join(fullText)
if __name__ == '__main__':
    foldername='/content/drive/My Drive/path/' #folder name
    all_files = os.listdir(foldername) #get all filenames
    #get .docx filenames; lower() also matches an upper-case .DOCX extension
    docx_files = [ filename for filename in all_files if filename.lower().endswith('.docx') ]
    file = open("copy.txt", "a+")
    for docx_file in docx_files: #loop over .docx files
        #pass the loop variable, joined with the folder, to getText
        fullText=getText(os.path.join(foldername, docx_file))
        file.write(fullText + '\n') #newline so consecutive documents do not run together
    file.close()
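Since the goal is one large text file that will get IDs assigned later, a variation on the loop above could write a marker line before each document. This is only a sketch: the function name combine_documents and the "###" marker format are hypothetical, and extract_text stands for any function that takes a path and returns a string, such as getText.

```python
import os


def combine_documents(folder, extract_text, out_path, suffix=".docx"):
    """Write the text of every matching file in folder into one output file."""
    names = sorted(
        name for name in os.listdir(folder)
        if name.lower().endswith(suffix)  # case-insensitive extension match
    )
    # "w" starts from an empty file, so re-running does not duplicate content.
    with open(out_path, "w", encoding="utf-8") as out:
        for name in names:
            text = extract_text(os.path.join(folder, name))
            # Hypothetical marker line, meant to help with assigning IDs later.
            out.write("### %s\n%s\n" % (name, text))
    return names
```

It could be called as combine_documents(foldername, getText, "copy.txt"). Opening with "w" here is a design choice: the combined file is rebuilt on every run instead of growing each time.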
The code I am working with takes in a .pdf file and outputs a .txt file. How do I create a loop (probably a for loop) that runs the code on every file in a folder whose name ends in ".pdf"? Furthermore, how do I change the output each time the loop runs so that it writes a new file with the same name as the input file (i.e. 1_pet.pdf > 1_pet.txt, 2_pet.pdf > 2_pet.txt, etc.)?
Here is the code so far:
path="2_pet.pdf"
content = getPDFContent(path)
encoded = content.encode("utf-8")
text_file = open("Output.txt", "w")
text_file.write(encoded)
text_file.close()
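The following script should solve your problem:
import os
sourcedir = 'pdfdir'
dl = os.listdir(sourcedir)
for f in dl:
    # splitext copes with names that contain several dots or none at all
    root, ext = os.path.splitext(f)
    if ext.lower() == ".pdf":
        path_in = os.path.join(sourcedir, f)  # join with the directory, not the list
        content = getPDFContent(path_in)
        encoded = content.encode("utf-8")
        path_out = os.path.join(sourcedir, root + ".txt")
        text_file = open(path_out, 'wb')  # encoded is bytes, so write in binary mode
        text_file.write(encoded)
        text_file.close()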
Create a function that encapsulates what you want to do to each file.
import os.path
def parse_pdf(filename):
    "Parse a pdf into text"
    content = getPDFContent(filename)
    encoded = content.encode("utf-8")
    ## split off the pdf extension to add .txt instead.
    (root, _) = os.path.splitext(filename)
    text_file = open(root + ".txt", "wb")  # encoded is bytes, so write in binary mode
    text_file.write(encoded)
    text_file.close()
Then apply this function to a list of filenames, like so:
for f in files:
    parse_pdf(f)
One way to operate on all PDF files in a directory is to invoke glob.glob() and iterate over the results:
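import glob
import os
for path in glob.glob('*.pdf'):
    content = getPDFContent(path)
    encoded = content.encode("utf-8")
    # name the output after the input, e.g. 1_pet.pdf -> 1_pet.txt
    root, _ = os.path.splitext(path)
    text_file = open(root + ".txt", "wb")  # encoded is bytes, so write in binary mode
    text_file.write(encoded)
    text_file.close()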
Another way is to allow the user to specify the files:
import sys
for path in sys.argv[1:]:
    ...
Then the user runs your script like python foo.py *.pdf.
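Whichever way the list of paths is obtained, the output name can be derived from the input name. Here is a minimal sketch using pathlib (Python 3); the function names output_path_for and plan_outputs are hypothetical and only illustrate the naming step, not the PDF conversion itself.

```python
from pathlib import Path


def output_path_for(pdf_path):
    """Return the .txt path that sits next to the given .pdf path."""
    return Path(pdf_path).with_suffix(".txt")


def plan_outputs(paths):
    """Pair each input ending in .pdf (any case) with its output path."""
    return [
        (Path(p), output_path_for(p))
        for p in paths
        if Path(p).suffix.lower() == ".pdf"
    ]
```

In a script, plan_outputs(sys.argv[1:]) or plan_outputs(glob.glob('*.pdf')) would give the (input, output) pairs to loop over.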
You could use a recursive function to search the folder and all of its subfolders for files that end with .pdf, and then create a text file for each of them.
It could be something like:
import os
def convert_PDF(path, func):
    d = os.path.basename(path)
    if os.path.isdir(path):
        # recurse into every entry of the directory
        for x in os.listdir(path):
            convert_PDF(os.path.join(path, x), func)
    elif d[-4:] == '.pdf':
        func(path)
# based entirely on your example code
def convert_to_txt(path):
    content = getPDFContent(path)
    encoded = content.encode("utf-8")
    file_path = os.path.dirname(path)
    # replace pdf with txt extension
    file_name = os.path.basename(path)[:-4]+'.txt'
    text_file = open(os.path.join(file_path, file_name), "wb")  # encoded is bytes
    text_file.write(encoded)
    text_file.close()
convert_PDF('path/to/files', convert_to_txt)
Because the actual operation is changeable, you can replace the function with whatever operation you need to perform (like using a different library, converting to a different type, etc.)
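The same idea can also be written without explicit recursion by letting os.walk visit the subfolders. This is a sketch rather than a drop-in replacement: the name apply_to_pdfs is hypothetical, and func is any callable that accepts a single path, such as convert_to_txt.

```python
import os


def apply_to_pdfs(root, func):
    """Call func(path) for every .pdf file under root, including subfolders."""
    handled = []
    # os.walk yields (directory, subdirectories, files) for each level.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if name.lower().endswith(".pdf"):
                path = os.path.join(dirpath, name)
                func(path)
                handled.append(path)
    return handled
```

For example, apply_to_pdfs('path/to/files', convert_to_txt) would visit the same files, and the returned list shows which paths were processed.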
I am trying to rename a list of PDF files by extracting the name from each file using pyPdf. I tried to use a for loop to rename the files, but I always get an error with code 32 saying that the file is being used by another process. I am using Python 2.7.
Here's my code:
import os, glob
from pyPdf import PdfFileWriter, PdfFileReader
# this function extracts the name of the file
def getName(filepath):
    output = PdfFileWriter()
    input = PdfFileReader(file(filepath, "rb"))
    output.addPage(input.getPage(0))
    outputStream = file(filepath + '.txt', 'w')
    output.write(outputStream)
    outputStream.close()
    outText = open(filepath + '.txt', 'rb')
    textString = outText.read()
    outText.close()
    nameStart = textString.find('default">')
    nameEnd = textString.find('_SATB', nameStart)
    nameEnd2 = textString.find('</rdf:li>', nameStart)
    if nameStart:
        testName = textString[nameStart+9:nameEnd]
        if len(testName) <= 100:
            name = testName + '.pdf'
        else:
            name = textString[nameStart+9:nameEnd2] + '.pdf'
    return name
pdfFiles = glob.glob('*.pdf')
m = len(pdfFiles)
for each in pdfFiles:
    newName = getName(each)
    os.rename(each, newName)
Consider using Python's with statement. With it you do not need to handle closing the file yourself:
def getName(filepath):
    output = PdfFileWriter()
    with file(filepath, "rb") as pdfFile:
        input = PdfFileReader(pdfFile)
        ...
You're not closing the input stream (the file) used by the pdf reader.
Thus, when you try to rename the file, it's still open.
So, instead of this:
input = PdfFileReader(file(filepath, "rb"))
Try this:
inputStream = file(filepath, "rb")
input = PdfFileReader(inputStream)
(... when done with this file...)
inputStream.close()
It does not look like you close the file object associated with the PDF reader object. It may be closed automatically at the end of the function, but to be sure you might want to create a separate file object, pass it to the PdfFileReader, and close the file handle when you are done. Then rename.
The code below is from the SO question "How to close pyPDF "PdfFileReader" Class file handle":
import os.path
from pyPdf import PdfFileReader
fname = 'my.pdf'
fh = file(fname, "rb")
input = PdfFileReader(fh)
fh.close()
os.rename(fname, 'my_renamed.pdf')
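The ordering that matters here (finish reading, close the handle, then rename) can be shown without pyPdf. The sketch below is written for Python 3 and uses the built-in open(), since the file() builtin used in the question only exists in Python 2. The function name read_then_rename is hypothetical, and the plain read() stands in for the PDF parsing step.

```python
import os


def read_then_rename(old_path, new_path):
    """Read a file inside a with block, then rename it once the handle is closed."""
    with open(old_path, "rb") as handle:
        data = handle.read()  # stand-in for the PDF parsing step
    # The with block has ended, so the handle is closed and our own
    # process no longer holds the file open when it is renamed.
    os.rename(old_path, new_path)
    return data
```

Everything that needs the open file has to happen inside the with block; only the rename is placed after it.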