How can I merge two PDF files without overlapping content


Using another Stack Overflow question and answer (Merge PDF files), I was able to find code that partially does what I am trying to do.
However, this modified code results in the contents of the two PDFs overlapping each other. I am trying to stack them, that is, vertically concatenate the results:
Example:
PDF1 Contents -> "Hello World"
PDF2 Contents -> "I am Bill"
Code below results in the following overlapping image:
Desired results would look as follows:
Code used, resulting in the overlapping image:
import os
import pdfrw
dirPATH = r'c:\users\<username>\projects\concat_pdfs'
pdf1 = os.path.join(dirPATH, 'PDF1.pdf')
pdf2 = os.path.join(dirPATH, 'PDF2.pdf')
def concat_pdfs(pdf1, pdf2, output):
    form = pdfrw.PdfReader(pdf1)
    olay = pdfrw.PdfReader(pdf2)
    for form_page, overlay_page in zip(form.pages, olay.pages):
        merge_obj = pdfrw.PageMerge()
        overlay = merge_obj.add(overlay_page)[0]
        # The overlay is drawn onto the form page at the same position, hence the overlap
        pdfrw.PageMerge(form_page).add(overlay).render()
    writer = pdfrw.PdfWriter()
    writer.write(output, form)
concat_pdfs(pdf1, pdf2, 'result.pdf')
Thanks in advance!

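Have you tried appending the pages of both files to a single writer? Each page keeps its own contents, so nothing overlaps:
import pdfrw
def combine_pdfs(dir_path1, dir_path2, save_path):
    pdf1 = pdfrw.PdfReader(dir_path1)
    pdf2 = pdfrw.PdfReader(dir_path2)
    pdf_writer = pdfrw.PdfWriter()
    # Add every page of the first file, then every page of the second
    for page in pdf1.pages:
        pdf_writer.addpage(page)
    for page in pdf2.pages:
        pdf_writer.addpage(page)
    pdf_writer.write(save_path)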
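The function above puts the two documents on separate pages, one after the other. If the goal is a single page with PDF1 printed above PDF2, one possible approach is to place both pages in one pdfrw.PageMerge and shift the upper one up by the height of the lower one. The following is a minimal sketch rather than a tested drop-in: the names stack_offsets and stack_first_pages are made up for illustration, only the first page of each file is handled, and pdfrw is assumed to be installed.

```python
def stack_offsets(heights):
    """Return the y offset for each page so that they stack without overlap.

    `heights` lists page heights from top to bottom. PDF coordinates start at
    the bottom-left corner, so the last (lowest) page gets offset 0.
    """
    offsets = []
    y = 0
    for height in reversed(heights):
        offsets.append(y)
        y += height
    offsets.reverse()
    return offsets


def stack_first_pages(top_path, bottom_path, output_path):
    """Write a one-page PDF with top_path's first page above bottom_path's."""
    import pdfrw  # third-party; imported here so stack_offsets works without it

    top_page = pdfrw.PdfReader(top_path).pages[0]
    bottom_page = pdfrw.PdfReader(bottom_path).pages[0]

    merged = pdfrw.PageMerge()
    merged.add(top_page)
    merged.add(bottom_page)

    # Each added page becomes a rectangle with x, y, w and h attributes.
    offsets = stack_offsets([rect.h for rect in merged])
    for rect, y in zip(merged, offsets):
        rect.y += y

    writer = pdfrw.PdfWriter()
    writer.addpage(merged.render())
    writer.write(output_path)
```

Calling stack_first_pages('PDF1.pdf', 'PDF2.pdf', 'result.pdf') would then be expected to produce one taller page with "Hello World" above "I am Bill".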

Here's an example using the PyPDF2 library:
# `files` is used here as a mapping of filename -> uploaded file object (it looks like Flask's request.files)
merger = PdfFileMerger()
for filename in files:
    f = files[filename]
    loc = "/tmp/" + secure_filename(filename).replace(".pdf", "") + "_" + str(time.time()) + ".pdf"
    f.save(loc)
    f.close()
    reader = PdfFileReader(loc)  # takes a path or file object; there is no file-mode argument
    merger.append(reader)
dest = "/tmp/merged_" + str(time.time()) + ".pdf"
merger.write(dest)
Here is another using pikepdf:
pdf = Pdf.new()
for filename in files:
    f = files[filename]
    loc = "/tmp/" + secure_filename(filename).replace(".pdf", "") + "_" + str(time.time()) + ".pdf"
    f.save(loc)
    f.close()
    reader = Pdf.open(loc)
    pdf.pages.extend(reader.pages)
dest = "/tmp/merged_" + str(time.time()) + ".pdf"
pdf.save(dest)
Imports might look something like:
import time
from PyPDF2 import PdfFileMerger, PdfFileReader  # named PdfMerger / PdfReader in newer PyPDF2 releases
from werkzeug.utils import secure_filename
from pikepdf import Pdf

Related

adding a prefix to a file name while downloading from URL via a CSV sheet with pandas

I am trying to add a prefix from one column to a filename. This code downloads a picture from each URL listed in a CSV. The saved file name should be the prefix + filename.
The CSV has two columns: the first holds the URLs and the second holds a unique ID named "idnumber".
It's been a while since I coded, and I may be mixing SQL with Python...
Here is what I have so far:
import pandas as pd
import urllib.request
def url_to_jpg(i, url, file_path):
    filename = 'idname-image-{}.jpg'.format(i)
    full_path = '{}{}'.format(file_path, filename)
    urllib.request.urlretrieve(url, full_path)
    print('{} saved.'.format(filename))
    return None
FILENAME = 'Image_Download.csv'
FILE_PATH = 'images/'
urls = pd.read_csv(FILENAME)
idnumber = urls.idnumber
for i, URL in enumerate(urls.values):
    url_to_jpg(i, URL[0], FILE_PATH)

Pass the value from the idnumber column into the function and use it as the prefix:
import time
import urllib.request
import pandas as pd
def url_to_jpg(i, prefix, url, file_path):
    time.sleep(5)  # pause between downloads
    filename = '{}-image-{}.jpg'.format(prefix, i)
    full_path = '{}{}'.format(file_path, filename)
    urllib.request.urlretrieve(url, full_path)
    print('{} saved.'.format(filename))
FILENAME = 'Image_Download.csv'
FILE_PATH = 'images/'
df = pd.read_csv(FILENAME)
prefixes = df.idnumber
for i, row in enumerate(df.values):
    url_to_jpg(i, prefixes[i], row[0], FILE_PATH)
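To keep the naming rule separate from the download itself, the file name can be built by a small helper that is easy to check on its own. This is only a sketch: prefixed_filename is a hypothetical helper, the example URL is invented, and it assumes that the last path segment of each URL is a usable file name.

```python
import posixpath
from urllib.parse import urlparse


def prefixed_filename(prefix, url, default="image.jpg"):
    """Return '<prefix>-<file name taken from the URL>'."""
    # URL paths always use forward slashes, so posixpath is used on every OS.
    # The last path segment is assumed to be the original file name.
    original = posixpath.basename(urlparse(url).path) or default
    return "{}-{}".format(prefix, original)


print(prefixed_filename(1001, "https://example.com/photos/cat.jpg"))
```

The result can be joined with the target folder and handed to urllib.request.urlretrieve as the destination path.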

Python: Change file names to the names of people in a list

I have a couple slides, each slide corresponds to a person. I need to name each file (.pptx) after the individual name it references. A lot of the examples I see on mass renaming have the renaming become sequential like:
file1
file2
file3
I need:
bob.pptx
sue.pptx
jack.pptx
I was able to change the names using os, following the guide at https://www.marsja.se/rename-files-in-python-a-guide-with-examples-using-os-rename/ :
import os, fnmatch
file_path = 'C:\\Users\\Documents\\Files_To_Rename\\Many_Files\\'
files_to_rename = fnmatch.filter(os.listdir(file_path), '*.pptx')
print(files_to_rename)
new_name = 'Datafile'
for i, file_name in enumerate(files_to_rename):
    new_file_name = new_name + str(i) + '.pptx'
    os.rename(file_path + file_name,
              file_path + new_file_name)
But again, this just names it:
Datafile1
Datafile2
etc

My example:
import os
from pathlib import Path
folder = "c:\\tmp\\"
files = os.listdir(folder)
for name in files:
    print(name)
    # Rename the file, dropping any "-" from its name
    os.rename(folder + name, folder + name.replace("-", ""))
    # Creates an empty file named <old name>.ok; adapt this if you need to add an extension
    Path(folder + name + '.ok').touch()

Here's how I ran your code (avoiding file paths I don't have!), getting it to print output not just rename
import os, fnmatch
file_path = '.\\'
files_to_rename = fnmatch.filter(os.listdir(file_path), '*.pptx')
print(files_to_rename)
new_name = 'Datafile'
for i, file_name in enumerate(files_to_rename):
    new_file_name = new_name + str(i) + '.pptx'
    print (file_path + new_file_name)
    os.rename(file_path + file_name,
              file_path + new_file_name)
This gave me
.\Datafile0.pptx
.\Datafile1.pptx
...
and did give me the correct sequence of pptx files in that folder.
So I suspect the problem is that you are getting the file names you want, but you can't see them in Windows. Solution: show file types in Windows. Here's one of many available links as to how: https://www.thewindowsclub.com/show-file-extensions-in-windows

Thank you everyone for your suggestions. I think I found it with a friend's help:
import os, fnmatch
import pandas as pd
file_path = 'C:\\Users\\Documents\\FolderwithFiles\\'
# Find every .pptx in the folder (the pattern can be any extension)
files_to_rename = fnmatch.filter(os.listdir(file_path), '*.pptx')
# Names.xlsx holds the list of names in a column whose header is "Names"
df = pd.read_excel('Names.xlsx')
# zip pairs each name with a file, instead of using a nested for loop
for name, file_name in zip(df['Names'], files_to_rename):
    new_file_name = name + '.pptx'
    os.rename(file_path + file_name, file_path + new_file_name)
    print(new_file_name)
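One thing to watch with zip is that os.listdir returns files in arbitrary order, so which name lands on which file is only predictable if both lists are ordered deliberately. The sketch below uses only the standard library; plan_renames and apply_renames are hypothetical helpers, and sorting the files by name is an assumption about how the slides map to people.

```python
from pathlib import Path


def plan_renames(folder, names, pattern="*.pptx"):
    """Pair each matching file (sorted by name) with one person's name."""
    files = sorted(Path(folder).glob(pattern))
    if len(files) != len(names):
        raise ValueError("need exactly one name per file")
    # Keep the original extension and swap only the stem.
    return [(path, path.with_name(name + path.suffix))
            for path, name in zip(files, names)]


def apply_renames(plan):
    """Carry out the renames produced by plan_renames."""
    for old_path, new_path in plan:
        old_path.rename(new_path)
```

Printing the plan before calling apply_renames gives a chance to confirm the pairing before any file is touched.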

Updating .txt file with year from subfolder

I am trying to learn how to update a .txt filename when os.walk switches from files in one directory to files in another directory. I am not sure about how to do this. I tried iterating through dirs and then files, but this was unsuccessful as the .pdf files would not display. Here is a full look at the code I am working on.
The directory looks like this:
[research] -> [2014] -> Article1.pdf, article2.pdf, article3.pdf
              [2015] -> Article4.pdf, article5.pdf, article6.pdf
              [2016] -> Article7.pdf, article8.pdf, article9.pdf
from PIL import Image
import pytesseract
from pdf2image import convert_from_path
import os
pytesseract.pytesseract.tesseract_cmd = r'/usr/local/Cellar/tesseract/4.1.1/bin/tesseract'
def image_ocr(image_path, output_txt_file_name, All_text):
    image_text = pytesseract.image_to_string(
        image_path, lang='eng+ces', config='--psm 1')
    with open(output_txt_file_name, 'a', encoding='utf-8') as f:
        f.write(image_text)
    with open(All_text, 'a', encoding='utf-8') as f:
        f.write(image_text)
num = 0
year = 1973
year_being_recorded = 'txt_files/' + str(year) + '_article.txt'
cumulative_text = 'txt_files/cumulative.txt'
for root, dirs, files in os.walk('articles'):
    for file_ in files:
        if file_.endswith('.pdf'):
            article_path = str(root) + '/' + str(file_)
            pages = convert_from_path(article_path, 500)
            for page in pages:
                name = 'jpegs/a_file_' + str(num) + '.jpeg'
                page.save(name, 'JPEG')
                image_ocr(name, year_being_recorded, cumulative_text)
                num = num + 1
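One way to make the year follow the folder is to derive the output name from root inside the os.walk loop, since root is the directory that currently holds the files. The following is a minimal sketch under the assumption that each PDF sits directly in a folder named after its year; year_txt_path and pdfs_with_targets are hypothetical helpers, and the OCR step is left out.

```python
import os


def year_txt_path(root, out_dir="txt_files"):
    """Build the per-year text file path from the folder being walked."""
    year = os.path.basename(os.path.normpath(root))
    return os.path.join(out_dir, year + "_article.txt")


def pdfs_with_targets(top):
    """Yield (pdf_path, txt_path) for every PDF below `top`."""
    for root, dirs, files in os.walk(top):
        dirs.sort()  # visit the year folders in ascending order
        for file_ in sorted(files):
            if file_.lower().endswith(".pdf"):
                yield os.path.join(root, file_), year_txt_path(root)
```

Each yielded txt_path could then take the place of the fixed year_being_recorded value when calling image_ocr.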

Convert Microsoft Word document to PDF using Python

I have tons of Word and Excel files. I want to convert the many Word files in a folder and its subfolders to PDF, so I tried the following code.
The code does nothing (no Word file is converted to PDF), although it raises no error.
What could be the problem? Is there another solution?
This is my code:
import os
from win32com import client
path = 'D:\programing\test'
word_file_names = []
word = client.DispatchEx("Word.Application")
for dirpath, dirnames, filenames in os.walk(path):
    print (dirpath)
    for f in filenames:
        if f.lower().endswith(".docx") and re.search('Addendum', f):
            new_name = f.replace(".docx", r".pdf")
            in_file = word_file_names.append(dirpath + "\\" + f)
            new_file = word_file_names.append(dirpath + "\\" + new_name)
            doc = word.Documents.Open(in_file)
            doc.SaveAs(new_file, FileFormat = 17)
            doc.Close()
        if f.lower().endswith(".doc") and re.search('Addendum', f):
            new_name = f.replace(".doc", r".pdf")
            in_file = word_file_names.append(dirpath + "\\" + f)
            new_file = word_file_names.append(dirpath + "\\" + new_name)
            doc = word.Documents.Open(in_file)
            doc.SaveAs(new_file, FileFormat = 17)
            doc.Close()
word.Quit()

This is way easier:
from docx2pdf import convert
convert(word_path, pdf_path)

You can use comtypes:
from comtypes.client import CreateObject
import os
folder = "folder path"
wdToPDF = CreateObject("Word.Application")
wdFormatPDF = 17
files = os.listdir(folder)
word_files = [f for f in files if f.endswith((".doc", ".docx"))]
for word_file in word_files:
    word_path = os.path.join(folder, word_file)
    pdf_path = word_path
    if pdf_path[-3:] != 'pdf':
        pdf_path = pdf_path + ".pdf"
    # Remove a PDF left over from an earlier run
    if os.path.exists(pdf_path):
        os.remove(pdf_path)
    pdfCreate = wdToPDF.Documents.Open(word_path)
    pdfCreate.SaveAs(pdf_path, wdFormatPDF)
    pdfCreate.Close()  # close each document once it is saved
wdToPDF.Quit()  # shut down the Word instance when done

I solved this problem; the fixed code is as follows:
import os
import win32com.client
path = r'D:\programing\test'
word = win32com.client.Dispatch('Word.Application')
for dirpath, dirnames, filenames in os.walk(path):
    for f in filenames:
        # Convert both .doc and .docx files
        if f.lower().endswith((".doc", ".docx")):
            in_file = os.path.join(dirpath, f)
            new_file = os.path.splitext(in_file)[0] + ".pdf"
            doc = word.Documents.Open(in_file)
            doc.SaveAs(new_file, FileFormat=17)  # 17 is Word's PDF format
            doc.Close()
word.Quit()
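When the PDF name is derived with str.replace, as in the question, a ".doc" that appears elsewhere in the file name is replaced too, and an upper-case extension is not replaced at all. A small sketch of a helper that changes only the extension; pdf_target is a hypothetical name and the example file name is invented.

```python
import os

WORD_EXTENSIONS = (".doc", ".docx")


def pdf_target(word_path):
    """Return the .pdf path next to word_path, or None if it is not a Word file."""
    base, ext = os.path.splitext(word_path)
    if ext.lower() not in WORD_EXTENSIONS:
        return None
    return base + ".pdf"


print(pdf_target("my.doc.notes.DOCX"))
```

Files for which the helper returns None (for example the Excel files mentioned in the question) can simply be skipped in the loop.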

QR Code code inserting '[' and ']', how to stop this?

I'm trying to loop through a list of ~3,000 URLs and create QR codes for them. In one column I have the URLs and in another column I have what I want the QR code file names to be named when output as images.
The problem is the URLs that get converted to QR codes and my file names both come out encased in brackets.
For example:
URL            Filename
www.abel.com   Abel
Comes out as:
URL in QR Code   Filename of QR Code
[www.abel.com]   [Abel]
Here's my code so far:
import csv
import qrcode
import pandas as pd
df = pd.read_csv('QR_Python_Test.csv')
i = 1
x = df.iloc[[i]]
print(x.QR_Code_Name.values)
for i in df.index:
    z = df.iloc[[i]]
    x = str(z.Link_Short.values)
    qr = qrcode.QRCode(version=5, error_correction=qrcode.constants.ERROR_CORRECT_L,box_size=5,border=2,)
    qr.add_data(x)
    qr.make(fit=True)
    img = qr.make_image()
    file_name = str(z.QR_Code_Name.values) + ".png"
    print('Saving %s' % file_name)
    image_file = open(file_name, "w")
    img.save(file_name)
    image_file.close()
file.close()
And some sample data:
URL                 Filename
www.apple.com       Apple
www.google.com      Google
www.microsoft.com   Microsoft
www.linux.org       Linux
Thank you for your help,
Me

If your DataFrame contains the correct information, you can use DataFrame.itertuples.
Also, separate the steps into functions: reading the data from the file, generating the QR code, and saving the files.
That way, you can test each of these individually:
import pandas as pd
import qrcode
def generate_images(df):
    for row in df.itertuples():
        yield row.Filename, generate_qr(row.URL)
def generate_qr(url):
    qr = qrcode.QRCode(version=5, error_correction=qrcode.constants.ERROR_CORRECT_L, box_size=5, border=2)
    qr.add_data(url)
    qr.make(fit=True)
    return qr.make_image()
def save_qr_code(qr_codes):
    for filename, qr_code in qr_codes:
        filename = filename + '.png'
        print('saving to file %s' % (filename,))
        with open(filename, 'wb') as file:
            qr_code.save(file)
df = pd.read_csv('my_data.csv')
qr_codes = generate_images(df)
save_qr_code(qr_codes)
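The brackets in the question come from calling str() on a whole array of values instead of on a single value: z.Link_Short.values is an array with one element, and its string form includes the surrounding brackets. The sketch below reproduces the effect with a plain list standing in for that array (an assumption made so the example runs without pandas) and shows the scalar alternative with the standard csv module. The SAMPLE text is invented, with column names taken from the sample data, and read_rows is a hypothetical helper.

```python
import csv
import io

# Hypothetical comma-separated version of the sample data.
SAMPLE = "URL,Filename\nwww.apple.com,Apple\nwww.google.com,Google\n"

# A one-element container turned into text keeps its brackets (and quotes).
wrapped = str(["www.apple.com"])
print(wrapped)


def read_rows(text):
    """Yield (url, image_file_name) pairs as plain strings, one per row."""
    for row in csv.DictReader(io.StringIO(text)):
        yield row["URL"], row["Filename"] + ".png"


for url, filename in read_rows(SAMPLE):
    print(url, filename)
```

Reading one scalar per row, whether through csv.DictReader here or DataFrame.itertuples in the answer, is what keeps the brackets out of both the encoded URL and the file name.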
