Example of PDF: "Smith#00$Consolidated_Performance.pdf"
The goal is to add a bookmark to page 1 of each PDF based on the filename.
(Bookmark name in example would be "Consolidated Performance")
import os
from openpyxl import load_workbook
from PyPDF2 import PdfFileMerger
cdir = "Directory of PDF" # Current directory
pdfcdir = [filename for filename in os.listdir(cdir) if filename.endswith(".pdf")]
def addbookmark(f):
    output = PdfFileMerger()
    name = os.path.splitext(os.path.basename(f))[0] # Split filename from .pdf extension
    dp = name.index("$") + 1 # Find position of $ sign
    bookmarkname = name[dp:].replace("_", " ") # replace underscores with spaces
    output.addBookmark(bookmarkname, 0, parent=None) # Add bookmark
    output.append(open(f, 'rb'))
    output.write(open(f, 'wb'))
for f in pdfcdir:
    addbookmark(f)
The UDF works fine when applied to individual PDFs, but it won't add the bookmarks when put into the loop at the bottom of the code. Any ideas on how to make the UDF loop through all PDFs within pdfcdir?
I'm pretty sure that the issue you're having has nothing to do with the loop. Rather, you're passing just the filenames and not including the directory path. Python then tries to open these files relative to the current working directory (the directory the script was launched from, which is often but not always the script's own folder) rather than in the directory you read the filenames from.
So, join the directory name with each file name when calling your function.
for f in pdfcdir:
    addbookmark(os.path.join(cdir, f))
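To make the path handling concrete, here is a minimal sketch of the two pieces that do not depend on PyPDF2: deriving the bookmark text from a filename, and building full paths from a directory listing. The helper names `bookmark_name` and `pdf_paths` are made up for this illustration.

```python
import os


def bookmark_name(path):
    """Derive the bookmark text from a name like 'Smith#00$Consolidated_Performance.pdf'."""
    stem = os.path.splitext(os.path.basename(path))[0]  # drop directory and extension
    after_dollar = stem.split("$", 1)[-1]  # text after the first '$' (whole stem if there is none)
    return after_dollar.replace("_", " ")


def pdf_paths(directory):
    """Return full paths (directory joined with filename) for every .pdf in `directory`."""
    return [
        os.path.join(directory, name)
        for name in sorted(os.listdir(directory))
        if name.lower().endswith(".pdf")
    ]
```

Each path returned by `pdf_paths(cdir)` could then be passed to `addbookmark`, which would no longer depend on where the script is launched from.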
Related
I have a function that opens a zip file, finds a PDF with a given filename, then reads the first page of the PDF to get some specific text. My issue is that after I locate the correct file, I can't open it to read it. I have tried using a relative path within the zip folder and an absolute path in my downloads folder, and I keep getting the error:
no such file: 'Deliverables_Rev B\Plans_Rev B.pdf'
no such file: 'C:\Users\MyProfile\Downloads\Deliverables_Rev B\Plans_Rev B.pdf'
I have been commenting out the os.path.join line to change between the relative and absolute path as self.prefs['download_path'] returns my download folder.
I'm not sure what the issue with the relative path is. Any insight would be helpful, as I think it has to do with trying to read out of a zipped folder.
import os
from zipfile import ZipFile
import fitz
def getjobcode(self, filename):
    if '.zip' in filename:
        with ZipFile(filename, 'r') as zipObj:
            for document in zipObj.namelist():
                if 'plans' in document.lower():
                    document = os.path.join(self.prefs['download_path'], document)
                    doc = fitz.open(document)
                    page1 = doc.load_page(0)
                    page1text = page1.get_text('text')
                    jobcode = page1text[page1text.index(
                        'PROJECT NUMBER'):page1text.index('PROJECT NUMBER') + 30][-12:]
                    return jobcode
I ended up extracting the zip folder into the downloads folder, then parsing the PDF to get the data I needed. Afterwards I created a job folder where I wanted it and moved the extracted folder into it from the downloads folder.
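The likely reason the original attempt failed is that the names returned by `namelist()` refer to entries inside the archive, not to files on disk, so nothing exists at that path until the entry is extracted. Below is a minimal sketch of the extract-first approach using only the standard library. The function name `extract_first_match` and its parameters are assumptions made for this example.

```python
import zipfile


def extract_first_match(zip_path, keyword, dest_dir):
    """Extract the first .pdf member whose name contains `keyword`; return its path on disk."""
    with zipfile.ZipFile(zip_path) as archive:
        for member in archive.namelist():
            lowered = member.lower()
            if keyword in lowered and lowered.endswith(".pdf"):
                # extract() writes the member under dest_dir and returns the real path
                return archive.extract(member, path=dest_dir)
    return None  # no matching member found
```

The returned path points at a real file, so it can be handed to a PDF library such as `fitz.open(...)` in the same way as any other file on disk.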
I'm trying to copy all pictures from one directory (including subdirectories) to another target directory. Whenever the exact picture name is found in one of the XML files, the tool should grab all the information (attributes in the parent and child nodes), create subdirectories based on that node information, and rename the picture file.
The part when it extracts all the information from the nodes is already done.
from bs4 import BeautifulSoup as bs
path_xml = r"path\file.xml"
content = []
with open(path_xml, "r") as file:
    content = file.readlines()
content = "".join(content)
def get_filename(_content):
    bs_content = bs(_content, "html.parser")
    # some code
    picture_path = rf'{pm_1}{pm_2}\{pm_3}\{pm_4}\{pm_5}_{pm_6}_{pm_7}\{pm_8}\{pm_9}.jpg'
get_filename(content)
So in the end I get a string value with the directory path and the file name I want.
Now I am struggling with opening all XML files in one directory instead of just one file. I tried this:
import os
dir_xml = r"path"
res = []
for path in os.listdir(dir_xml):
    if os.path.isfile(os.path.join(dir_xml, path)):
        res.append(path)
with open(res, "r") as file:
    content = file.readlines()
but it gives me this error: TypeError: expected str, bytes or os.PathLike object, not list
How can I read through all XML files instead of just one? I have hundreds of XML files, so that will take a while :D
And another question: how can I create directories based on a string?
Let's say the value of picture_path is AB\C\D\E_F_G\H\I.jpg
I would need another directory path for the destination of the created folders and a function that somehow creates folders based on that string. How can I do that?
To read all XML files in a directory, you can modify your code as follows:
import os
dir_xml = r"path"
for path in os.listdir(dir_xml):
    if path.endswith(".xml"):
        with open(os.path.join(dir_xml, path), "r") as file:
            content = file.readlines()
        content = "".join(content)
        get_filename(content)
This code uses the os.listdir() function to get a list of all files in the directory specified by dir_xml. It then uses a for loop to iterate over the list of files, checking if each file ends with the .xml extension. If it does, it opens the file, reads its content, and passes it to the get_filename function.
To create directories based on a string, you can use the os.makedirs function. For example:
import os
picture_path = r'AB\C\D\E_F_G\H\I.jpg'
dest_path = r'path_to_destination'
os.makedirs(os.path.join(dest_path, os.path.dirname(picture_path)), exist_ok=True)
In this code, os.path.join is used to combine the dest_path and the directory portion of picture_path into a full path. os.path.dirname is used to extract the directory portion of picture_path. The os.makedirs function is then used to create the directories specified by the path, and the exist_ok argument is set to True to allow the function to succeed even if the directories already exist.
Finally, you can use the shutil library to copy the picture file to the destination and rename it, like this:
import shutil
src_file = os.path.join(src_path, picture_path)
dst_file = os.path.join(dest_path, picture_path)
shutil.copy(src_file, dst_file)
Here, src_file is the full path to the source picture file and dst_file is the full path to the destination. The shutil.copy function is then used to copy the file from the source to the destination.
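The two steps (create the folders, then copy under the new name) can be combined into one helper. This is only a sketch under a few assumptions: the function name `copy_with_new_name` is invented here, and `picture_path` is assumed to be a relative, backslash-separated string like the one in the question. `PureWindowsPath` is used so that the backslashes are split the same way on every operating system.

```python
import shutil
from pathlib import Path, PureWindowsPath


def copy_with_new_name(src_file, dest_root, picture_path):
    """Copy src_file to dest_root/picture_path, creating the folders as needed."""
    # PureWindowsPath treats '\' as a separator regardless of the OS running the script
    relative = Path(*PureWindowsPath(picture_path).parts)
    target = Path(dest_root) / relative
    target.parent.mkdir(parents=True, exist_ok=True)  # same effect as os.makedirs(..., exist_ok=True)
    shutil.copy(src_file, target)  # the copy gets the new name taken from picture_path
    return target
```

Here `src_file` is the picture's current location and `picture_path` supplies both the new sub-folders and the new file name, so the rename happens as part of the copy.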
You can use os.walk() for recursive search of files:
import os
dir_xml = r"path"
for root, dirs, files in os.walk(dir_xml): #topdown=False
    for names in files:
        if names.endswith(".xml"):
            print(f"file path: {root}\n XML-Files: {names}")
            # os.walk yields bare file names, so join them with root before opening
            with open(os.path.join(root, names), 'r') as file:
                content = file.readlines()
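As an alternative sketch of the same recursive search, `pathlib` can do the walking and the path joining in one step. The function name `read_all_xml` and the UTF-8 encoding are assumptions for this example.

```python
from pathlib import Path


def read_all_xml(directory):
    """Map each .xml file under `directory` (searched recursively) to its text content."""
    contents = {}
    for xml_file in sorted(Path(directory).rglob("*.xml")):
        if xml_file.is_file():
            # Assumes the files are UTF-8 encoded; adjust if yours are not
            contents[xml_file] = xml_file.read_text(encoding="utf-8")
    return contents
```

Each value in the returned dictionary is the full text of one file, which is the form `get_filename` expects.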
I'm very new to Python/programming, and am trying to automate an office task that is very time consuming.
I have multiple folders with PDFs. For each folder I need to combine the PDFs into one PDF, and save it inside the folder whose contents it's the sum of. I've gotten the contents of one folder combined, and saved to my desktop successfully using this:
import PyPDF2
import os
Path = '/Users/jlaw/Desktop/Testing/FolderName/'
filelist = os.listdir(Path)
pdfMerger = PyPDF2.PdfFileMerger(strict=False)
for file in filelist:
    if file.endswith('.pdf'):
        pdfMerger.append(Path+file)
pdfOutput = open('Tab C.pdf', 'wb')
pdfMerger.write(pdfOutput)
pdfOutput.close()
With the below code I'm trying to do the above, but for all the folders in a specific directory. When I run this, I get "Tab C.pdf" files appearing correctly, but I'm unable to open them.
import PyPDF2
import os
Path = '/Users/jlaw/Desktop/Testing/'
folders = os.listdir(Path)
def pdf_merge(filelist, foldername):
    pdfMerger = PyPDF2.PdfFileMerger()
    for file in filelist:
        if file.endswith('.pdf'):
            pdfMerger.append(Path+foldername+"/"+file)
        pdfOutput = open(Path+foldername+'/Tab C.pdf', 'wb')
        pdfMerger.write(pdfOutput)
        pdfOutput.close()
for folder in folders:
    pdf_merge(Path+'/'+folder, folder)
I'm using Python Version: 3.8
The Tab C.pdf files are only 1 KB in size. When I try to open one with Adobe Acrobat, a pop-up says, "There was an error opening this document. This file cannot be opened because it has no pages." If I try Chrome, it will open, but it's just an empty PDF, and Edge (Chromium-based) says, "We can't open this file. Something went wrong."
Any pieces of advice or hints are much appreciated.
The code below works. I'm not experienced enough yet to know why this works while the code above doesn't.
import PyPDF2
import os
Path = 'C:/Users/jlaw/Desktop/Testing/'
folders = os.listdir(Path)
pdfMerger = PyPDF2.PdfFileMerger()
def pdf_merge(filelist): #Changed to just one argument
    pdfMerger = PyPDF2.PdfFileMerger()
    for file in os.listdir(filelist): #added os.listdir()
        if file.endswith('.pdf'):
            pdfMerger.append(filelist+'/'+file) #replaced Path+foldername with filelist
    pdfOutput = open(Path+folder+'/Tab C.pdf', 'wb') #Moved back one tab to prevent infinite loop
    pdfMerger.write(pdfOutput) #Moved back one tab to prevent infinite loop
    pdfOutput.close() #Moved back one tab to prevent infinite loop
for folder in folders:
    pdf_merge(Path+folder) #Removed redundant + "/"
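A likely explanation for the difference, judging from the two versions: the first one passes a folder path (a string) to pdf_merge and loops over it directly, so each `file` is a single character and never ends with '.pdf'. Nothing gets appended, and the merger writes an empty PDF. The second version loops over `os.listdir(...)`, which yields the file names inside the folder. The small demonstration below uses a temporary folder created only for this example.

```python
import os
import tempfile

# Hypothetical folder with one PDF, created only for this demonstration
folder = tempfile.mkdtemp()
open(os.path.join(folder, "report.pdf"), "wb").close()

# Looping over the path string yields single characters, so nothing matches
from_string = [item for item in folder if item.endswith(".pdf")]

# Looping over os.listdir(folder) yields the file names inside the folder
from_listing = [item for item in os.listdir(folder) if item.endswith(".pdf")]

print(from_string)   # []
print(from_listing)  # ['report.pdf']
```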
I am trying to rename a set of files in a directory using Python. The files are currently labelled with a Pool number, AR number and S number (e.g. Pool1_AR001_S13__fw_paired.fastq.gz). Each file refers to a specific plant sequence name. I would like to rename these files by removing the 'Pool_AR_S' part and replacing it with the sequence name, e.g. 'Lbienne_dor5_GS1', while leaving the suffix (e.g. fw_paired.fastq.gz, rv_unpaired.fastq.gz) unchanged. I am trying to read the files into a dictionary, but I am stuck as to what to do next. I have a .txt file containing the necessary information in the following format:
Pool1_AR010_S17 - Lbienne_lla10_GS2
Pool1_AR011_S18 - Lbienne_lla10_GS3
Pool1_AR020_S19 - Lcampanulatum_borau4_T_GS1
The code I have so far is:
from optparse import OptionParser
import csv
import os
parser = OptionParser()
parser.add_option("-w", "--wanted", dest="w")
parser.add_option("-t","--trimmed", dest="t")
parser.add_option("-d", "--directory", dest="working_dir", default="./")
(options, args) = parser.parse_args()
wanted_file = options.w
trimmomatic_output = options.t
#Read the wanted file and create a dictionary of index vs species identity
with open(wanted_file, 'rb') as species_sequence:
    species_list = list(csv.DictReader(species_sequence, delimiter='-'))
    print species_list
#Rename the Trimmomatic Output files according to the dictionary
for trimmed_sequence in os.listdir(trimmomatic_output):
    os.rename(os.path.join(trimmomatic_output, trimmed_sequence),
              os.path.join(trimmomatic_output, trimmed_sequence.replace(species_list[0], species_list[1])))
Please can you help me to replace half of the filename? I'm very new to Python and to Stack Overflow, so I am sorry if this question has been asked before or if I have asked this in the wrong place.
First job is to get rid of all those modules. They may be nice, but for a job like yours they are very unlikely to make things easier.
Create a .py file in the directory where those .gz files reside.
import os
files = os.listdir() # files is a list
# 'txt_file' is the path of your .txt file containing those conversions
dic = parse_txt(txt_file) # body of parse_txt() omitted; it should return a dictionary built by parsing that .txt file
for f in files:
    if '__' not in f:
        continue # skip this script and any other file without the double underscore
    # "Pool1_AR001_S13__fw_paired.fastq.gz": assumes prefix and suffix are divided by a double underscore
    pre, suf = f.split('__')
    pre = dic[pre]
    os.rename(f, pre + '__' + suf)
If you need help with parse_txt() function, let me know.
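For reference, one possible shape for the omitted `parse_txt()` is sketched below. It assumes each line looks like the sample in the question, with the two names separated by a hyphen that has a space on each side, and it splits only on that first separator.

```python
def parse_txt(txt_file):
    """Read lines like 'Pool1_AR010_S17 - Lbienne_lla10_GS2' into a dict."""
    mapping = {}
    with open(txt_file) as handle:
        for line in handle:
            # partition() splits on the first ' - ' only; sep is '' when the line has none
            old, sep, new = line.partition(" - ")
            if sep:
                mapping[old.strip()] = new.strip()
    return mapping
```

Blank lines and lines without the separator are skipped rather than raising an error.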
Here is a solution that I tested with Python 2. It's fine if you use your own logic instead of the get_mappings function. Refer to the comments in the code for an explanation.
import os
def get_mappings():
    mappings_dict = {}
    with open('wanted_file.txt', 'r') as f:
        for line in f:
            # if you have Pool1_AR010_S17 - Lbienne_lla10_GS2
            # it becomes a list i.e ['Pool1_AR010_S17 ', ' Lbienne_lla10_GS2']
            # note that there may be spaces before/after the names as shown above
            text = line.split('-')
            # strip is used to remove spaces in the names
            mappings_dict[text[0].strip()] = text[1].strip()
    return mappings_dict
#PROGRAM EXECUTION STARTS FROM HERE
#assuming all files are in the current directory
# if not replace the dot(.) with the path of the directory where you have the files
files = os.listdir('.')
wanted_names_dict = get_mappings()
for filename in files:
    try:
        #prefix='Pool1_AR010_S17', suffix='fw_paired.fastq.gz'
        prefix, suffix = filename.split('__')
        new_filename = wanted_names_dict[prefix] + '__' + suffix
        os.rename(filename, new_filename)
        print 'renamed', filename, 'to', new_filename
    except (ValueError, KeyError):
        # ValueError: no '__' in the name; KeyError: prefix missing from the mapping
        print 'No new name defined for file:' + filename
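For readers on Python 3, the rename loop could be sketched as a function that takes the directory and the mapping explicitly. The name `rename_with_mapping` and its return value are choices made for this illustration, not part of the answer above.

```python
import os


def rename_with_mapping(directory, mapping):
    """Rename '<prefix>__<suffix>' files in `directory` to '<mapping[prefix]>__<suffix>'."""
    renamed = []
    for filename in sorted(os.listdir(directory)):
        prefix, sep, suffix = filename.partition("__")
        if not sep or prefix not in mapping:
            continue  # name does not follow the pattern, or no new name is defined
        new_filename = mapping[prefix] + "__" + suffix
        os.rename(os.path.join(directory, filename),
                  os.path.join(directory, new_filename))
        renamed.append((filename, new_filename))
    return renamed
```

Returning the list of (old, new) pairs makes it easy to log or review what was changed; files that do not match are left untouched.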
I have a directory containing multiple files.
The names of the files follow this pattern: 4digits.1.4digits.[barcode]
The barcode identifies each file and is composed of 7 letters.
I have a txt file where one column holds the barcode and the other column holds the real name of the file.
What I would like to do is write a Python script that automatically renames each file, according to its barcode, to the new name written in the txt file.
Is there anybody that could help me?
Thanks a lot!
I will give you the logic:
1. Read the text file that contains the barcodes and names. See http://www.pythonforbeginners.com/files/reading-and-writing-files-in-python.
For each line in the txt file, do as follows:
2. Assign the values in the first (barcode) and second (name) columns to two separate variables, say 'B' and 'N'.
3. Find the filename that has the barcode 'B' in it. The question "Find a file in python" will help you do that (first answer, third example; in your case the pattern to search for will be like '*B').
4. The previous step gives you the filename that contains B. Now use the rename() function to rename the file to 'N'. This link will help you: http://www.tutorialspoint.com/python/os_rename.htm
Suggestion: instead of a txt file with two columns, you could use a csv file, which would be easier to handle.
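The steps above can be sketched as follows, assuming the csv layout from the suggestion with one `barcode,new_name` pair per line. Rather than searching with a wildcard for every barcode, this version builds a dictionary once and looks up the text after the last dot of each filename. The function name `rename_by_barcode` is invented for the example.

```python
import csv
import os


def rename_by_barcode(directory, table_path):
    """Rename every file in `directory` whose name ends with a barcode listed in the table."""
    with open(table_path, newline="") as handle:
        # Each row is assumed to be: barcode,new_name
        names = {row[0].strip(): row[1].strip()
                 for row in csv.reader(handle) if len(row) >= 2}
    renamed = {}
    for filename in sorted(os.listdir(directory)):
        barcode = filename.rsplit(".", 1)[-1]  # text after the last dot
        if barcode in names:
            os.rename(os.path.join(directory, filename),
                      os.path.join(directory, names[barcode]))
            renamed[filename] = names[barcode]
    return renamed
```

Files whose barcode is not in the table are skipped, and the returned dictionary records which names were changed.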
The following code will do the job for your specific use case, though you can make it a more general-purpose renamer.
import os # os is a library that gives us the ability to make OS changes
def file_renamer(list_of_files, new_file_name_list):
    for file_name in list_of_files:
        for (new_filename, barcode_infile) in new_file_name_list:
            # as per the mentioned filename pattern -> xxxx.1.xxxx.[barcode]
            barcode_current = file_name[12:19] # extracting the barcode from current filename
            if barcode_current == barcode_infile:
                os.rename(file_name, new_filename) # renaming step
                print 'Successfully renamed %s to %s ' % (file_name, new_filename)
if __name__ == "__main__":
    path = os.getcwd() # preassuming that you'll be executing the script while in the files directory
    file_dir = os.path.abspath(path)
    newname_file = raw_input('enter file with new names - or the complete path: ')
    path_newname_file = os.path.join(file_dir, newname_file)
    new_file_name_list = []
    with open(path_newname_file) as file:
        for line in file:
            x = line.strip().split(',')
            new_file_name_list.append(x)
    list_of_files = os.listdir(file_dir)
    file_renamer(list_of_files, new_file_name_list)
Assumptions:
newnames.txt is comma-separated, with one "new name,barcode" pair per line:
0000.1.0000.1234567,1234567
0000.1.0000.1234568,1234568
0000.1.0000.1234569,1234569
0000.1.0000.1234570,1234570
0000.1.0000.1234571,1234571
Files
1111.1.0000.1234567
1111.1.0000.1234568
1111.1.0000.1234569
were renamed to
0000.1.0000.1234567
0000.1.0000.1234568
0000.1.0000.1234569
The terminal output:
>python file_renamer.py
enter file with new names: newnames.txt
The list of files - ['.git', '.idea', '1111.1.0000.1234567', '1111.1.0000.1234568', '1111.1.0000.1234569', 'file_renamer.py', 'newnames.txt.txt']
Successfully renamed 1111.1.0000.1234567 to 0000.1.0000.1234567
Successfully renamed 1111.1.0000.1234568 to 0000.1.0000.1234568
Successfully renamed 1111.1.0000.1234569 to 0000.1.0000.1234569