Read all PDFs in a directory (image) - python

I have attached an image to help show what I've done. I'm trying to write a program that will add a blank page to every PDF in the directory that has an odd number of pages. However, I can't seem to read all the PDFs in a directory.
The script I have works on a single PDF, but I have thousands of these to do. Why can't I read all the PDFs in the user_input directory?
Screenshot of the code and error here. The code is:
from PyPDF2 import PdfFileReader, PdfFileWriter, PdfFileMerger
import os
user_input = input("Enter the path of your file: ")
files = os.listdir(user_input)
for file in files:
    print(file)
    pdfReader = PdfFileReader(open(files, 'rb'))

Use the following code. It will give you a list of all the PDF files in the directory:
import glob, os
def readfiles(path):
    os.chdir(path)
    pdfs = []
    for file in glob.glob("*.pdf"):
        print(file)
        pdfs.append(file)
    return pdfs
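If you also want to handle the blank-page padding from the question, here is a minimal sketch, assuming the legacy PyPDF2 API (PdfFileReader/PdfFileWriter, pre-3.0) and writing each padded copy next to the original with a hypothetical padded_ prefix:
from PyPDF2 import PdfFileReader, PdfFileWriter
import os
folder = input("Enter the path of your folder: ")
for name in os.listdir(folder):
    if not name.lower().endswith(".pdf"):
        continue
    full_path = os.path.join(folder, name)  # full path, not just the bare filename
    reader = PdfFileReader(open(full_path, 'rb'))
    if reader.getNumPages() % 2 == 1:  # odd page count, needs a blank page
        writer = PdfFileWriter()
        writer.appendPagesFromReader(reader)
        writer.addBlankPage()  # blank page at the end, same size as the last page
        out_path = os.path.join(folder, "padded_" + name)  # hypothetical output name
        with open(out_path, 'wb') as out:
            writer.write(out)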

In order to process every PDF file in the folder, you need a few things.
get to the right directory
get all files
get only the PDF files
OS is perfect for this. It can get all the files and then let you decide what to do with them. One problem I had (it may be yours as well) was that my path had spaces in it: the input contained backslash-escaped spaces ("something\ long\ with\ spaces/abcd/pdf\ folder"), and os.chdir() took the backslashes literally, so the result was not a valid path. Removing the "\" characters from the original user input worked just fine. Let me know if you need any further help.
import os
os.chdir(input("enter the path: ").replace("\\", ""))
print(os.listdir("."))
for file in os.listdir("."):
    if file.endswith(".pdf"):
        print(file)
        process(file)  # do whatever it is you need to here
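For what it's worth, a pathlib variant avoids chdir (and the escaped-space issue) entirely; just a sketch, with process() standing in for whatever you need to do with each file:
from pathlib import Path
folder = Path(input("enter the path: ").strip())
for pdf in folder.glob("*.pdf"):
    print(pdf)       # pdf is already a full path, no chdir needed
    # process(pdf)   # hypothetical: your own handling goes here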

Is the .py file in the same directory as the PDFs? If not, you will need the full path to read each file, not just the bare filename that os.listdir returns.
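A quick sketch of joining the directory and the filename before opening (folder here is whatever the user entered):
import os
folder = input("Enter the path of your folder: ")
for name in os.listdir(folder):             # os.listdir gives bare filenames
    full_path = os.path.join(folder, name)  # join with the folder before opening
    print(full_path)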

python: set file path to only point to files with a specific ending

I am trying to run a program which requires pVCF files alone as inputs. Due to the size of the data, I am unable to create a separate directory containing only the particular files that I need.
The directory contains multiple files with 'vcf.gz.tbi' and 'vcf.gz' endings. Using the following code:
file_url = "file:///mnt/projects/samples/vcf_format/*.vcf.gz"
I tried to create a file path that only grabs the '.vcf.gz' files while excluding the '.vcf.gz.tbi' files, but I have been unsuccessful.
The code you have, as written, is just assigning your file path to the variable file_url. For something like this, glob is popular but isn't the only option:
import glob, os
file_url = "file:///mnt/projects/samples/vcf_format/"
os.chdir(file_url.replace("file://", ""))  # os.chdir needs a plain filesystem path, not a URL
for file in glob.glob("*.vcf.gz"):
    print(file)
Note that the path itself no longer names the kind of file you want (in this case, a gzipped VCF); the glob pattern in the for loop takes care of that.
Check out this answer for more options.
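If you'd rather not change directories at all, a plain os.listdir filter is another option; a sketch, assuming the local path behind the file:// URL:
import os
local_dir = "/mnt/projects/samples/vcf_format"  # assumed local path behind the file:// URL
vcfs = [f for f in os.listdir(local_dir) if f.endswith(".vcf.gz")]  # '.vcf.gz.tbi' files don't match
print(vcfs)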
It took some digging but it looks like you're trying to use the import_vcf function of Hail. To put the files in a list so that it can be passed as input:
import glob, os
import hail as hl
file_url = "file:///mnt/projects/samples/vcf_format/"
def get_vcf_list(path):
    vcf_list = []
    os.chdir(path.replace("file://", ""))  # glob needs the local path, not the file:// URL
    for file in glob.glob("*.vcf.gz"):
        vcf_list.append(path + file)  # keep the file:// prefix for Hail
    return vcf_list
# Now you pass 'get_vcf_list(file_url)' as your input instead of 'file_url'
mt = hl.import_vcf(get_vcf_list(file_url), force_bgz=True, reference_genome="GRCh38", array_elements_required=False)

File based strings/variables to set file path etc in python operation

I am trying to create part of a program that will take the values found in two CFG files and use them to determine what file type to search for as well as what folder location to use. The code I found online sort of suits my needs; however, I would like to avoid a hard-coded file path. Here is the code I have modified so far:
import glob
location = open("config.cfg", encoding = 'cp1252')
location = location.read()
filetype = open("filetype.cfg", encoding = 'cp1252')
filetype = filetype.read()
fileset = [file for file in glob.glob(location + filetype, recursive=True)]
print(location)
print(filetype)
for file in fileset:
    print(file)
The config.cfg contains one line, which is the file path to a folder with 3 sample JPG files in it.
C:/test
The filetype.cfg contains one line as well, which is the file type to search for
"**/*.jpg"
I've gotten to the point where this code throws no errors, but it doesn't work as intended either: it seems to read the config files properly, yet it doesn't list the files in the folder. config.cfg contains the folder path, i.e. C:/test, while filetype.cfg contains "**/*.jpg", which is the type of file I would like searched for. I found the original code here: https://www.techbeamers.com/python-list-all-files-directory/ (look under the 'glob' method).
The original (fully working) code from the link above:
import glob
location = 'c:/test/temp/'
fileset = [file for file in glob.glob(location + "**/*.py", recursive=True)]
for file in fileset:
    print(file)
Using Python 3.8 64bit on Windows 10.
Moved from the OP's edit to the question into an answer.
Remove the quotes around "**/*.jpg" in the filetype.cfg file:
**/*.jpg
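Alternatively, if you can't edit the cfg files, you could strip the quotes (and the trailing newline) when reading them; a rough sketch using the same two one-line files:
import glob
# Read both one-line config files; strip the newline and any surrounding quotes
with open("config.cfg", encoding='cp1252') as f:
    location = f.read().strip().strip('"')
with open("filetype.cfg", encoding='cp1252') as f:
    filetype = f.read().strip().strip('"')
# Join with a slash so "C:/test" + "**/*.jpg" becomes a valid recursive pattern
fileset = glob.glob(location.rstrip("/") + "/" + filetype, recursive=True)
for file in fileset:
    print(file)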

Moving Files: Matching Partial File/Directory Criteria (lastName, firstName) - Glob, Shutil

EDIT: ANSWER. Below is the answer to the question. I will leave all the subsequent text there just to show how difficult I made such an easy task.
from pathlib import Path
import shutil
base = "C:/Users/Kenny/Documents/Clients"
for file in Path("C:/Users/Kenny/Documents/Scans").iterdir():
    name = file.stem.split('-')[0].rstrip()  # the "lastName, firstName" part of the filename
    subdir = Path(base, name)                # the matching client folder
    if subdir.exists():
        dest = Path(subdir, file.name)
        shutil.move(file, dest)
Preface:
I'm trying to write code that will move hundreds of PDF files from a :/Scans folder into another directory based on the matching client's name. This question is linked below; a very kind person, Elis Byberi, helped me correct my original code. I'm encountering another problem though.
To see our discussion and a similar question discussed:
-Python- Move All PDF Files in Folder to NewDirectory Based on Matching Names, Using Glob or Shutil
Python move files from directories that match given criteria to new directory
Question: How can you move all of the named files in :/Scans to their appropriately matched folders in :/Clients?
Background: Here is a breakdown of my file folders to give you a better idea of what I'm trying to do.
Within :/Scans folder I have thousands of PDF files, manually renamed (I tried writing a program to auto-rename.. didn't work) based on client and content, such that the folder encloses PDFs labeled as follows:
lastName, firstName - [contentVariable]
(repeat the above 100,000x)
Within the :/C drive of my computer I have a folder named 'Clients' with sub-folders for each and every client, named similar to the pattern above, as 'lastName, firstName'
EDIT: The code below will move the entire Scans folder to the Clients folder, which is close, but not exactly what I need. I only need to move the files within Scans to the corresponding client folders.
import glob
import shutil
import os
source = "C:/Users/Kenny/Documents/Scans"
dest = "C:/Users/Kenny/Documents/Clients"
os.chdir("C:/Users/Kenny/Documents/Clients")
pattern = '*,*'
for x in glob.glob(pattern):
    fileName = os.path.join(source, x)
    print(fileName)
    shutil.move(source, dest)
EDIT 2 - CLOSE!: The code below will move all the files in Scans to the Clients folder, which is close, but not exactly what I need to be doing. I need to get each file into the correct corresponding file folder within the Clients folder.
This is a step forward from moving the entire Scans folder I would think.
source = "C:/Users/Kenny/Documents/Scans"
dest = "C:/Users/Kenny/Documents/Clients"
for (dirpath, dirnames, filenames) in walk(source):
for file in filenames:
shutil.move(path.join(dirpath,file), dest)
I have the following code below as well, and I am aware it does not do what I want it to do, so I am definitely missing something.
import glob
import shutil
import os
path = "C:/Users/Kenny/Documents/Scans"
dirs = os.listdir(path)
for file in dirs:
    print(file)
dest_dir = "C:/Users/Kenny/Documents/Clients/{^w, $w}?"
for file in glob.glob(r'C:Users/Kenny/Documents/Clients/{^w, $w}?'):
    print(file)
    shutil.move(file, dest_dir)
1) Should I use os.scandir instead of os.listdir?
2) Am I moving in the correct direction if I modify the code as such:
import glob
import shutil
import os
path = "C:/Users/Kenny/Documents/Scans"
dirs = os.scandir(path)
for file in dirs:
    print(file)
dest_dir = "C:/Users/Kenny/Documents/Clients/*"
for file in glob.glob(r'C:Users/Kenny/Documents/Clients, *'):
    dest_dir = os.path.join(file, glob.glob)
    shutil.move(file, dest_dir)
Note: within for file in glob.glob(r'C:Users/Kenny/Documents/Clients/{^w, $w}?') I have tried replacing 'Clients/{^w, $w}?' with just 'Clients/*'.
For the above, I only need the files in :/Scans, written as "lastName, firstName - [content]", to be matched and moved to /Clients/[lastName, firstName]; the [content] does not matter. But there are both greedy and non-greedy expressions, which is why I'm unsure about using * or {^w, $w}?, because we have clients with the same last names but different first names.
The following errors are generated when running the first command (screenshots: Error 1, Error 2).
Running the second command produces the output in the Error 3 screenshot (though there is no actual error?).
EDIT/POSSIBLE ANSWER
I have not yet tested this, but fnmatch.fnmatch(filename, pattern) can be used to test whether the filename string matches the pattern string, returning True or False (fnmatch.translate(pattern) converts the pattern into a regular expression).
From here perhaps you could write a conditional statement..
for file in os.listdir('.'):
    if fnmatch.fnmatch(file, '*.txt'):
        shutil.move(source, destination)
or
for file in os.listdir('.'):
    if fnmatch.fnmatch(file, '*.txt'):
        shutil.move(file.join(eachFile, source), destination)
I have not tested the two snippets above. I have no idea if they work, but editing allows others to see how my train of thought is progressing.
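A sketch of how that conditional might look, assuming the 'lastName, firstName - [content]' naming and the paths used earlier (untested, like the snippets above):
import fnmatch, os, shutil
source = "C:/Users/Kenny/Documents/Scans"
dest_base = "C:/Users/Kenny/Documents/Clients"
for name in os.listdir(source):
    if not fnmatch.fnmatch(name, "*, * - *.pdf"):  # "lastName, firstName - content.pdf"
        continue
    client = name.split(" - ")[0].strip()          # "lastName, firstName"
    client_dir = os.path.join(dest_base, client)
    if os.path.isdir(client_dir):                  # move only if the client folder exists
        shutil.move(os.path.join(source, name), os.path.join(client_dir, name))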

Copy files (as backup) and change original file names (rearranging contents)

I'm a total Python noob, but I want to learn it and integrate it into my workflow.
I have about 400 files whose filenames contain 4 different parts separated by underscores:
-> Version_Date_ProjectName_ProjectNumber
As we always look at the project number first, we rearranged the filename parts for new projects to:
-> ProjectNumber_Version_ProjectName
My problem now is that I'd like to rename all the existing files to the new format while having them backed up in a subdirectory called "Archiv".
It just has to be a simple script that I put in the directory; every file in this directory should be copied as a backup and renamed to the new format.
EDIT:
My first step was to create a subfolder within the source directory, and it worked somehow. But now I see that I only need to back up the files with a specific file extension.
import os, shutil
src_dir= os.curdir
dst_dir= os.path.join(os.curdir, "Archiv")
shutil.copytree(src_dir, dst_dir)
I tried to extend the code with the solutions from here, but it doesn't work out. :/
import os
import shutil
import glob
src_path = "YOU_SOURCE_PATH"
dest_path = "YOUR DESTINATION PATH"
if not os.path.exists(dest_path):
    os.makedirs(dest_path)
files = glob.iglob(os.path.join(src_dir, "*.pdf"))
for file in files:
    if os.path.isfile(file):
        shutil.copy2(file, dest_path)
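Note that the snippet above globs on src_dir while the variable is named src_path, which by itself would raise a NameError. A rough sketch of the whole task, assuming the files are PDFs and the names always have exactly the four underscore-separated parts described above:
import os, shutil
src_dir = os.curdir
archive_dir = os.path.join(src_dir, "Archiv")
os.makedirs(archive_dir, exist_ok=True)
for name in os.listdir(src_dir):
    if not name.lower().endswith(".pdf"):
        continue
    stem, ext = os.path.splitext(name)
    parts = stem.split("_")
    if len(parts) != 4:  # expect Version_Date_ProjectName_ProjectNumber
        continue
    version, date, project_name, project_number = parts
    new_name = "_".join([project_number, version, project_name]) + ext
    shutil.copy2(os.path.join(src_dir, name), archive_dir)                   # back up the original first
    os.rename(os.path.join(src_dir, name), os.path.join(src_dir, new_name))  # then rename in place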

I want to use Python to walk through directories to get to text files and process them

I have many folders in a directory:
/home/me/Documents/coverage
/coverage contains 50 folders all beginning with H:
/home/me/Documents/coverage/H1 (etc)
In each H*** folder there is a text file which I need to extract data from.
I have been trying to use glob and os.walk in a script saved in /coverage to walk into each of these H folders, open the .txt file, and process it, but I have had no luck at all.
Would this be a good starting point? (where path = /coverage)
for filename in glob.glob(os.path.join(path, "H*")):
    folder = open(glob.glob(H*))
And then try and open the .txt file?
Just gather all the txt files in one shot using glob wildcards. You can do it like this:
import glob
path = "/home/me/Documents/coverage/H*/*.txt"
for filename in glob.glob(path):
    fileStream = open(filename)
cheers
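If you then want to read and process each file, a with block keeps the file handling tidy; a sketch, with the actual processing left to you:
import glob
pattern = "/home/me/Documents/coverage/H*/*.txt"
for filename in glob.glob(pattern):
    with open(filename) as f:  # the with block closes the file automatically
        data = f.read()
        # extract whatever you need from 'data' here
        print(filename, len(data))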
