Search in directory and subdirectories for missing files - python

I am trying to search through a directory and associated subdirectories to see if these listed jpg files are missing. I have got it looping through one directory but cannot extend the search into any subdirectories.
I have tried using os.walk but it just loops through all files and repeats that all files are missing even if they are not. So I am not sure how to proceed.
This is the code that I have so far.
from os.path import isfile

source = 'path_to_file'
paths = ['Hello', 'Hi', 'Howdy']
for index, item in enumerate(paths):
    paths[index] = (source + '\\' + paths[index]+'.jpg')
mp = [path for path in paths if not isfile(path)]
for nl in mp:
    print(f'{nl}... is missing')

Since you could not get the output you wanted with os.walk, here is a solution.
Using os.walk, I search the whole directory tree and append every file name to a list called emty_list. Then I check whether each item in the list file_name is in emty_list.
import os

source = r'path'
emty_list = []
file_name = ['hello.jpg', 'Hi.jpg', 'Howdy.jpg']
for root, dirs, files in os.walk(f"{source}", topdown=False):  # Listing Directory
    for name in files:
        emty_list.append(name)
for check in file_name:
    if check not in emty_list:
        print(f"File Not Found Error : File Name: {check}")
Note: The comparison is case-sensitive, so check whether the file on your system is named, for example, Hello.jpg rather than hello.jpg.
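If the case mismatch mentioned in the note is a concern, the same os.walk idea can be wrapped in a small function that compares names case-insensitively. This is only a sketch: the function name find_missing and its ignore_case parameter are made up for illustration, and it assumes that matching on the bare file name (ignoring which subfolder it sits in) is what you want.

```python
import os

def find_missing(source, wanted, ignore_case=True):
    """Return the names in `wanted` that are not found anywhere under `source`."""
    found = set()
    # Walk the tree once and remember every file name that exists
    for _root, _dirs, files in os.walk(source):
        for name in files:
            found.add(name.lower() if ignore_case else name)
    # Keep the original spelling of each wanted name in the result
    return [
        name for name in wanted
        if (name.lower() if ignore_case else name) not in found
    ]

if __name__ == "__main__":
    for name in find_missing("path", ["Hello.jpg", "Hi.jpg", "Howdy.jpg"]):
        print(f"{name}... is missing")
```

Using a set instead of a list keeps each lookup cheap when the directory tree is large.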

You can use glob with its recursive parameter to do that in Python:
import glob

source='./'
paths=['file1','file2','file3']
for path in paths:
    print(f"looking for {path} with {source+'**/'+path+'.jpg'}")
    print(glob.glob(source+"**/"+path+".jpg",recursive=True))
mp=[path for path in paths if not glob.glob(source+"**/"+path+".jpg",recursive=True)]
for nl in mp:
    print(f'{nl}... is missing')
(You can remove the first for loop, the one with the two print calls; it is only there to show how glob works. The list comprehension by itself is enough.)
With the following folder :
.
├── file1.jpg
├── search.py
└── subfolder
    └── file3.jpg
It returns :
looking for file1 with ./**/file1.jpg
['./file1.jpg']
looking for file2 with ./**/file2.jpg
[]
looking for file3 with ./**/file3.jpg
['./subfolder/file3.jpg']
file2... is missing

Related

Python3 solution to create all parent directories and file from a relative path [Windows]

I am trying to create all non-existing directories and the file from a relative path but am unable to do so.
Example:
import os
path = os.path.join("folder1", "folder2", "folder3", "file.txt")
os.makedirs(path) # this creates a directory called 'file.txt' instead of a file.
I would like to have the following:
folder1 > folder2 > folder3 > file.txt
Note: Would be great if anyone has any one-liner solutions for this.

As far as I know, no built-in one-liner exists, but with this helper function the call becomes a one-liner:
import os

def create_file_in_folders(p):
    # Everything before the last path separator is the folder chain
    folders = os.path.dirname(p)
    if folders:
        os.makedirs(folders, exist_ok=True)
    with open(p, "w") as f:
        f.write("")

p = os.path.join("folder2", "folder2", "text.txt")
create_file_in_folders(p)
# now you should find an empty txt file in the newly created folders
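For comparison, here is a minimal pathlib sketch of the same idea. The function name create_file_with_parents is hypothetical; Path.mkdir(parents=True, exist_ok=True) and Path.touch() are standard-library calls.

```python
from pathlib import Path

def create_file_with_parents(path):
    """Create an empty file at `path`, making any missing parent folders first."""
    target = Path(path)
    # parents=True builds the whole chain; exist_ok=True tolerates existing folders
    target.parent.mkdir(parents=True, exist_ok=True)
    # touch() creates the file if it is absent and leaves existing content alone
    target.touch()
    return target
```

It could be called as create_file_with_parents(Path("folder1") / "folder2" / "folder3" / "file.txt"). Unlike opening the file with mode "w", touch() does not truncate a file that already exists.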

Get files from specific folders in python

I have the following directory structure with the following files:
Folder_One
├─file1.txt
├─file1.doc
└─file2.txt
Folder_Two
├─file2.txt
├─file2.doc
└─file3.txt
I would like to get only the .txt files from each folder listed. Example:
Folder_One-> file1.txt and file2.txt
Folder_Two-> file2.txt and file3.txt
Note: This entire directory is inside a folder called dataset. My code looks like this, but I believe something is missing. Can someone help me?
import glob
import os

path_dataset = "./dataset/"
filedataset = os.listdir(path_dataset)
for i in filedataset:
    pasta = ''
    pasta = pasta.join(i)
    for file in glob.glob(path_dataset+"*.txt"):
        print(file)

from pathlib import Path

for path in Path('dataset').rglob('*.txt'):
    print(path.name)
Using glob
import glob

for x in glob.glob('dataset/**/*.txt', recursive=True):
    print(x)
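To get the per-folder listing the question asks for (Folder_One-> file1.txt and file2.txt), the matches can be grouped by their parent folder. The following is a sketch rather than a definitive answer: the function name txt_files_by_folder is invented here, and it assumes the .txt files sit exactly one level below dataset, as in the question.

```python
from collections import defaultdict
from pathlib import Path

def txt_files_by_folder(dataset_dir):
    """Map each immediate subfolder name to the sorted .txt file names inside it."""
    grouped = defaultdict(list)
    # "*/*.txt" only looks one level down, i.e. dataset/<folder>/<file>.txt
    for path in Path(dataset_dir).glob("*/*.txt"):
        if path.is_file():
            grouped[path.parent.name].append(path.name)
    return {folder: sorted(names) for folder, names in sorted(grouped.items())}

if __name__ == "__main__":
    for folder, names in txt_files_by_folder("dataset").items():
        print(f"{folder}-> {' and '.join(names)}")
```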

You can use the re module to check that a filename ends with .txt.
import re
import os

path_dataset = "./dataset/"
l = os.listdir(path_dataset)
for e in l:
    if os.path.isdir("./dataset/" + e):
        ll = os.listdir(path_dataset + e)
        for file in ll:
            if re.match(r".*\.txt$", file):
                print(e + '->' + file)

Another option is to find all the files with the os module alone (an advantage if you already use this module):
import os

# get the current directory; you may also provide an absolute path
path = os.getcwd()
# walk recursively through all folders and gather information
for root, dirs, files in os.walk(path):
    # check if file is of correct type (name ends with .txt)
    check = [f for f in files if f.endswith(".txt")]
    if check:
        print(root, check)

How to create tar.gz archive in Python/tar without include parent directory?

I have a FolderA which contains FolderB and FileB. How can I create a tar.gz archive which ONLY contains FolderB and FileB, removing the parent directory FolderA? I'm using Python and I'm running this code on a Windows machine.
The best lead I found was: How to create full compressed tar file using Python?
In the most upvoted answer, people discuss ways to remove the parent directory, but none of them work for me. I've tried arcname, os.walk, and running the tar command via subprocess.call().
I got close with os.walk, but in the code below, it still drops a " _ " directory in with FolderB and FileB. So, the file structure is ARCHIVE.tar.gz > ARCHIVE.tar > "_" directory, FolderB, FileB.
def make_tarfile(output_filename, source_dir):
    with tarfile.open(output_filename, "w:gz") as tar:
        length = len(source_dir)
        for root, dirs, files in os.walk(source_dir):
            folder = root[length:] # path without "parent"
            for file in files:
                tar.add(os.path.join(root, folder), folder)
I make the archive using:
make_tarfile('ARCHIVE.tar.gz', r'C:\FolderA')
Should I carry on using os.walk, or is there any other way to solve this?
Update
Here is an image showing the contents of my archive. As you can see, there is a " _ " folder in my archive that I want to get rid of--oddly enough, when I extract, only FolderA and FileB.html appear as archived. In essence, the behavior is correct, but if I could go the last step of removing the " _ " folder from the archive, that would be perfect. I'm going to ask an updated question to limit confusion.

This works for me:
with tarfile.open(output_filename, "w:gz") as tar:
    for fn in os.listdir(source_dir):
        p = os.path.join(source_dir, fn)
        tar.add(p, arcname=fn)
That is, just list the root of the source directory and add each entry to the archive. There is no need to walk the source directory, because adding a directory via tar.add() is automatically recursive.
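Wrapped into the make_tarfile function from the question, the snippet above might look like the following sketch. It assumes the output archive is written somewhere outside source_dir; otherwise the half-written archive would be listed and added to itself.

```python
import os
import tarfile

def make_tarfile(output_filename, source_dir):
    """Archive the contents of `source_dir` without the directory itself."""
    with tarfile.open(output_filename, "w:gz") as tar:
        for entry in sorted(os.listdir(source_dir)):
            # arcname drops the parent path, so entries sit at the archive root;
            # directories are added recursively by tar.add()
            tar.add(os.path.join(source_dir, entry), arcname=entry)
```

Opening the result with tarfile.open(...).getnames() is a quick way to confirm that no parent folder made it into the archive.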

I've tried to provide some examples of how changes to the source directory argument affect what finally gets extracted.
As per your example, I have this folder structure
I have this python to generate the tar file (lifted from here)
import tarfile
import os

def make_tarfile(output_filename, source_dir):
    with tarfile.open(output_filename, "w:gz") as tar:
        tar.add(source_dir, arcname=os.path.basename(source_dir))
What data and structure is included in the tar file depends on what location I provide as a parameter.
So this location parameter,
make_tarfile('folder.tar.gz','folder_A/' )
will generate this result when extracted
If I move into folder_A and reference folder_B,
make_tarfile('folder.tar.gz','folder_A/folder_B' )
This is what the extract will be,
Notice that folder_B is the root of this extract.
Now finally,
make_tarfile('folder.tar.gz','folder_A/folder_B/' )
Will extract to this
Just the file is included in the extract.

Here is a function to perform the task. I have had some issues extracting the tar on Windows (with WinRar) as it seemed to try to extract the same file twice, but I think it will work fine when extracting the archive properly.
"""
The directory structure I have is as follows:
├───FolderA
│   │   FileB
│   │
│   └───FolderB
│           FileC
"""
import tarfile
import os

# This is where I stored FolderA on my computer
ROOT = os.path.join(os.path.dirname(__file__), "FolderA")

def make_tarfile(output_filename: str, source_dir: str) -> bool:
    """
    :return: True on success, False otherwise
    """
    # This is where the path to each file and folder will be saved
    paths_to_tar = set()
    # os.walk over the root folder ("FolderA") - note it will never get added
    for dirpath, dirnames, filenames in os.walk(source_dir):
        # Resolve path issues, for example for Windows
        dirpath = os.path.normpath(dirpath)
        # Add each folder and path in the current directory
        # Probably could use zip here instead of set unions but can't be bothered to try to figure it out
        paths_to_tar = paths_to_tar.union(
            {os.path.join(dirpath, d) for d in dirnames}).union(
            {os.path.join(dirpath, f) for f in filenames})
    try:
        # This will create the tar file in the current directory
        with tarfile.open(output_filename, "w:gz") as tar:
            # Change the directory to treat all paths relatively
            os.chdir(source_dir)
            # Finally add each path using the relative path
            for path in paths_to_tar:
                tar.add(os.path.relpath(path, source_dir))
            return True
    except (tarfile.TarError, OSError) as e:
        print(f"An error occurred - {e}")
        return False

if __name__ == '__main__':
    make_tarfile("tarred_files.tar.gz", ROOT)

You could use subprocess to achieve something similar but much faster.
import subprocess

def make_tarfile(output_filename, source_dir):
    subprocess.call(["tar", "-C", source_dir, "-zcvf", output_filename, "."])

how to get a folder name and file name in python

I have a python program named myscript.py which would give me the list of files and folders in the path provided.
import os
import sys

def get_files_in_directory(path):
    for root, dirs, files in os.walk(path):
        print(root)
        print(dirs)
        print(files)

path=sys.argv[1]
get_files_in_directory(path)
The path I provided is D:\Python\Test; it contains some folders and subfolders, as you can see in the output below:
C:\Python34>python myscript.py "D:\Python\Test"
D:\Python\Test
['D1', 'D2']
[]
D:\Python\Test\D1
['SD1', 'SD2', 'SD3']
[]
D:\Python\Test\D1\SD1
[]
['f1.bat', 'f2.bat', 'f3.bat']
D:\Python\Test\D1\SD2
[]
['f1.bat']
D:\Python\Test\D1\SD3
[]
['f1.bat', 'f2.bat']
D:\Python\Test\D2
['SD1', 'SD2']
[]
D:\Python\Test\D2\SD1
[]
['f1.bat', 'f2.bat']
D:\Python\Test\D2\SD2
[]
['f1.bat']
I need to get the output this way :
D1-SD1-f1.bat
D1-SD1-f2.bat
D1-SD1-f3.bat
D1-SD2-f1.bat
D1-SD3-f1.bat
D1-SD3-f2.bat
D2-SD1-f1.bat
D2-SD1-f2.bat
D2-SD2-f1.bat
How do I get the output this way? (Keep in mind that the directory structure here is just an example; the program should be flexible enough for any path.)
Is there an os command for this? Can you please help me solve this? (Additional information: I am using Python 3.4.)

You could try using the glob module instead:
import glob

glob.glob(r'D:\Python\Test\*\*\*.bat')
Or, to just get the filenames:
import os
import glob

[os.path.basename(x) for x in glob.glob(r'D:\Python\Test\*\*\*.bat')]

To get what you want, you could do the following:
import os
import sys

def get_files_in_directory(path):
    # Get the root dir (in your case: test)
    rootDir = path.split('\\')[-1]
    # Walk through all subfolder/files
    for root, subfolder, fileList in os.walk(path):
        for file in fileList:
            # Skip empty dirs
            if file != '':
                # Get the full path of the file
                fullPath = os.path.join(root,file)
                # Split the path and the file (May do this one and the step above in one go)
                path, file = os.path.split(fullPath)
                # For each subfolder in the path (in REVERSE order)
                subfolders = []
                for subfolder in path.split('\\')[::-1]:
                    # As long as it isn't the root dir, append it to the subfolders list
                    if subfolder == rootDir:
                        break
                    subfolders.append(subfolder)
                # Print the list of subfolders (joined by '-')
                # + '-' + file
                print('{}-{}'.format( '-'.join(subfolders), file) )

path=sys.argv[1]
get_files_in_directory(path)
Output for my test folder:
SD1-D1-f1.bat
SD1-D1-f2.bat
SD2-D1-f1.bat
SD3-D1-f1.bat
SD3-D1-f2.bat
It may not be the best way to do it, but it will get you what you want.
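The listing above shows the folder names innermost first (SD1-D1-...), while the question asks for D1-SD1-.... One way to get the outermost folder first is to take each folder's path relative to the starting folder with os.path.relpath. This is a sketch under that assumption; the function name dashed_file_names is made up for illustration.

```python
import os

def dashed_file_names(path):
    """Yield 'Dir-SubDir-file' strings for every file below `path`."""
    for root, dirs, files in os.walk(path):
        dirs.sort()  # visit subfolders in a predictable order
        # Path of the current folder relative to the starting folder
        rel = os.path.relpath(root, path)
        parts = [] if rel == os.curdir else rel.split(os.sep)
        for name in sorted(files):
            yield "-".join(parts + [name])
```

Printing each value of dashed_file_names(sys.argv[1]) would then give one line per file. Because the split uses os.sep, the same code works with either Windows or POSIX separators and at any nesting depth.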

Python script to extract all subdirectories according to filename

My directory contains several folders, each with several subdirectories of their own. I need to move all of the files that contain 'Volume.csv' into a directory called Volume.
Folder1
|---1Area.csv
|---1Circumf.csv
|---1Volume.csv
Folder2
|---2Area.csv
|---2Circumf.csv
|---2Volume.csv
Volume
I'm trying combinations of os.walk and regex to retrieve the files by filename but not having much luck.
Any ideas?
Thank you!
Sunworshipper, thank you for the answer!
I ran the following code and it moved the entire directory rather than just file name containing 'Volume'. Is it clear why that happened?
import os
import shutil

source_dir = "~/Stats/"
dest_dir = "~/Stats/Volume/"
file_paths = set()
for dir_, _, files in os.walk(source_dir):
    for fileName in files:
        if "Volume" in fileName:
            relDir = os.path.relpath(dir_, source_dir)
            file_paths.add(relDir)
for matched in file_paths:
    shutil.move(matched, dest_dir)

You can use glob for this. It returns a list of path names matching the expression you give it.
import glob
import shutil

dest = 'testfiles/'
files = glob.glob('*/*test.csv')
for file in files:
    shutil.move(file, dest)
I used relative paths but you can also use absolute paths.
shutil moves the documents to the new location. See the glob.glob documentation for more info.

import os
import shutil
Set up your source and destination directories:
source_dir = "/Users/nenad/Documents/Python Files/Random Tests"
dest_dir = "/Users/nenad/Documents/Python Files/Random Tests/volume"
This set will now hold paths of all files matching your substring.
file_paths = set()
Now I only consider the directories that contain a file which has a substring "hello" in the filename.
for dir_, _, files in os.walk(source_dir):
    for fileName in files:
        if "hello" in fileName:
            relDir = os.path.relpath(dir_, source_dir)
            relFile = os.path.join(relDir, fileName)
            file_paths.add(relFile)
And now you just move them to your destination with shutil.
for matched in file_paths:
    # The collected paths are relative to source_dir, so join them back on
    shutil.move(os.path.join(source_dir, matched), dest_dir)
Sorry for the misread :)
Best regards
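Applied to the Volume.csv case from the question, the same collect-then-move approach could be sketched with pathlib as below. The function name move_matching_files is hypothetical, and the sketch makes a few assumptions: the substring contains no glob special characters, file names are unique across folders (a later file would otherwise overwrite an earlier one with the same name), and both arguments are given in the same form (both absolute or both relative) so that files already inside the destination can be recognized and skipped.

```python
import shutil
from pathlib import Path

def move_matching_files(source_dir, dest_dir, substring):
    """Move every file under `source_dir` whose name contains `substring`."""
    # expanduser() turns a leading "~" into the home directory
    source = Path(source_dir).expanduser()
    dest = Path(dest_dir).expanduser()
    dest.mkdir(parents=True, exist_ok=True)
    # Collect first, so the tree is not modified while it is being scanned
    matches = [
        p for p in source.rglob(f"*{substring}*")
        if p.is_file() and dest not in p.parents
    ]
    for path in matches:
        shutil.move(str(path), str(dest / path.name))
    return sorted(p.name for p in matches)
```

A call such as move_matching_files("~/Stats", "~/Stats/Volume", "Volume") would then move only the matching files, not their folders.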
