Missing a file with os.walk() - python

I'm trying to make a file listing tool for a colleague. The code is quite simple :
source = C:\Users\Documents\test
extension = '.txt'
file_list = []
lower_levels = False
for root, dirs, files in os.walk(source):
for n in files:
if n.endswith(extension):
file_list.append(n)
if (not lower_levels): #does not check lower levels
break
writing_in_excel(source, file_list) #output is an excel file
When testing it on my test folder, it works pretty fine, I get all my 121 files listed in the output.
However, when my colleague tries it, one file is missing compared to the number of files given by Windows (I verified, windows indicates 39735 files wih the right extension, for 39734 in the excel file) and given the number of files, it's hard to find out which file is missing.
The problem doesn't seem to come from the writing in excel, since I write the total number of files with len(file_list), and can already see that the file is missing in the list. I guess it comes from the walking in the directory ??
Does anyone know where the problem could come from ?
Thanks

There's probably an error that os.walk does not show by default. It can be handeled by setting the onerror parameter. Write an error handler:
def walk_error(error):
print(error.filename)
Then change your call to:
for root, dirs, files in os.walk(source, onerror=walk_error):

Looks like the problem came from the extension condition, one of the file had the extension in caps.
so I just replaced
if n.endswith(extension):
By
ext = os.path.splitext(n)[-1].lower() #get the current file extension in lower case
if ext== extension:
And it works !
Thank you for your help.

Related

Python search files in multiple subdirectories for specific string and return file path(s) if present

I would be very grateful indeed for some help for a frustrated and confused Python beginner.
I am trying to create a script that searches a windows directory containing multiple subdirectories and different file types for a specific single string (a name) in the file contents and if found prints the filenames as a list. There are approximately 2000 files in 100 subdirectories, and all the files I want to search don't necessarily have the same extension - but are all in essence, ASCII files.
I've been trying to do this for many many days but I just cannot figure it out.
So far I have tried using glob recursive coupled with reading the file but I'm so very bewildered. I can successfully print a list of all the files in all subdirectories, but don't know where to go from here.
import glob
files = []
files = glob.glob('C:\TEMP' + '/**', recursive=True)
print(files)
Can anyone please help me? I am 72 year old scientist trying to improve my skills and "automate the boring stuff", but at the moment I'm just losing the will.
Thank you very much in advance to this community.
great to have you here!
What you have done so far is found all the file paths, now the simplest way is to go through each of the files, read them into the memory one by one and see if the name you are looking for is there.
import glob
files = glob.glob('C:\TEMP' + '/**', recursive=True)
target_string = 'John Smit'
# itereate over files
for file in files:
try:
# open file for reading
with open(file, 'r') as f:
# read the contents
contents = f.read()
# check if contents have your target string
if target_string in conents:
print(file)
except:
pass
This will print the file path each time it found a name.
Please also note I have removed the second line from your code, because it is redundant, you initiate the list in line 3 anyway.
Hope it helps!
You could do it like this, though i think there must be a better approach
When you find all files in your directory, you iterate over them and check if they contain that specific string.
for file in files:
if(os.path.isfile(file)):
with open(file,'r') as f:
if('search_string' in f.read()):
print(file)

Duplicate in list created from filenames (python)

I'm trying to create a list of excel files that are saved to a specific directory, but I'm having an issue where when the list is generated it creates a duplicate entry for one of the file names (I am absolutely certain there is not actually a duplicate of the file).
import glob
# get data file names
path =r'D:\larvalSchooling\data'
filenames = glob.glob(path + "/*.xlsx")
output:
>>> filenames
['D:\\larvalSchooling\\data\\copy.xlsx', 'D:\\larvalSchooling\\data\\Raw data-SF_Fri_70dpf_GroupABC_n5_20200828_1140-Trial 1.xlsx', 'D:\\larvalSchooling\\data\\Raw data-SF_Sat_70dpf_GroupA_n5_20200808_1015-Trial 1.xlsx', 'D:\\larvalSchooling\\data\\Raw data-SF_Sat_84dpf_GroupABCD_n5_20200822_1440-Trial 1.xlsx', 'D:\\larvalSchooling\\data\\~$Raw data-SF_Fri_70dpf_GroupABC_n5_20200828_1140-Trial 1.xlsx']
you'll note 'D:\larvalSchooling\data\Raw data-SF_Fri_70dpf_GroupABC_n5_20200828_1140-Trial 1.xlsx' is listed twice.
Rather than going through after the fact and removing duplicates I was hoping to figure out why it's happening to begin with.
I'm using python 3.7 on windows 10 pro
If you wrote the code to remove duplicates (which can be as simple as filenames = set(filenames)) you'd see that you still have two filenames. Print them out one on top of the other to make a visual comparison easier:
'D:\\larvalSchooling\\data\\Raw data-SF_Sat_84dpf_GroupABCD_n5_20200822_1440-Trial 1.xlsx',
'D:\\larvalSchooling\\data\\~$Raw data-SF_Fri_70dpf_GroupABC_n5_20200828_1140-Trial 1.xlsx'
The second one has a leading ~ (probably an auto-backup).
Whenever you open an excel file it will create a ghost copy that works as a temporary backup copy for that specific file. In this case:
Raw data-SF_Fri_70dpf_GroupABC_n5_20200828_1140-Trial1.xlsx
~$ Raw data-SF_Fri_70dpf_GroupABC_n5_20200828_1140-Trial1.xlsx
This means that the file is open by some software and it's showing you that backup inside(usually that file is hidden from the explorer as well)
Just search for the program and close it. Other actions, such as adding validation so the "~$.*.xlsx" type of file is ignored should be also implemented if this is something you want to avoid.
You can use os.path.splittext to get the file extension and loop through the directory using os.listdir . The open excel files can be skipped using the following code:
filenames = []
for file in os.listdir('D:\larvalSchooling\data'):
filename, file_extension = os.path.splitext(file)
if file_extension == '.xlsx':
if not file.startswith('~$'):
filenames.append(file)
Note: this might not be the best solution, but it'll get the job done :)

Filetypes that Shutil.move and os.rename cannot transfer

I am new to python. I tried to search for answers but I cannot find a exact match to my question. I am trying to move all non-Excel files to another folder. However, there is an error when trying to move a .pbix file. I wonder if there are only limited number of filetypes supported by shutil.move() and os.rename() in moving files. And, are there any workarounds? Thank you.
UPDATE: The error is PermissionError. Actually, when I checked now the target folder, the file is transferred but the original file is retained.
Here is my sample code:
files = os.listdir(os.getcwd())
for f in files:
try:
data = pd.read_excel(f) # importing the file
except:
shutil.move("{}".format(f), r".\\Non_Excel_Files\{}".format(f))
It is now working. Thanks to the suggestion of S3DEV.
files = os.listdir(os.getcwd())
for f in files:
if os.path.splitext(f)[1] != ".xlsx":
shutil.move("{}".format(f), r".\\Non_Excel_Files\{}".format(f))

Incorrect file reading when using os.walk in python3

I am crawling through folders using the os.walk() method. In one of the folders, there is a large number of files, around 100,000 of them. The files look like: p_123_456.zip. But they are read as p123456.zip. Indeed, when I open windows explorer to browse the folder, for the first several seconds the files look like p123456.zip, but then change their appearance to p_123_456.zip. This is a strange scenario.
Now, I can't use time.sleep() because all folders and and files are being read into python variables in the looping line. Here is a snippet of the code:
for root, dirs, files in os.walk(srcFolder):
os.chdir(root)
for file in files:
shutil.copy(file, storeFolder)
In the last line, I get a file not found exception, saying that the file p123456.zip does not exist. Has anyone run into this mysterious issue? Anyway to bypass this? What is the cause of this? Thank you.
You don't seem to be concatenating the actual folder name with the filenames. Try changing your code to:
for root, dirs, files in os.walk(srcFolder):
for file in files:
shutil.copy(os.path.join(root, file), storeFolder)
os.chdir should be avoided like the plague. For one thing - if the changes suceeeds, it won't be the directory from which you are running your os.walk anymore - and then, a second chdir on another folder will fail (either stop your porgram or change you to an unexpected folder).
Just add the folder name as prefixes, and don't try using chdir.
Moreover, as for the comment from ShadowRanger above, os.walk officially breaks if you chdir inside its iteration - https://docs.python.org/3/library/os.html#os.walk - that is likely the root of the problem you had.

Python's os.walk() fails in Windows when there are long filenames

I use python os.walk() to get files and dirs in some directories, but there're files whose names are too long(>300), os.walk() return nothing, use onerror I get '[Error 234] More data is available'. I tried to use yield, but also get nothing and shows 'Traceback: StopIteration'.
OS is windows, code is simple. I have tested with a directory, if there's long-name file, problem occur, while if rename the long-name files with short names, code can get correct result.
I can do nothing for these directories, such as rename or move the long-name files.
Please help me to solve the problem!
def t(a):
for root,dirs,files in os.walk(a):
print root,dirs,files
t('c:/test/1')
In Windows file names (including path) can not be greater than 255 characters, so the error you're seeing comes from Windows, not from Python - because somehow you managed to create such big file names, but now you can't read them. See this post for more details.
The only workaround I can think of is to map the the folder to the specific directory. This will make the path way shorter. e.g. z:\myfile.xlsx instead of c:\a\b\c\d\e\f\g\myfile.xlsx

Categories

Resources