Break early from nested os.walk?

I am writing my own redundant-file cleanup utility (mainly to help me learn Python 2.7).
The processing logic involves three steps:
1. Walk through the potentially redundant folder tree, getting a filename.
2. Walk through the "golden" tree, searching for a file previously found in step 1.
3. If the files are equal, delete the redundant file (found in step 1).
At this point, to save time, I want to break out of searching through the golden tree.
Here is what I have so far:
# step 1
for redundant_root, redundant_dirs, redundant_files in os.walk(redundant_input_path):
    redundant_path = redundant_root.split('/')
    for redundant_file in redundant_files:
        redundant_filename = redundant_root + "\\" + redundant_file
        # step 2
        for golden_root, golden_dirs, golden_files in os.walk(golden_input_path):
            golden_path = golden_root.split('/')
            for golden_file in golden_files:
                golden_filename = golden_root + "\\" + golden_file
                # step 3
                if filecmp.cmp(golden_filename, redundant_filename, True):
                    print("removing " + redundant_filename)
                    os.remove(redundant_filename)
                    try:
                        os.rmdir(redundant_root)
                    except:
                        pass
                    # here is where I want to break from continuing to search through the golden tree.
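One minimal way to get that early exit (a sketch, not from the original post) is to move the golden-tree search into a small helper function; returning from it ends both inner loops at once. It reuses redundant_input_path and golden_input_path from the question:

import os
import filecmp

def exists_in_golden_tree(golden_input_path, redundant_filename):
    """Return True as soon as an identical file is found anywhere in the golden tree."""
    for golden_root, golden_dirs, golden_files in os.walk(golden_input_path):
        for golden_file in golden_files:
            golden_filename = os.path.join(golden_root, golden_file)
            if filecmp.cmp(golden_filename, redundant_filename, True):
                return True  # returning breaks out of both loops at once
    return False

# step 1
for redundant_root, redundant_dirs, redundant_files in os.walk(redundant_input_path):
    for redundant_file in redundant_files:
        redundant_filename = os.path.join(redundant_root, redundant_file)
        # steps 2 and 3
        if exists_in_golden_tree(golden_input_path, redundant_filename):
            print("removing " + redundant_filename)
            os.remove(redundant_filename)
            try:
                os.rmdir(redundant_root)  # only succeeds once the folder is empty
            except OSError:
                pass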

Related

problem with moving files using os.rename

I have this block of code where I try to move all the files in a folder to a different folder.
import os
from os import listdir
from os.path import isfile, join

def run():
    print("Do you want to convert 1 file (0) or do you want to convert all the files in a folder(1)")
    oneortwo = input("")
    if oneortwo == "0":
        filepathonefile = input("what is the filepath of your file?")
        filepathonefilewithoutfullpath = os.path.basename(filepathonefile)
        newfolder = "C:/Users/EL127032/Documents/fileconvertion/files/" + filepathonefilewithoutfullpath
        os.rename(filepathonefile, newfolder)
    if oneortwo == "1":
        filepathdirectory = input("what is the filepath of your folder?")
        filesindirectory = [f for f in listdir(filepathdirectory) if isfile(join(filepathdirectory, f))]
        numberoffiles = len(filesindirectory)
        handlingfilenumber = 0
        while numberoffiles > handlingfilenumber:
            currenthandlingfile = filesindirectory[handlingfilenumber]
            oldpathcurrenthandling = filepathdirectory + "/" + currenthandlingfile
            futurepathcurrenhandlingfile = "C:/Users/EL127032/Documents/fileconvertion/files/" + currenthandlingfile
            os.rename(oldpathcurrenthandling, futurepathcurrenhandlingfile)
But when I run this, it gives:
os.rename(oldpathcurrenthandling, futurepathcurrenhandlingfile)
FileNotFoundError: [WinError 2] System couldn't find the file: 'C:\Users\EL127032\Documents\Eligant - kopie\Klas 1\Stermodules\Basisbiologie/lopen (1).odt' -> 'C:/Users/EL127032/Documents/fileconvertion/files/lopen (1).odt'
Can someone help me, please?
You are trying to move the same file twice.
The bug is in this part:
numberoffiles = len(filesindirectory)
handlingfilenumber = 0
while numberoffiles > handlingfilenumber:
    currenthandlingfile = filesindirectory[handlingfilenumber]
    oldpathcurrenthandling = filepathdirectory + "/" + currenthandlingfile
    futurepathcurrenhandlingfile = "C:/Users/EL127032/Documents/fileconvertion/files/" + currenthandlingfile
    os.rename(oldpathcurrenthandling, futurepathcurrenhandlingfile)
The first time you loop, handlingfilenumber will be 0, so you will move the 0th file from your filesindirectory list.
Then you loop again; handlingfilenumber is still 0, so you try to move the same file again, but it is not there anymore (you already moved it on the first turn).
You forgot to increment handlingfilenumber. Add handlingfilenumber += 1 on a line after os.rename and you will be fine.
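For clarity, the loop with that one-line fix applied:

numberoffiles = len(filesindirectory)
handlingfilenumber = 0
while numberoffiles > handlingfilenumber:
    currenthandlingfile = filesindirectory[handlingfilenumber]
    oldpathcurrenthandling = filepathdirectory + "/" + currenthandlingfile
    futurepathcurrenhandlingfile = "C:/Users/EL127032/Documents/fileconvertion/files/" + currenthandlingfile
    os.rename(oldpathcurrenthandling, futurepathcurrenhandlingfile)
    handlingfilenumber += 1  # move on to the next file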
while loops are more error-prone than simpler for loops, so I recommend using for loops when appropriate.
Here, you want to move each file, so a for loop suffices:
for filename in filesindirectory:
    oldpathcurrenthandling = filepathdirectory + "/" + filename
    futurepathcurrenhandlingfile = "C:/Users/EL127032/Documents/fileconvertion/files/" + filename
    os.rename(oldpathcurrenthandling, futurepathcurrenhandlingfile)
No need to use len, initialize a counter, increment it, or fetch the n-th element, and it takes fewer lines.
Three other things:
1. You could have found the cause of the problem yourself with a bit of debugging; there are plenty of resources online explaining how to do it. Just printing the name of the file about to be moved (oldpathcurrenthandling) would have shown it appearing twice and pointed you to the problem causing the OS error.
2. Your variable names are not very readable. Consider following the standard style guide for variable names (PEP 8) and standard jargon: for example, filepathonefilewithoutfullpath becomes filename, oldpathcurrenthandling becomes source_file_path (following the source/destination convention), and so on.
3. When you have an error, include the stack trace that Python gives you. It would have pointed directly to the second os.rename call; the first one (when you move only one file) does not contribute to the problem. It also helps when building a Minimal Reproducible Example.
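As an illustration of the naming point only, here is a sketch reusing filesindirectory and filepathdirectory from the question; destination_dir is a made-up name for the hard-coded target folder:

import os

destination_dir = "C:/Users/EL127032/Documents/fileconvertion/files/"

for filename in filesindirectory:
    # source/destination naming convention, per PEP 8-style readable names
    source_file_path = os.path.join(filepathdirectory, filename)
    destination_file_path = os.path.join(destination_dir, filename)
    os.rename(source_file_path, destination_file_path)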

Process 100 feature classes through script and add feature class name to end of each output

NOTE: Due to work constraints I must use Python 2.7 (I know, eyeroll) and standard modules. I'm still learning Python.
I have about 100 tiled 'area of interest' polygons in a geodatabase that need to be processed through my script. My script has been tested on individual tiles and works great. I need advice on how to iterate this process so I don't have to run one tile at a time. (I don't want to iterate ALL 100 at once in case something fails; I just want to make a list or something and run about 10-15 at a time.) I also need to add the name of the tile I am processing to each feature class that I output.
So far I have tried using fnmatch.fnmatch, which errors because it does not like a list. I changed the syntax to parentheses (a tuple), which did NOT error but did NOT print anything.
I figure once that naming piece is done, running the process in the for loop should work. Please advise on what I am doing wrong, or whether there is a better way - thanks!
This is just a snippet of the full process:
tilename = 'T0104'
HIFLD_fc = os.path.join(work_dir, 'fc_clipped_lo' + tilename)
HIFLD_fc1 = os.path.join(work_dir, 'fc1_hifldstr_lo' + tilename)
HIFLD_fc2 = os.path.join(work_dir, 'fc2_non_ex_lo' + tilename)
HIFLD_fc3 = os.path.join(work_dir, 'fc3_no_wilder_lo' + tilename)

arcpy.env.workspace = env_dir
fcs = arcpy.ListFeatureClasses()
tile_list = ('AK1004', 'AK1005')

for tile in fcs:
    filename, ext = os.path.splitext(tile)
    if fnmatch.fnmatch(tile, tile_list):
        print(tile)
        arcpy.Clip_analysis(HIFLD_fc, bufferOut2, HIFLD_fc1, "")
        print('HIFLD clipped for analysis')
        arcpy.Clip_analysis(HIFLD_fc, env_mask, HIFLD_masked_rds, "")
        print('HIFLD clipped by envelopes and excluded from analysis')
        arcpy.Clip_analysis(HIFLD_masked_rds, wild_mask, HIFLD_excluded, "")
        print('HIFLD clipped by wilderness mask and excluded from analysis')
        arcpy.MakeFeatureLayer_management(HIFLD_fc1, 'hifld_lyr')
        arcpy.SelectLayerByLocation_management('hifld_lyr', "COMPLETELY_WITHIN", bufferOut1, "", "NEW_SELECTION", "INVERT")
        if arcpy.GetCount_management('hifld_lyr') > 0:
            arcpy.CopyFeatures_management('hifld_lyr', HIFLD_fc2)
            print('HIFLD split features deleted fc2')
        else:
            pass
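Not from the original post, but as a sketch of the filtering and naming piece under the question's own assumptions (fcs from arcpy.ListFeatureClasses() and work_dir as above): fnmatch.fnmatch expects a single pattern string, so passing a tuple of tile names will not work as intended; test each tile name separately, or check substring membership directly:

tile_list = ['AK1004', 'AK1005']  # the 10-15 tiles to run in this batch

for tile in fcs:
    filename, ext = os.path.splitext(tile)
    # find which tile name (if any) appears in this feature class name
    matching = [t for t in tile_list if t in filename]
    if not matching:
        continue
    tilename = matching[0]
    print(filename, tilename)
    # build outputs that carry the tile name, then run the clips as above
    HIFLD_fc1 = os.path.join(work_dir, 'fc1_hifldstr_lo' + tilename)
    # ... arcpy.Clip_analysis(...) etc. ...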

Python consumes excessive memory, doesn't complete run even given adequate memory

I am writing code for an information retrieval project, which reads Wikipedia pages in XML format from a file, processes the string (I've omitted this part for the sake of simplicity), tokenizes the strings, and builds positional indexes for the terms found on the pages. It then saves the indexes to a file using pickle once, and reads them back from that file on later runs to save processing time (I've included the code for those parts, but they're commented out).
After that, I need to fill a 1572 * ~97000 matrix (1572 is the number of Wiki pages, and ~97000 is the number of terms found in them). Each Wiki page is a vector of words, and vectors[i][j] is the number of occurrences of the j'th word of the word set in the i'th Wiki page. (Again, this has been simplified, but it doesn't matter.)
The problem is that it takes far too much memory to run the code, and even then, from a point somewhere between the 350th and 400th row of the matrix onwards, it doesn't proceed (it doesn't stop either). I thought the problem was memory, because when usage exceeded my 7.7 GiB of RAM and 1.7 GiB of swap, the process stopped and printed:
Process finished with exit code 137 (interrupted by signal 9: SIGKILL)
But when I added 6 GiB of swap for Python 3.7 by making a swap file (using the script recommended here), the program didn't run out of memory, but instead got stuck with 7.7 GiB of RAM + 3.9 GiB of swap occupied, as I said, at a point between the 350th and 400th iteration of i in the loop at the bottom. Instead of Ubuntu 18.04, I tried it on Windows 10, where the screen simply went black. I tried Windows 7 as well, again to no avail.
Next I thought it was a PyCharm issue, so I ran the file with the python3 file.py command, and it got stuck at the very same point as with PyCharm. I even used the numpy.float16 datatype to save memory, but it had no effect. I asked a colleague about their matrix dimensions; they were similar to mine, but they weren't having problems with it. Is it malware or a memory leak? Or am I doing something wrong here?
import pickle
from hazm import *
from collections import defaultdict
import numpy as np

'''For each word there's one of these. It stores the word's frequency, and the positions it has occurred in on each wiki page'''

class Positional_Index:
    def __init__(self):
        self.freq = 0
        self.title = defaultdict(list)
        self.text = defaultdict(list)

'''Here I tokenize words and construct indexes for them'''
# tree = ET.parse('Wiki.xml')
# root = tree.getroot()
# index_dict = defaultdict(Positional_Index)
# all_content = root.findall('{http://www.mediawiki.org/xml/export-0.10/}page')
#
# for page_index, pg in enumerate(all_content):
#     title = pg.find('{http://www.mediawiki.org/xml/export-0.10/}title').text
#     txt = pg.find('{http://www.mediawiki.org/xml/export-0.10/}revision') \
#         .find('{http://www.mediawiki.org/xml/export-0.10/}text').text
#
#     title_arr = word_tokenize(title)
#     txt_arr = word_tokenize(txt)
#
#     for term_index, term in enumerate(title_arr):
#         index_dict[term].freq += 1
#         index_dict[term].title[page_index] += [term_index]
#
#     for term_index, term in enumerate(txt_arr):
#         index_dict[term].freq += 1
#         index_dict[term].text[page_index] += [term_index]
#
# with open('texts/indices.txt', 'wb') as f:
#     pickle.dump(index_dict, f)

with open('texts/indices.txt', 'rb') as file:
    data = pickle.load(file)

'''Here I'm trying to keep the number of occurrences of each word on each page'''
page_count = 1572
vectors = np.array([[0 for j in range(len(data.keys()))] for i in range(page_count)], dtype=np.float16)
words = list(data.keys())
word_count = len(words)
const_log_of_d = np.log10(1572)

""" :( """
for i in range(page_count):
    for j in range(word_count):
        vectors[i][j] = (len(data[words[j]].title[i]) + len(data[words[j]].text[i]))
    if i % 50 == 0:
        print("i:", i)
Update: I tried this on a friend's computer; this time it killed the process somewhere between the 1350th and 1400th iteration.
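Not an answer from the thread, but a sketch of two memory-related adjustments under the assumptions of the posted code (data is the unpickled index): building the matrix from a nested Python list comprehension first materialises roughly 1572 x 97000 Python integers before conversion, and indexing the defaultdicts with pages a word never occurs on (data[words[j]].title[i]) silently inserts an empty list for every miss, so the index itself balloons as the loop runs. np.zeros plus iterating only over the pages that are actually present avoids both:

import numpy as np

page_count = 1572
words = list(data.keys())
word_count = len(words)

# allocate the matrix directly instead of building a nested Python list first
vectors = np.zeros((page_count, word_count), dtype=np.float16)

for j, word in enumerate(words):
    entry = data[word]
    # iterate only over pages the word actually occurs on, so no new
    # defaultdict entries are created for missing pages
    for i, positions in entry.title.items():
        vectors[i, j] += len(positions)
    for i, positions in entry.text.items():
        vectors[i, j] += len(positions)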

Traverse directories to a specific depth in Python [duplicate]

This question already has answers here:
Travel directory tree with limited recursion depth
(2 answers)
Closed 5 years ago.
I would like to search for and print directories under C:/, for example, but only list the 1st and 2nd levels down that contain SP30070156-1.
What is the most efficient way to do this using Python 2, without the script walking through the entire set of sub-directories (there are so many in my case that it would take a very long time)?
Typical directory names are as follows:
Rooty Hill SP30068539-1 3RD Split Unit AC Project
Oxford Falls SP30064418-1 Upgrade SES MSB
Queanbeyan SP30066062-1 AC
You can try to create a function based on os.walk(). Something like this should get you started:
import os

def walker(base_dir, level=1, string=None):
    results = []
    for root, dirs, files in os.walk(base_dir):
        _root = root.replace(base_dir + '\\', '')  # you may need to remove the "+ '\\'"
        if _root.count('\\') < level:
            if string is None:
                results.append(dirs)
            else:
                if string in dirs:
                    results.append(dirs)
    return results
Then you can just call it with string='SP30070156-1' and level 1 then level 2.
Not sure if it's going to be faster than 40s, though.
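For example, a hypothetical call (the base directory here is made up; the target string is from the question):

matches = walker('C:\\SomeProjectRoot', level=2, string='SP30070156-1')
print(matches)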
Here is the code I used. The method is quick to list; if filtered for a keyword, it is even quicker:
import os

MAX_DEPTH = 1

class Found(Exception):
    """Raised to stop walking as soon as a match is found."""
    pass

#folders = ['U:\I-Project Works\PPM 20003171\PPM 11-12 NSW', 'U:\I-Project Works\PPM 20003171\PPM 11-12 QLD']
folders = ['U:\I-Project Works\PPM 20003171\PPM 11-12 NSW']

try:
    for stuff in folders:
        for root, dirs, files in os.walk(stuff, topdown=True):
            for dir in dirs:
                if "SP30070156-1" in dir:
                    sp_path = root + "\\" + dir
                    print(sp_path)
                    raise Found
            if root.count(os.sep) - stuff.count(os.sep) == MAX_DEPTH - 1:
                del dirs[:]  # prune: don't descend below MAX_DEPTH
except Found:
    print("found")

loop np.load until file-index exceeds index of available files

I wish to read in all files from a folder using np.load without specifying the total number of files in advance. Currently, after a few loops the index runs out of the range of available files and the code terminates.
index = 0
while True:
    a = np.load(file=filepath + 'c_l' + pc_output_layer + '_s0_p' + str(index) + '.npy')
    layer = np.append(layer, a)
    index += 1
How can I keep loading until an error occurs and then continue running the rest of the script? Thank you!
You could catch the exception and break out of the loop that way, but a more 'pythonic' way would be to loop over the filenames themselves, rather than using an index.
The glob library allows you to find files matching a given pattern and returns a list you can then iterate over.
E.g.:
import glob

files = glob.glob(filepath + 'c_l*.npy')
for f in files:
    a = np.load(file=f)
    layer = np.append(layer, a)
You could also simplify it further by creating the layers directly using a list comprehension.
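As a rough sketch of that suggestion (keeping filepath, pc_output_layer, and the filename pattern from the question, and flattening the arrays the same way np.append does by default):

import glob
import numpy as np

# note: sorted() is lexicographic; sort numerically if the load order matters
files = sorted(glob.glob(filepath + 'c_l' + pc_output_layer + '_s0_p*.npy'))
layer = np.concatenate([np.load(f).ravel() for f in files])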
