I need to compare two lists of records. One list has records that are stored in a network drive:
C:\root\to\file.pdf
O:\another\root\to\record.pdf
...
The other list has records stored in ProjectWise, collaboration software. It contains only filenames:
drawing.pdf
file.pdf
...
I want to create a list of the network drive file paths whose filename is not in the ProjectWise list; it must include the paths. Currently, I am searching each line of the drive list with a regular expression that matches a line ending with any of the names in the ProjectWise list. The script takes an unbearably long time, and I feel I am overcomplicating the process.
I have thought about using sets to compare the lists (set(list1) - set(list2)), but that would only work with, and return, bare filenames without the paths.
If you use os.path.basename on each entry of the list that contains full paths to the files, you can get the filenames and then compare those to the other list.
import os
orig_list = [os.path.basename(path) for path in file_path_list]
missing_filepaths = set(orig_list) - set(file_name_list)
That will get you the set of filenames from the path list that don't appear in the filename list, and you should be able to go from there.
Edit:
So, you want a list of paths that don't have an associated filename, correct? Then it's pretty simple. Extending the code from before, you can do this:
paths_without_filenames = [path for path in file_path_list if os.path.split(path)[1] in missing_filepaths]
This will generate, from your list of file paths, the paths whose filename does not appear in the list of filenames.
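Putting the pieces together, here is a minimal sketch; the names file_path_list and file_name_list are assumptions standing in for your two lists, and it assumes the script runs on Windows so os.path.basename splits the drive paths correctly:
import os

# assumed inputs: full paths from the network drive and bare ProjectWise filenames
file_path_list = [r"C:\root\to\file.pdf", r"O:\another\root\to\record.pdf"]
file_name_list = ["drawing.pdf", "file.pdf"]

# basenames from the drive list that have no match in the ProjectWise list
missing_filepaths = {os.path.basename(p) for p in file_path_list} - set(file_name_list)

# keep the full paths whose filename is missing from ProjectWise
paths_without_filenames = [p for p in file_path_list
                           if os.path.basename(p) in missing_filepaths]
print(paths_without_filenames)  # ['O:\\another\\root\\to\\record.pdf']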
I have 1000 CSV files with names Radius_x, where x stands for 0, 1, 2 ... 11, 12, 13 ... 999. But when I read the files and try to analyse the results, I want to read them in the same whole-number order as listed above. Instead, the code reads them in this order (for example): ...145, 146, 147, 148, 149, 15, 150, 151 ... 159, 16, 160, 161 ... and so on.
I know that if we rename the CSV files as Radius_xyz where xyz = 000,001,002,003,....010,011,012.....999, the problem could be resolved. Kindly help me as to how I can proceed.
To sort a list of paths numerically in Python, first find all the files you are looking to open, then sort that iterable with a key which extracts the number.
With pathlib:
from pathlib import Path
files = list(Path("/tmp/so/").glob("Radius_*.csv")) # Path.glob returns a generator which needs to be put in a list
files.sort(key=lambda p: int(p.stem[7:])) # `Radius_` length is 7
files now contains:
[PosixPath('/tmp/so/Radius_1.csv'),
PosixPath('/tmp/so/Radius_2.csv'),
PosixPath('/tmp/so/Radius_3.csv'),
PosixPath('/tmp/so/Radius_4.csv'),
PosixPath('/tmp/so/Radius_5.csv'),
PosixPath('/tmp/so/Radius_6.csv'),
PosixPath('/tmp/so/Radius_7.csv'),
PosixPath('/tmp/so/Radius_8.csv'),
PosixPath('/tmp/so/Radius_9.csv'),
PosixPath('/tmp/so/Radius_10.csv'),
PosixPath('/tmp/so/Radius_11.csv'),
PosixPath('/tmp/so/Radius_12.csv'),
PosixPath('/tmp/so/Radius_13.csv'),
PosixPath('/tmp/so/Radius_14.csv'),
PosixPath('/tmp/so/Radius_15.csv'),
PosixPath('/tmp/so/Radius_16.csv'),
PosixPath('/tmp/so/Radius_17.csv'),
PosixPath('/tmp/so/Radius_18.csv'),
PosixPath('/tmp/so/Radius_19.csv'),
PosixPath('/tmp/so/Radius_20.csv')]
NB: files is a list of Path objects, not strings, but most functions that deal with files accept both types.
A similar approach works with the glob module, which gives a list of strings rather than Path objects.
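For example, a rough equivalent with the glob module might look like this (the directory and the prefix length are assumed to match the pathlib example above):
import glob
import os

files = glob.glob("/tmp/so/Radius_*.csv")  # plain strings, unsorted
# extract the number between "Radius_" and ".csv" and sort numerically
files.sort(key=lambda s: int(os.path.basename(s)[7:-4]))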
There are a lot of similar questions, but I can't find an answer to exactly this question: how to get ALL sub-directories starting from an initial one. The depth of the sub-directories is unknown.
Let's say I have:
data/subdir1/subdir2/file.csv
data/subdir1/subdir3/subdir4/subdir5/file2.csv
data/subdir6/subdir7/subdir8/file3.csv
So I would like to either get a list of all sub-directories at any depth OR, even better, all the paths one level above the files. In my example I would ideally want to get:
data/subdir1/subdir2/
data/subdir1/subdir3/subdir4/subdir5/
data/subdir6/subdir7/subdir8/
but I could work with this as well:
data/subdir1/
data/subdir1/subdir2/
data/subdir1/subdir3/
data/subdir1/subdir3/subdir4/
etc...
data/subdir6/subdir7/subdir8/
My code so far only gets me one level deep of directories:
result = await self.s3_client.list_objects(
    Bucket=bucket, Prefix=prefix, Delimiter="/"
)
subfolders = set()
for content in result.get("CommonPrefixes"):
    print(f"sub folder : {content.get('Prefix')}")
    subfolders.add(content.get("Prefix"))
return subfolders
import os
# list_objects returns a dictionary. The 'Contents' key contains a
# list of full paths including the file name stored in the bucket
# for example: data/subdir1/subdir3/subdir4/subdir5/file2.csv
objects = s3_client.list_objects(Bucket='bucket_name')['Contents']
# here we iterate over the fullpaths and using
# os.path.dirname we get the fullpath excluding the filename
for obj in objects:
    print(os.path.dirname(obj['Key']))
To turn this into a sorted list of unique directory "paths", we can sort a set comprehension inline. Sets hold unique values, and sorted converts the result to a list.
See https://docs.python.org/3/tutorial/datastructures.html#sets
import os
paths = sorted({os.path.dirname(obj['Key']) for obj in objects})
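As a quick sanity check, with the three example keys from the question the result would look like this (note there are no trailing slashes, because os.path.dirname strips them):
# keys: data/subdir1/subdir2/file.csv
#       data/subdir1/subdir3/subdir4/subdir5/file2.csv
#       data/subdir6/subdir7/subdir8/file3.csv
print(paths)
# ['data/subdir1/subdir2',
#  'data/subdir1/subdir3/subdir4/subdir5',
#  'data/subdir6/subdir7/subdir8']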
I have a folder with hundreds of gigs worth of files. I have a list of filenames that I need to find from the folder.
My question is, is it faster if I iterate through the files in the folder (using glob for example) or create a list of all the file names and then iterate through that?
My initial assumption would be that creating a list would naturally be faster than iterating through the folder every time, but since the list would contain hundreds of items, I'm not 100% sure which is more efficient.
A list containing hundreds of filenames shouldn't be a performance problem. But for functions like glob you can often find an iterator form that streams the entries for you instead of building the whole list.
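For instance, glob.iglob returns a lazy iterator instead of materialising the full list up front; the pattern below is just a placeholder:
import glob

# iterate lazily over matching entries instead of building a list in memory
for path in glob.iglob("/path/to/bigfolder/*"):
    print(path)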
You could use your list of desired filenames. Write a simple lambda:
lambda curr_item: curr_item in desired_list
And then use filter to go over the directory.
import os

desired_list = []  # your list of the filenames you seek
found_filenames = filter(lambda el: el in desired_list, os.listdir(bigfolder))
But what do you want to do with the files next? The best approach depends on that.
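A small sketch of that idea; converting desired_list to a set is my own addition, but it makes each membership test O(1), which helps when the folder is large:
import os

desired_names = set(desired_list)  # desired_list: your list of wanted filenames
found_filenames = [name for name in os.listdir(bigfolder) if name in desired_names]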
I want to make an array of subdirectories in Python. Here is an example layout and model list I would like to obtain.
Root
|
directories
/ \
subdir_1... subdir_n
Hence from the root, I would like to run a program that will make a list of all the subdirectories. That way if I were to write:
print(List_of_Subdirectories)
Where List_of_Subdirectories is the list of appended directories. I would obtain the output:
[subdir_1, subdir_2, ... , subdir_n]
In essence, I would like to achieve the same results as if I were to hard code every directory into a list. For example:
List_of_Subdirectories = ["subdir_1", "subdir_2", ... , "subdir_n"]
Where subdir_n denotes an arbitrary nth directory.
Unlike other posts here on stack overflow, I would like the list to contain just the directory names without tuples or paths.
If you just want the directory names, you can use os.walk to do this:
os.walk(directory)
will yield a 3-tuple (dirpath, dirnames, filenames) for each directory it visits. The second entry is the list of subdirectory names, which is what the code below uses. You can wrap this in a function that simply returns the list of directory names, like so:
import os

def list_paths(path):
    directories = [x[1] for x in os.walk(path)]
    non_empty_dirs = [x for x in directories if x]  # filter out empty lists
    return [item for subitem in non_empty_dirs for item in subitem]  # flatten the list
This should give you all of the directory names.
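For example, called on the layout sketched in the question (the directory names are illustrative):
# hypothetical layout: Root/directories/subdir_1 ... Root/directories/subdir_n
print(list_paths("Root"))
# ['directories', 'subdir_1', 'subdir_2', ..., 'subdir_n']
# note: os.walk recurses, so intermediate directories such as 'directories'
# show up in the result as well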
If all you want is a list of the subdirectories of a specified directory, all you need is os.listdir and a filter to keep only directories.
It's as simple as:
import os

List_of_Subdirectories = list(filter(os.path.isdir, os.listdir()))
print(List_of_Subdirectories)
os.listdir returns a list with the names of all the entries in the specified directory (or . by default), both directories and files. We keep only the directories using os.path.isdir. Then, since you want a list, we explicitly convert the filtered result.
You couldn't usefully print the filter object directly, but you could iterate over it. The snippet below achieves the same result as the one above.
directory_elements = filter(os.path.isdir, os.listdir())

List_of_Subdirectories = []
for element in directory_elements:
    List_of_Subdirectories.append(element)

print(List_of_Subdirectories)
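If the directory is not the current working directory, os.path.isdir needs a full path; here is a minimal sketch, with the target path as a placeholder:
import os

target = "/path/to/root"  # placeholder for the directory you want to inspect
List_of_Subdirectories = [name for name in os.listdir(target)
                          if os.path.isdir(os.path.join(target, name))]
print(List_of_Subdirectories)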
I have a lot of files in a directory with name like:
'data_2000151_avg.txt', 'data_2000251_avg.txt', 'data_2003051_avg.txt'...
Assume that one of them is called fname. I would like to extract a subset from each like so:
fname.split('_')[1][:4]
This gives 2000 as a result. I want to collect these values from all the files in the directory and create a unique list. How do I do that?
You can use the os module.
import os
dirname = 'PathToFile'
myuniquelist = []
for d in os.listdir(dirname):
    if d.startswith('data_'):  # keep only files matching the data_*_avg.txt pattern
        myuniquelist.append(d.split('_')[1][:4])
EDIT: Just saw your comment on wanting a set. After the for loop add this line.
myuniquelist = list(set(myuniquelist))
If "unique list" means a list of unique values, then a combination of glob (in case the folder contains files that do not match the desired name format) and a set comprehension should do the trick:
from glob import glob
uniques = {fname.split('_')[1][:4] for fname in glob('data_*_avg.txt')}
# In case you really do want a list
unique_list = list(uniques)
This assumes the files reside in the current working directory. Prepend the path as necessary, e.g. glob('path/to/data_*_avg.txt').
For listing files in a directory you can use os.listdir(). For generating the collection of unique values, a set comprehension is best suited.
import os
data = {f.split('_')[1][:4] for f in os.listdir(dir_path)}
list(data) #if you really need a list
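As a quick check, with the three example filenames from the question the comprehension would yield (set ordering may vary):
filenames = ['data_2000151_avg.txt', 'data_2000251_avg.txt', 'data_2003051_avg.txt']
data = {f.split('_')[1][:4] for f in filenames}
print(data)  # {'2000', '2003'}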