Iterating through a list vs folder in Python

I have a folder with hundreds of gigs worth of files. I have a list of filenames that I need to find from the folder.
My question is, is it faster if I iterate through the files in the folder (using glob for example) or create a list of all the file names and then iterate through that?
My initial assumption would be that creating a list would naturally be faster than iterating through the folder every time, but since the list would contain hundreds of items, I'm not 100% sure which is more efficient.

A list containing hundreds of filenames shouldn't be a performance problem. And for functions like glob there is an iterator form (glob.iglob) that streams the names to you instead of building the whole list up front.
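A minimal sketch of that streaming approach; the folder path and the wanted names below are placeholders:
import glob
import os

bigfolder = "/path/to/bigfolder"        # hypothetical folder with the files
wanted = {"a.txt", "b.txt", "c.txt"}    # hypothetical set of filenames you are looking for

# iglob yields one match at a time instead of building the whole list in memory
for path in glob.iglob(os.path.join(bigfolder, "*")):
    if os.path.basename(path) in wanted:
        print(path)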

You could use your list of desired filenames. Write a simple lambda:
lambda curr_item: curr_item in desired_list
And then use filter to go over the directory:
import os
desired_list = []  # your list of the files you seek
found_filenames = filter(lambda el: el in desired_list, os.listdir(bigfolder))  # bigfolder is the directory being searched
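If desired_list grows large, turning it into a set first makes each membership check constant time on average; a small sketch reusing the names from the snippet above:
import os

desired_set = set(desired_list)  # set lookup is O(1) on average, list lookup is O(n)
found_filenames = [name for name in os.listdir(bigfolder) if name in desired_set]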
But what do you want to do with the files next? That rather depends on what you are trying to achieve.

Related

Modifying the order of reading CSV Files in Python according to the name

I have 1000 CSV files with names Radius_x where x stands for 0,1,2...11,12,13...999. But when I read the files and try to analyse the results, I wish to read them in the same order of whole numbers as listed above. But the code reads as follows (for example): ....145,146,147,148,149,15,150,150...159,16,160,161,...... and so on.
I know that if we rename the CSV files as Radius_xyz where xyz = 000,001,002,003,....010,011,012.....999, the problem could be resolved. Kindly help me as to how I can proceed.
To sort a list of paths numerically in Python, first find all the files you want to open, then sort that list with a key that extracts the number.
With pathlib:
from pathlib import Path
files = list(Path("/tmp/so/").glob("Radius_*.csv")) # Path.glob returns a generator which needs to be put in a list
files.sort(key=lambda p: int(p.stem[7:])) # `Radius_` length is 7
files contains
[PosixPath('/tmp/so/Radius_1.csv'),
PosixPath('/tmp/so/Radius_2.csv'),
PosixPath('/tmp/so/Radius_3.csv'),
PosixPath('/tmp/so/Radius_4.csv'),
PosixPath('/tmp/so/Radius_5.csv'),
PosixPath('/tmp/so/Radius_6.csv'),
PosixPath('/tmp/so/Radius_7.csv'),
PosixPath('/tmp/so/Radius_8.csv'),
PosixPath('/tmp/so/Radius_9.csv'),
PosixPath('/tmp/so/Radius_10.csv'),
PosixPath('/tmp/so/Radius_11.csv'),
PosixPath('/tmp/so/Radius_12.csv'),
PosixPath('/tmp/so/Radius_13.csv'),
PosixPath('/tmp/so/Radius_14.csv'),
PosixPath('/tmp/so/Radius_15.csv'),
PosixPath('/tmp/so/Radius_16.csv'),
PosixPath('/tmp/so/Radius_17.csv'),
PosixPath('/tmp/so/Radius_18.csv'),
PosixPath('/tmp/so/Radius_19.csv'),
PosixPath('/tmp/so/Radius_20.csv')]
NB. files is a list of paths not strings, but most functions which deal with files accept both types.
A similar approach could be done with glob, which would give a list of strings not paths.
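A sketch of that glob variant, assuming the same /tmp/so/ layout as above:
import glob
import os
import re

files = glob.glob("/tmp/so/Radius_*.csv")
# pull the trailing number out of each filename and sort on it as an integer
files.sort(key=lambda p: int(re.search(r"Radius_(\d+)\.csv$", os.path.basename(p)).group(1)))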

How to remove duplicate pdf files from a list in python

I have a list containing the pdf files
l=['ab.pdf', 'cd.pdf', 'ef.pdf', 'gh.pdf']
Some of these four files are duplicates with only the names changed; how do I delete those files from the list?
For example, ab.pdf and cd.pdf are the same, so the final output will be
l=['ab.pdf', 'ef.pdf', 'gh.pdf']
I have tried the filecmp library, but it only tells me whether two files are duplicates.
What is the most efficient, Pythonic way to do this?
You need to give the program the actual paths of the files on your computer and compare their contents file by file, although that does not sound very efficient.
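One common way to avoid pairwise comparisons is to hash each file's contents and keep only the first file per digest; a sketch, assuming the names in the list are valid paths to the PDFs:
import hashlib

def file_hash(path, chunk_size=65536):
    """Return a SHA-256 digest of the file's contents, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

l = ['ab.pdf', 'cd.pdf', 'ef.pdf', 'gh.pdf']
seen = set()
unique = []
for name in l:
    digest = file_hash(name)
    if digest not in seen:
        seen.add(digest)
        unique.append(name)
# unique keeps the first file of each duplicate group, e.g. ['ab.pdf', 'ef.pdf', 'gh.pdf']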

Comparing two differently formatted lists in Python?

I need to compare two lists of records. One list has records that are stored in a network drive:
C:\root\to\file.pdf
O:\another\root\to\record.pdf
...
The other list has records stored in ProjectWise, collaboration software. It contains only filenames:
drawing.pdf
file.pdf
...
I want to create a list of the network drive file paths whose filename is not in the ProjectWise list; it must include the full paths. Currently, I search each line of the drive list with a regular expression that matches a line ending in any of the names from the ProjectWise list. The script takes an unbearably long time and I feel I am overcomplicating the process.
I have thought about using sets to compare the lists (set(list1)-set(list2)) but this would only work with and return filenames on their own without the paths.
If you use os.path.basename on each entry in the list of full paths, you get just the filename, which you can then compare to the other list.
import os
orig_list = [os.path.basename(path) for path in file_path_list]  # just the filenames from the drive paths
missing_filepaths = set(orig_list) - set(file_name_list)
That gives you the set of drive filenames that don't appear in the ProjectWise list, and you should be able to go from there.
Edit:
So, you want a list of the paths whose filename is not in the ProjectWise list, correct? That's pretty simple. Extending the code above, you can do this:
paths_without_filenames = [path for path in file_path_list if os.path.split(path)[1] in missing_filepaths]
This generates, from your list of file paths, those paths whose filename is not in the list of ProjectWise filenames.
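Putting the two steps together with a set for fast lookups; the two list literals below are just the sample data from the question (Windows-style paths, so this assumes it runs on Windows):
import os

drive_paths = [r"C:\root\to\file.pdf", r"O:\another\root\to\record.pdf"]  # full paths from the network drive
projectwise_names = ["drawing.pdf", "file.pdf"]                           # filenames from ProjectWise

known = set(projectwise_names)  # set lookup keeps the scan linear
# keep the paths whose filename does not appear in the ProjectWise list
unmatched_paths = [p for p in drive_paths if os.path.basename(p) not in known]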

python glob.glob() regex multi files in different dir

I am trying to use glob.glob() to get a list of files that come from different directories and have two kinds of suffix.
For example, the files I am going to read are
/ABC/DEF/HIJ/*.{data,index}
and
/ABC/LMN/HIJ[0-3]/*.{data,index}
I was asked to do it with only a single glob.glob() call.
How can I do it? Thanks.
You could try using a list comprehension (if this fits your single-call criterion):
files_wanted = ['/ABC/DEF/HIJ/*.data', '/ABC/DEF/HIJ/*.index', '/ABC/LMN/HIJ[0-3]/*.data', '/ABC/LMN/HIJ[0-3]/*.index']  # list containing your glob patterns
files_list = [glob.glob(pattern) for pattern in files_wanted]  # list comprehension: one result list per pattern
Hope this works for you!
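Note that glob.glob returns a separate list per pattern, so files_list above is a list of lists; if you want one flat list, itertools.chain can merge them, as in this sketch:
import glob
import itertools

files_wanted = ['/ABC/DEF/HIJ/*.data', '/ABC/DEF/HIJ/*.index',
                '/ABC/LMN/HIJ[0-3]/*.data', '/ABC/LMN/HIJ[0-3]/*.index']
# chain the per-pattern result lists into one flat list of matching files
all_files = list(itertools.chain.from_iterable(glob.glob(pattern) for pattern in files_wanted))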

python list n files then next n files in a directory and Map it to a mapper function

I have a directory containing around a hundred thousand text files.
My Python code creates a list of the names of these files:
listoffiles = os.listdir(directory)
I break this listoffiles into chunks of 64 files with a lol function:
lol = lambda lst, sz: [lst[i:i+sz] for i in range(0, len(lst), sz)]
partitioned_listoffiles = lol(listoffiles, 64)
Then I map the chunks over a pool of 2 processes:
pool = Pool(processes=2,)
single_count_tuples = pool.map(Map, partitioned_listoffiles)
In the Map function I read those files and do further processing.
My problem is that this code works fine for a small folder with thousands of files, but on large directories it runs out of memory. How should I solve this? Can I read the first n files, then the next n files, building listoffiles and processing these steps in a for loop?
If the directory is very large then you could use os.scandir() instead of os.listdir(). But it is unlikely that os.listdir() itself causes the MemoryError, so the issue is probably in the other two places:
Use a generator expression instead of list comprehension:
chunks = (lst[i:i+n] for i in range(0, len(lst), n))
Use pool.imap or pool.imap_unordered instead of pool.map():
for result in pool.imap_unordered(Map, chunks):
pass
Or better:
files = os.listdir(directory)
for result in pool.imap_unordered(process_file, files, chunksize=100):
pass
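A minimal end-to-end sketch along those lines; the directory path and the body of process_file are stand-ins for the real per-file work done in Map:
import os
from multiprocessing import Pool

DIRECTORY = "/path/to/textfiles"  # hypothetical location of the text files

def process_file(name):
    """Stand-in for the real per-file processing."""
    with open(os.path.join(DIRECTORY, name), encoding="utf-8") as f:
        return name, len(f.read())

if __name__ == "__main__":
    # scandir yields directory entries lazily instead of building one huge list
    names = (entry.name for entry in os.scandir(DIRECTORY) if entry.is_file())
    with Pool(processes=2) as pool:
        for name, length in pool.imap_unordered(process_file, names, chunksize=100):
            pass  # aggregate results here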
I've had a very similar problem, where I was required to verify that a certain number of files were in a specific folder. The problem was that the folder could contain up to 20 million very small files.
From what I've learned, there is no way to limit Python's listdir to a certain number of items.
My listdir call takes quite a while and a lot of RAM to list the directory, but it manages to run on a VM with 4 GB of RAM.
You may want to try using glob instead, which might keep the file list smaller, depending on your requirements.
import glob
print(glob.glob("/tmp/*.txt"))
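If the goal is only to verify that at least a certain number of files exist, os.scandir combined with itertools.islice can stop the scan early; a sketch, with the folder and the threshold as placeholder values:
import itertools
import os

FOLDER = "/data/incoming"  # hypothetical folder to check
EXPECTED = 1000            # hypothetical number of files to verify

# os.scandir is lazy, so islice stops the directory scan after EXPECTED entries
count = sum(1 for _ in itertools.islice(os.scandir(FOLDER), EXPECTED))
print(count >= EXPECTED)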
