python glob.glob() to match multiple files in different directories

I am trying to use glob.glob() to get a list of files that come from different directories and have two kinds of suffixes.
For example, the files I am going to read are
/ABC/DEF/HIJ/*.{data,index}
and
/ABC/LMN/HIJ[0-3]/*.{data,index}
I was asked to do it with only a single glob.glob() call.
How can I do it? Thanks.

You could try using a list comprehension (if this fits your single-call criterion):
import glob

files_wanted = ['/ABC/DEF/HIJ/*.data', '/ABC/DEF/HIJ/*.index', '/ABC/LMN/HIJ[0-3]/*.data', '/ABC/LMN/HIJ[0-3]/*.index']  # your glob patterns (these are globs, not regular expressions)
files_list = [glob.glob(pattern) for pattern in files_wanted]  # note: a list of lists, one sublist per pattern
Hope this works for you!
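Since glob patterns don't support brace expansion like *.{data,index}, here's a minimal sketch (reusing the directories from the question) that builds the pattern list programmatically and flattens the matches into a single list in one expression:
import glob
import itertools

# one pattern per (directory, suffix) pair; dirs and suffixes come from the question
dirs = ['/ABC/DEF/HIJ', '/ABC/LMN/HIJ[0-3]']
suffixes = ['data', 'index']
patterns = ['{}/*.{}'.format(d, s) for d, s in itertools.product(dirs, suffixes)]
files_list = [f for p in patterns for f in glob.glob(p)]  # flat list of matching paths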

Related

Modifying the order of reading CSV Files in Python according to the name

I have 1000 CSV files with names Radius_x where x stands for 0,1,2...11,12,13...999. But when I read the files and try to analyse the results, I wish to read them in the whole-number order listed above. Instead, the code reads them in string order, for example: ...145, 146, 147, 148, 149, 15, 150, 151...159, 16, 160, 161... and so on.
I know that if we rename the CSV files as Radius_xyz where xyz = 000,001,002,003,....010,011,012.....999, the problem could be resolved. Kindly help me as to how I can proceed.
To sort a list of paths numerically in Python, first find all the files you are looking to open, then sort that iterable with a key which extracts the number.
With pathlib:
from pathlib import Path
files = list(Path("/tmp/so/").glob("Radius_*.csv")) # Path.glob returns a generator which needs to be put in a list
files.sort(key=lambda p: int(p.stem[7:])) # `Radius_` length is 7
files now contains:
[PosixPath('/tmp/so/Radius_1.csv'),
PosixPath('/tmp/so/Radius_2.csv'),
PosixPath('/tmp/so/Radius_3.csv'),
PosixPath('/tmp/so/Radius_4.csv'),
PosixPath('/tmp/so/Radius_5.csv'),
PosixPath('/tmp/so/Radius_6.csv'),
PosixPath('/tmp/so/Radius_7.csv'),
PosixPath('/tmp/so/Radius_8.csv'),
PosixPath('/tmp/so/Radius_9.csv'),
PosixPath('/tmp/so/Radius_10.csv'),
PosixPath('/tmp/so/Radius_11.csv'),
PosixPath('/tmp/so/Radius_12.csv'),
PosixPath('/tmp/so/Radius_13.csv'),
PosixPath('/tmp/so/Radius_14.csv'),
PosixPath('/tmp/so/Radius_15.csv'),
PosixPath('/tmp/so/Radius_16.csv'),
PosixPath('/tmp/so/Radius_17.csv'),
PosixPath('/tmp/so/Radius_18.csv'),
PosixPath('/tmp/so/Radius_19.csv'),
PosixPath('/tmp/so/Radius_20.csv')]
NB. files is a list of paths not strings, but most functions which deal with files accept both types.
A similar approach could be done with glob, which would give a list of strings not paths.
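For reference, a minimal sketch of that glob variant, assuming the same /tmp/so/ layout as above:
import glob
import os

files = glob.glob("/tmp/so/Radius_*.csv")  # list of strings rather than Path objects
# strip the directory and extension, then take the digits after "Radius_" (length 7)
files.sort(key=lambda p: int(os.path.splitext(os.path.basename(p))[0][7:]))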

I don't understand the Python solution I used to glob multiple file types

So I need to parse and read files that have only "fip" or "frp" in them. Rather than just doing a full glob and then using an if statement, I decided to search the web for how to achieve this. I stumbled on the answers here: Python glob multiple filetypes
Specifically modified my code to use a solution I found:
flist = [f for f_ in [odfslogs_p_handler.glob(e) for e in ('*frp*', '*fip*')] for f in f_]
The p_handler is a pathlib object. Now this code works. I just need some help understanding the wizardry behind it.
I know this is a list comprehension but I've only dealt with simple examples. Can someone please explain to me why this works?
Also, can I chain more patterns into the inner tuple? Let's say I also want to parse .txt and .csv files. Is it just a matter of adding a comma and including those patterns inside the tuple?
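For what it's worth, here is the same comprehension unrolled into plain loops (a sketch that assumes odfslogs_p_handler from the question), which may make the nesting easier to follow:
flist = []
for f_ in [odfslogs_p_handler.glob(e) for e in ('*frp*', '*fip*')]:  # one generator per pattern
    for f in f_:  # each path matched by that pattern
        flist.append(f)
And yes, covering .txt and .csv too is just a matter of extending the tuple: ('*frp*', '*fip*', '*.txt', '*.csv').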

Comparing two differently formatted lists in Python?

I need to compare two lists of records. One list has records that are stored in a network drive:
C:\root\to\file.pdf
O:\another\root\to\record.pdf
...
The other list has records stored in ProjectWise, collaboration software. It contains only filenames:
drawing.pdf
file.pdf
...
I want to create a list of the network drive file paths that do not have a filename that is in the ProjectWise list. It must include the paths. Currently, I am searching each line in the drive list with a regular expression that matches a line ending in any of the names from the ProjectWise list. The script is taking an unbearably long time and I feel I am overcomplicating the process.
I have thought about using sets to compare the lists (set(list1)-set(list2)), but this would only work with, and return, bare filenames without the paths.
If you use os.path.basename on the list that contains full paths to the file you can get the filename and can then compare that to the other list.
import os
orig_list = [os.path.basename(path) for path in file_path_list]
missing_filepaths = set(orig_list) - set(file_name_list)
that will get you the set of filenames from the drive list that don't appear in the ProjectWise list, and you should be able to go from there.
Edit:
So, you want a list of paths whose filenames are not in the ProjectWise list, correct? Then it's pretty simple. Extending from the code before, you can do this:
paths_without_filenames = [path for path in file_path_list if os.path.split(path)[1] in missing_filepaths]
this will generate a list of filepaths from your list of filepaths that don't have an associated filename in the list of filenames.
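The two steps can also be collapsed into a single pass (a sketch, assuming file_path_list and file_name_list as above): build a set of the ProjectWise names, then keep any path whose basename is not in it.
import os

projectwise_names = set(file_name_list)  # set membership tests are O(1)
paths_without_match = [p for p in file_path_list
                       if os.path.basename(p) not in projectwise_names]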

Iterating through a list vs folder in python

I have a folder with hundreds of gigs worth of files. I have a list of filenames that I need to find from the folder.
My question is, is it faster if I iterate through the files in the folder (using glob for example) or create a list of all the file names and then iterate through that?
My initial assumption would be that creating a list would naturally be faster than iterating through the folder every time, but since the list would contain hundreds of items, I'm not 100% sure which is more efficient.
A list containing hundreds of filenames shouldn't be a performance problem. But for functions like glob you can often find an iterator form that streams the data for you.
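For example, glob.iglob is the iterator form of glob.glob; a minimal sketch (the folder path is a placeholder):
import glob

# yields one matching path at a time instead of building the whole list in memory
for path in glob.iglob("/path/to/bigfolder/*"):
    print(path)  # process each file as it streams in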
You could use your list of desired filenames. Write a simple lambda:
lambda curr_item: curr_item in desired_list
And then use filter to go over the directory.
import os

desired_list = list()  # your list of the files you seek
found_filenames = filter(lambda el: el in desired_list, os.listdir(bigfolder))
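One note on the snippet above: membership tests on a list are linear, so for a large desired_list a set lookup keeps the scan fast. A minimal variant:
desired = set(desired_list)  # O(1) membership instead of O(n) per file
found_filenames = [name for name in os.listdir(bigfolder) if name in desired]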
But what you do next depends on what you want to do with the files.

Regex in Python to match all the files in a folder

I'm very bad at regex.
I'm trying to locate files in a folder based on the file names. Most of the filenames are in the format GSE1234_series_matrix.txt, hence I've been using os.path.join("files", GSE_num + "_series_matrix.txt"). However, a few files have names like GSE1234-GPL22_series_matrix.txt. I'm not sure how to address all the files starting with a GSE number and ending with _series_matrix.txt together, possibly in one statement. I'd really appreciate any help.
EDIT - I have these series matrix text files in a folder, whose path I build with path join. I also input a text file which has all the GSE numbers, so the script runs only for selected GSE numbers. So not everything in the folder is in the GSE number list, and the list has only GSE numbers, not GPL ones. For instance, the file GSE1234-GPL22_series_matrix.txt would appear as GSE1234 in the list.
Skip using regexes entirely.
good_filenames = [name for name in filenames if name.startswith("GSE") and name.endswith("_series_matrix.txt")]
You could use glob. Depending on how much of the path you include in the pattern, you wouldn't have to worry about using os.path.join at all.
import glob
good_filenames = glob.glob('/your/path/here/GSE*_series_matrix.txt')
returns:
['/your/path/here/GSE1234_series_matrix.txt',
'/your/path/here/GSE1234-GPL22_series_matrix.txt']
Kevin's answer is great! If you'd like to use a regex, you can do something like this:
^GSE\d+.*_series_matrix\.txt$
That would match anything that starts with GSE followed by a number and ends with _series_matrix.txt (the dot is escaped so it matches a literal period).
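Putting that to use, a small sketch that filters a list of names with the compiled pattern (filenames is assumed to be a list of file names, as in the first answer):
import re

pattern = re.compile(r"^GSE\d+.*_series_matrix\.txt$")
good_filenames = [name for name in filenames if pattern.match(name)]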
