Extract a subset from several file names using Python

I have a lot of files in a directory with names like:
'data_2000151_avg.txt', 'data_2000251_avg.txt', 'data_2003051_avg.txt'...
Assume that one of them is called fname. I would like to extract a subset from each like so:
fname.split('_')[1][:4]
This gives 2000 as a result. I want to collect these values from all the files in the directory and build a unique list. How do I do that?

You should use os.listdir to iterate over the directory:
import os
dirname = 'PathToFile'
myuniquelist = []
for d in os.listdir(dirname):
    if d.startswith('data_'):
        myuniquelist.append(d.split('_')[1][:4])
EDIT: Just saw your comment on wanting a set. After the for loop add this line.
myuniquelist = list(set(myuniquelist))
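Alternatively, you can collect into a set from the start and only convert to a list at the end. A minimal sketch of the same loop (dirname is the same placeholder path as above):
import os
dirname = 'PathToFile'  # placeholder path, as in the answer above
years = set()
for d in os.listdir(dirname):
    if d.startswith('data_'):
        years.add(d.split('_')[1][:4])  # e.g. '2000' from 'data_2000151_avg.txt'
myuniquelist = list(years)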

If unique list means a list of unique values, then a combination of glob (in case the folder contains files that do not match the desired name format) and set should do the trick:
from glob import glob
uniques = {fname.split('_')[1][:4] for fname in glob('data_*_avg.txt')}
# In case you really do want a list
unique_list = list(uniques)
This assumes the files reside in the current working directory. Add the path as needed, e.g. glob('path/to/data_*_avg.txt').

For listing the files in a directory you can use os.listdir(). For generating the set of unique values, a set comprehension works well.
import os
data = {f.split('_')[1][:4] for f in os.listdir(dir_path)}
list(data)  # if you really need a list
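Note that os.listdir returns every entry in the folder. If the directory may contain files that don't follow the data_*_avg.txt pattern from the question, you can filter inside the comprehension; a small sketch, with dir_path assumed as above:
import os
dir_path = '.'  # assumed placeholder
data = {f.split('_')[1][:4]
        for f in os.listdir(dir_path)
        if f.startswith('data_') and f.endswith('_avg.txt')}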

Related

Find files with regex and their respective directory

I'm working in the 'C:\Documents' directory.
It has many subdirectories, and I need to find all the files whose filenames start with the 'A0' prefix and end with the '.xls' extension, for example 'A0SSS.xls' or 'A0ASDF.xls'.
Is it possible to fetch all those files and get their directory?
For instance, if the file 'A0SSS.xls' is located in 'C:\Documents\Folder1', I need to know the file name (A0SSS.xls) along with its respective directory (C:\Documents\Folder1).
To find the paths of the matching files, run a recursive search with a filter. I recommend using pathlib, so you can easily get the parent folder for each match. The list of parent folders can contain duplicates if multiple matching files live in the same folder. There are many ways to make a list unique in Python; one of them is to convert the list to a set, whose elements are unique by definition, and then convert it back to a list.
from pathlib import Path
search_path = Path(r"C:\Documents")
results = list(search_path.rglob("A0*.xls"))
string_results = [str(matching_path) for matching_path in results]
containing_folders = [r.parent for r in results]
unique_folders = list(set(containing_folders))
print("matching files:")
for r in string_results:
    print(r)
print()
print("containing folders:")
for f in unique_folders:
    print(f)

Searching for filenames containing multiple keywords in python

I have a directory containing a large number of files. I want to find all files where the file name contains specific strings (e.g. a certain ending like '.txt', a model ID 'model_xy', etc.) as well as one of the entries in an integer array (e.g. a set of years I would like to select).
I tried this the following way:
import numpy as np
import glob
startyear = 2000
endyear = 2005
timerange = str(np.arange(startyear,endyear+1))
data_files = []
for file in glob.glob('/home/..../*model_xy*' + timerange + '.txt'):
    data_files.append(file)
print(data_files)
Unfortunately, like this, other files outside of my 'timerange' are still selected.
You can use shell-style wildcard patterns, including character ranges like [0-5], in glob.glob (it does not accept full regular expressions, but a character class covers your year range here). Moreover, glob.glob already returns a list, so you don't need to iterate through it and append to a new list.
import glob
data_files = glob.glob("/home/..../*model_xy*200[0-5].txt")
# or, if you want to do a recursive search, you can use ** together with recursive=True
# the line below searches for all matching files under /home/ recursively
data_files = glob.glob("/home/**/*model_xy*200[0-5].txt", recursive=True)
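If your year range does not fit into a single character class (say 1998 to 2005, which crosses a decade boundary), one option is to glob once per year and combine the results. A rough sketch under that assumption:
import glob
startyear, endyear = 1998, 2005  # assumed example range
data_files = []
for year in range(startyear, endyear + 1):
    data_files.extend(glob.glob(f"/home/..../*model_xy*{year}.txt"))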

Looping through files using lists

I have a folder (a pseudo directory, /usr/folder/) of files that look like this:
target_07750_20181128.tsv.gz
target_07750_20181129.tsv.gz
target_07751_20181130.tsv.gz
target_07751_20181203.tsv.gz
target_07751_20181204.tsv.gz
target_27103_20181128.tsv.gz
target_27103_20181129.tsv.gz
target_27103_20181130.tsv.gz
I am trying to join the above tsv files to one xlsx file on store code (found in the file names above).
I am reading say file.xlsx and reading that in as a pandas dataframe.
I have extracted store codes from file.xlsx so I have the following:
stores = instore.store_code.astype(str).unique()
output:
07750
07751
27103
So my end goal is to loop through each store in stores and find which filenames it corresponds to in the directory. Here is what I have so far, but I can't seem to get the proper filenames to print:
import os
for store in stores:
print(store)
if store in os.listdir('/usr/folder/'):
print(os.listdir('/usr/folder/'))
The output I'm expecting to see for, say, store_code '07750' in the loop would be:
07750
target_07750_20181128.tsv.gz
target_07750_20181129.tsv.gz
Instead I'm only seeing the store codes returned:
07750
07751
27103
What am I doing wrong here?
The reason your if statement fails is that it checks if "07750" etc is one of the filenames in the directory, which it is not. What you want is to see if "07750" is contained in one of the filenames.
I'd go about it like this:
import os
from collections import defaultdict
store_files = defaultdict(list)
for filename in os.listdir('/usr/folder/'):
    store_number = filename.split('_')[1]  # e.g. '07750' from 'target_07750_20181128.tsv.gz'
    store_files[store_number].append(filename)
Now store_files will be a dictionary with a list of filenames for each store number.
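From there, looking up the files for each store code in your stores array is just a dictionary access. A small usage sketch, assuming stores holds strings like '07750' as in the question:
for store in stores:
    print(store)
    for filename in store_files.get(store, []):
        print(filename)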
The problem is that you're assuming a substring search, but that's not how in works on a list. For instance, on the first iteration, your if looks like this:
if "07750" in ["target_07750_20181128.tsv.gz",
"target_07750_20181129.tsv.gz",
"target_07751_20181130.tsv.gz",
... ]:
The string "07755" is not an element of that list. It does appear as a substring, but in doesn't work that way on a list. Instead, try this:
for filename in os.listdir('/usr/folder/'):
    if '_' + store + '_' in filename:
        print(filename)
Does that help?

Query Related to Python - Folders Read

I want to read folders in Python and make a list of them. My main concern is that the most recent folder should be at a location that is known to me: it can be the first or the last element of the list. I am attaching an image suggesting the folder names. I want the folder named 20181005 to be either first or last in the list.
I have tried os.listdir, but I am not confident about the order in which this function reads folders and stores them in the list. Does it store them in creation-date or modification-date order? If I could sort on the basis of name (20181005 etc.), it would be really good.
Kindly suggest a suitable method.
Regards
os.listdir returns directory contents in arbitrary order, but you can sort that yourself:
l = sorted(os.listdir())
Since it seems that your folder names are ISO dates, they should sort correctly and the most recent one should be the last element after sorting.
If you need to access creation & modification times you can do that with os.path functions. If you want to sort by that, I would probably choose to put it in something like a pandas DataFrame to make it easier to manipulate.
import os
from datetime import datetime
import pandas as pd
path = "."
objects = os.listdir(path)
dirs = list()
for o in objects:
    opath = os.path.join(path, o)
    if os.path.isdir(opath):
        dirs.append(dict(path=opath,
                         mtime=datetime.fromtimestamp(os.path.getmtime(opath)),
                         ctime=datetime.fromtimestamp(os.path.getctime(opath))))
data = pd.DataFrame(dirs)
data.sort_values(by='mtime')
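To pull out just the most recently modified directory from that DataFrame, something like the following should work (a small sketch building on the code above):
most_recent = data.sort_values(by='mtime').iloc[-1]['path']  # last row after sorting by mtime
print(most_recent)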
Assuming your directories follow the YYYYMMDD naming format, you can use listdir and sort to get the latest directory at the last index.
import os
from os import listdir
mypath = 'D:\\anil'
list_dirs = []
for f in listdir(mypath):
    if os.path.isdir(os.path.join(mypath, f)):
        list_dirs.append(f)
list_dirs.sort()
for current_dir in list_dirs:
    print(current_dir)

Comparing two differently formatted lists in Python?

I need to compare two lists of records. One list has records that are stored on a network drive:
C:\root\to\file.pdf
O:\another\root\to\record.pdf
...
The other list has records stored in ProjectWise, collaboration software. It contains only filenames:
drawing.pdf
file.pdf
...
I want to create a list of the network drive file paths that do not have a filename in the ProjectWise list. It must include the paths. Currently, I am searching each line in the drive list with a regular expression that matches a line ending with any of the names in the ProjectWise list. The script is taking an unbearably long time, and I feel I am overcomplicating the process.
I have thought about using sets to compare the lists (set(list1) - set(list2)), but this would only work with, and return, filenames on their own, without the paths.
If you use os.path.basename on the list that contains the full file paths, you can get the filenames and then compare them to the other list.
import os
orig_list = [os.path.basename(path) for path in file_path_list]
missing_filepaths = set(orig_list) - set(file_name_list)
That will get you a set of all filenames from the drive list that don't appear in the ProjectWise list, and you should be able to go from there.
Edit:
So, you want a list of paths that don't have an associated filename, correct? Then pretty simple. Extending from the code before you can do this:
paths_without_filenames = [path for path in file_path_list if os.path.split(path)[1] in missing_filepaths]
This will generate, from your list of file paths, the paths whose filenames do not appear in the list of ProjectWise filenames.
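Putting the pieces together, here is a minimal end-to-end sketch; file_path_list and file_name_list are assumed names for the network-drive paths and the ProjectWise filenames, and it assumes the script runs on Windows, where the backslash is the path separator:
import os
file_path_list = [r'C:\root\to\file.pdf', r'O:\another\root\to\record.pdf']  # drive paths
file_name_list = ['drawing.pdf', 'file.pdf']  # ProjectWise filenames
projectwise_names = set(file_name_list)
paths_without_filenames = [path for path in file_path_list
                           if os.path.basename(path) not in projectwise_names]
print(paths_without_filenames)  # ['O:\\another\\root\\to\\record.pdf']
Because set membership checks are constant time, this avoids scanning every path against every ProjectWise name with a regular expression, which is why the original approach was so slow.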
