Searching for filenames containing multiple keywords in python

I have a directory containing a large number of files. I want to find all files, where the file name contains specific strings (e.g. a certain ending like '.txt', a model ID 'model_xy', etc.) as well as one of the entries in an integer array (e.g. a number of years I would like to select).
I tried it the following way:
import numpy as np
import glob

startyear = 2000
endyear = 2005
timerange = str(np.arange(startyear, endyear + 1))
data_files = []
for file in glob.glob('/home/..../*model_xy*' + timerange + '.txt'):
    data_files.append(file)
print(data_files)
Unfortunately, like this, other files outside of my 'timerange' are still selected.

You can use character ranges like [0-5] in glob.glob patterns (glob uses shell-style wildcards, not full regex). Moreover, glob.glob already returns a list, so you don't need to iterate through it and append to a new list.
import glob

data_files = glob.glob("/home/..../*model_xy*200[0-5].txt")
# or, for a recursive search, you can use ** together with recursive=True.
# The line below will search for all such files under /home/ recursively:
data_files = glob.glob("/home/**/*model_xy*200[0-5].txt", recursive=True)
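Note that 200[0-5] only works here because 2000–2005 happens to vary in a single digit. For an arbitrary set of years, one option is to glob broadly and then filter the matches in Python. A minimal sketch, assuming the directory and naming scheme are placeholders:

```python
import glob
import os

def files_for_years(pattern, years):
    """Return files matching the glob pattern whose basename contains any of the given years."""
    wanted = {str(y) for y in years}
    return [f for f in glob.glob(pattern)
            if any(y in os.path.basename(f) for y in wanted)]

# Hypothetical usage; adjust the path and model ID to your data:
# files = files_for_years('/home/user/data/*model_xy*.txt', range(2000, 2006))
```

This trades a slightly broader glob for the flexibility of matching any list of integers, not just ranges expressible as a character class.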


Finding filenames in a folder that contain a variable from a list, opening the JSON files and performing actions on them

I'm working with JSON files, and I've created some code that will open a single file, add it to a pandas dataframe, and perform some procedures on the data within. A snippet of this code follows:
response_dic=first_response.json()
print(response_dic)
base_df=pd.DataFrame(response_dic)
base_df.head()
The code then goes on to extract parts of the JSON data into dataframes, before merging and printing to CSV.
Where I want to develop the code is to have it iterate through a folder first, find the filenames that match my list of filenames I want to work on, and then perform the functions on those files. For example, I have a folder with 1000 docs, and I will only need to perform the function on a sample of these.
I've created a list in CSV of the account codes that I want to work on, I've then imported the csv details and created a list of account codes as follows:
import csv

csv_file = open(r'C:\filepath', 'r')
cikas = []
cikbs = []
csv_file.readline()
for a, b, c in csv.reader(csv_file, delimiter=','):
    cikas.append(a)
    cikbs.append(b)
midstring = [s for s in cikbs]
print(midstring)
My account names are then stored in midstring, for example ['12345', '2468', '56789']. This means I can control which account codes are worked on by amending my CSV file in future. These names will vary at different stages hence I don't want to absolutely define them at this stage.
What I would like the code to do, is check the working directory, see if there is a file that matches for example C:\Users*12345.json. If there is, perform the pandas procedures upon it, then move to the next file. Is this possible? I've tried a number of tutorials involving glob, iglob, fnmatch etc but struggling to come up with a workable solution.
You can list all the files with a .json extension in the current directory first.
import os, json
import pandas as pd
path_to_json = 'currentdir/'
json_files = [json_file for json_file in os.listdir(path_to_json) if json_file.endswith('.json')]
print(json_files)
Now iterate over the list of json_files and perform a check
# example list: json_files = ['12345.json', '2468.json', '56789.json']
# midstring = ['12345', '2468', '56789']
for file in json_files:
    if file.split('.')[0] in midstring:
        with open(os.path.join(path_to_json, file)) as f:
            df = pd.DataFrame(json.load(f))
        # perform pandas functions
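If the account code can appear anywhere in the filename (as in the C:\Users*12345.json pattern from the question), fnmatch can express that as a wildcard match. A small sketch, with hypothetical filenames:

```python
import fnmatch
import os

def matching_files(folder, codes):
    """Return names of JSON files in folder whose name contains one of the account codes."""
    matches = []
    for name in os.listdir(folder):
        # '*12345*.json' matches '12345.json', 'acct_12345.json', etc.
        if any(fnmatch.fnmatch(name, f'*{code}*.json') for code in codes):
            matches.append(name)
    return matches
```

Each returned name could then be opened and fed to pandas as shown above.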

python - get highest number in filenames in a directory [duplicate]

This question already has answers here:
Find file in directory with the highest number in the filename
(2 answers)
Closed 3 years ago.
I'm developing a timelapse camera on a read-only filesystem which writes images to a USB stick. There is no real-time clock or internet connection, so I can't use datetime to maintain the temporal order of the files and prevent overwriting.
So I could store images as 1.jpg, 2.jpg, 3.jpg and so on, updating a counter in a file last.txt on the USB stick, but I'd rather avoid that, so I'm trying to work out the last filename at boot. The problem is that with 9.jpg and 10.jpg present, print(max(glob.glob('/home/pi/Desktop/timelapse/*'))) returns 9.jpg, because the comparison is lexicographic. I also suspect glob would be slow with thousands of files. How can I solve this?
EDIT
I found this solution:
import glob
import os
import ntpath

max = 0
for name in glob.glob('/home/pi/Desktop/timelapse/*.jpg'):
    n = int(os.path.splitext(ntpath.basename(name))[0])
    if n > max:
        max = n
print(max)
but it takes about 3 s per 10,000 files. Is there a faster solution, apart from dividing the files into sub-folders?
Here:
latest_file_index = max(int(f[:f.index('.')]) for f in os.listdir('path_to_folder_goes_here'))
Another idea is just to use the length of the file list (assuming all files in the folder are the numbered jpg files, with no gaps in the numbering):
latest_file_index = len(os.listdir(dir_path))
You need to extract the numbers from the filenames and convert them to integer to get proper numeric ordering.
For example like so:
from pathlib import Path
folder = Path('/home/pi/Desktop/timelapse')
highest = max(int(file.stem) for file in folder.glob('*.jpg'))
For more complicated file-name patterns this approach could be extended with regular expressions.
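For instance, if the stems carried a prefix (say img_0042.jpg, a hypothetical naming scheme), a regular expression could pull the number out before comparing. A sketch of that extension:

```python
import re
from pathlib import Path

def highest_numbered(folder, pattern=r'(\d+)'):
    """Return the largest integer embedded in any *.jpg stem, or None if none match."""
    best = None
    for file in Path(folder).glob('*.jpg'):
        m = re.search(pattern, file.stem)
        if m:
            n = int(m.group(1))
            if best is None or n > best:
                best = n
    return best
```

Because the comparison happens on integers, 10 correctly beats 9 here as well.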
Using re:
import re

filenames = [
    'file1.jpg',
    'file2.jpg',
    'file3.jpg',
    'file4.jpg',
    'fileA.jpg',
]

# Match any characters before a number followed by '.jpg'. The number is
# grouped with parentheses so we can extract it; files without a number
# (like 'fileA.jpg') are filtered out first, and the remaining files are
# compared by their extracted number as an integer.
pattern = re.compile(r'.*?(\d+)\.jpg')
numbered = [f for f in filenames if pattern.match(f)]
max_file = max(numbered, key=lambda f: int(pattern.match(f).group(1)))
print(f'The file with the maximum number is: {max_file}')
Output:
The file with the maximum number is: file4.jpg
Note: This will work whether there are letters before the number in the filename or not, so you can name the files (pretty much) whatever you want.
Second solution: use the creation date.
This is similar to the first, but we'll use the os module and iterate over the directory, returning the file with the latest creation date:
import os

_dir = r'C:\...\...'
max_file = max(os.listdir(_dir), key=lambda f: os.path.getctime(os.path.join(_dir, f)))
You can use os.walk(), because it gives you the list of filenames it finds. Then append to another list every value you get after removing the '.jpg' extension and casting the string to int; a simple call to max will do the rest.
import os

# taken from https://stackoverflow.com/questions/3207219/how-do-i-list-all-files-of-a-directory
_, _, filenames = next(os.walk(os.getcwd()), (None, None, []))
values = []
for filename in filenames:
    try:
        values.append(int(filename.lower().replace('.jpg', '')))
    except ValueError:
        pass  # not a file with format x.jpg
max_value = max(values)

Query Related to Python - Folders Read

I want to read folders in Python and make a list of them. My main concern is that the most recent folder should be at a location known to me: either the first element or the last element of the list. I am attaching an image suggesting the folder names. I want the folder named 20181005 to be either first or last in the list.
I have tried this task with os.listdir, but I am not confident about the order in which this function reads folders and stores them in the list. Would it order them by name, creation date, or modification date? If I could sort on the basis of name (20181005 etc.), that would be really good.
Kindly suggest a suitable method for the same.
Regards
os.listdir returns directory contents in arbitrary order, but you can sort that yourself:
import os
l = sorted(os.listdir())
Since it seems that your folder names are ISO dates, they should sort correctly and the most recent one should be the last element after sorting.
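A quick illustration of why zero-padded YYYYMMDD names sort chronologically:

```python
# Zero-padded date-style names compare character by character,
# which coincides with chronological order.
folders = ['20180101', '20181005', '20170930', '20180915']
ordered = sorted(folders)
latest = ordered[-1]
print(latest)  # 20181005
```

This only holds while all names share the same fixed-width format; mixing in differently named folders would break the correspondence.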
If you need to access creation & modification times you can do that with os.path functions. If you want to sort by that, I would probably choose to put it in something like a pandas DataFrame to make it easier to manipulate.
import os
from datetime import datetime
import pandas as pd

path = "."
objects = os.listdir(path)
dirs = list()
for o in objects:
    opath = os.path.join(path, o)
    if os.path.isdir(opath):
        dirs.append(dict(path=opath,
                         mtime=datetime.fromtimestamp(os.path.getmtime(opath)),
                         ctime=datetime.fromtimestamp(os.path.getctime(opath))))
data = pd.DataFrame(dirs)
data = data.sort_values(by='mtime')
Assuming your directories use YYYYMMDD format naming, you can use listdir and sort to get the latest directory as the last index.
import os
from os import listdir

mypath = 'D:\\anil'
list_dirs = []
for f in listdir(mypath):
    if os.path.isdir(os.path.join(mypath, f)):
        list_dirs.append(f)
list_dirs.sort()
for current_dir in list_dirs:
    print(current_dir)

Extract subset from several file names using python

I have a lot of files in a directory with name like:
'data_2000151_avg.txt', 'data_2000251_avg.txt', 'data_2003051_avg.txt'...
Assume that one of them is called fname. I would like to extract a subset from each like so:
fname.split('_')[1][:4]
This will give as a result, 2000. I want to collect these from all the files in the directory and create a unique list. How do I do that?
You should use os.
import os

dirname = 'PathToFile'
myuniquelist = []
for d in os.listdir(dirname):
    if d.startswith('data_'):
        myuniquelist.append(d.split('_')[1][:4])
EDIT: Just saw your comment on wanting a set. After the for loop add this line.
myuniquelist = list(set(myuniquelist))
If unique list means a list of unique values, then a combination of glob (in case the folder contains files that do not match the desired name format) and set should do the trick:
from glob import glob
uniques = {fname.split('_')[1][:4] for fname in glob('data_*_avg.txt')}
# In case you really do want a list
unique_list = list(uniques)
This assumes the files reside in the current working directory. Append path as necessary to glob('path/to/data_*_avg.txt').
For listing files in a directory you can use os.listdir(). For generating the list of unique values, a set comprehension is best suited.
import os
data = {f.split('_')[1][:4] for f in os.listdir(dir_path)}
list(data) #if you really need a list

Attempting to read data from multiple files to multiple arrays

I would like to be able to read data from multiple files in one folder to multiple arrays and then perform analysis on these arrays such as plot graphs etc. I am currently having trouble reading the data from these files into multiple arrays.
My solution process so far is as follows;
import numpy as np
import os

# Create an empty list to read filenames into
filenames = []
for file in os.listdir('C:\\folderwherefileslive'):
    filenames.append(file)
This works so far, what I'd like to do next is to iterate over the filenames in the list using numpy.genfromtxt.
I'm trying to use os.path.join to put each individual list entry at the end of the path specified in listdir earlier. This is some example code:
for i in filenames:
    file_name = os.path.join('C:\\entryfromabove', 'i')
    'data_'+[i] = np.genfromtxt('file_name', skiprows=2, delimiter=',')
This piece of code returns "Invalid syntax".
To sum up the solution process I'm trying to use so far:
1. Use os.listdir to get all the filenames in the folder I'm looking at.
2. Use os.path.join to direct np.genfromtxt to open and read data from each file to a numpy array named after that file.
I'm not experienced with python by any means - any tips or questions on what I'm trying to achieve are welcome.
For this kind of task you'd want to use a dictionary.
data = {}
folder = 'C:\\folderwherefileslive'
for file in os.listdir(folder):
    path = os.path.join(folder, file)
    data[file] = np.genfromtxt(path, skip_header=2, delimiter=',')
# now you could for example access
data['foo.txt']
Notice that everything you put within single or double quotes ends up being a character string, so 'file_name' is just those characters, whereas using file_name (without quotes) uses the value stored in the variable of that name.
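The quoting difference is easy to see in isolation (the filename below is a made-up value):

```python
file_name = 'measurements.csv'  # hypothetical value stored in the variable

literal = 'file_name'  # with quotes: the nine characters f-i-l-e-_-n-a-m-e
value = file_name      # without quotes: whatever the variable holds

print(literal)  # file_name
print(value)    # measurements.csv
```

The same rule explains why np.genfromtxt('file_name', ...) in the question fails: it looks for a file literally named file_name instead of using the path built by os.path.join.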
