Get file names into list using glob [duplicate] - python

I want to load file names into an array to load images from their paths. This job is done by the solution provided here. My code is something like this
fileNames = glob.glob(os.path.join(directory, "*.jpg"))
My filenames are something similar to this pattern
{videoNo}_{frameNo}_{patchNo}.jpg
For example their names are like this
1_1_1.jpg
1_1_2.jpg
1_1_3.jpg
.
.
.
10_1_1.jpg
10_1_2.jpg
When I load filenames into fileNames array, they are like this
10_1_1.jpg
10_1_2.jpg
.
.
.
1_1_1.jpg
1_1_2.jpg
1_1_3.jpg
As far as I know, this is because the ASCII code for _ is bigger than that for 0, so the list of names does not come out in numeric order. I must work with the sorted list. Can anyone give me a hand here?
EDIT
Please note that sorting these file names
["1_1_1.jpg", "10_1_3.jpg", "1_1_2.jpg", "10_1_2.jpg", "1_1_3.jpg", "1_20_1", "1_2_1", "1_14_1"]
should give this sorted list
["1_1_1.jpg", "1_1_2.jpg", "1_1_3.jpg", "1_2_1.jpg", "1_14_1", "1_20_1", "10_1_2.jpg", "10_1_3"]

The sorted builtin and list.sort method both take a key parameter that specifies how to do the sorting. If you want to sort by the numbers in the name (i.e. videoNo, then frameNo, then patchNo) you can split each name into these numbers:
fileNames = sorted(
    glob.glob(os.path.join(directory, "*.jpg")),
    # basename drops the directory, so only the file name itself is split
    key=lambda item: [
        int(part)
        for part in os.path.splitext(os.path.basename(item))[0].split('_')
    ],
)
The splitting strips off the .jpg extension, then cuts the name on each _. Conversion to int is needed because strings use lexicographic sorting, e.g. "2" > "10".
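As a minimal sketch, here is the same idea applied to a few names modelled on the question's EDIT. The helper name numeric_key and the "frames/" prefix are made up for illustration, and it assumes every name is underscore-separated integers plus an extension. It takes os.path.basename first, so a directory prefix never reaches int():

```python
import os

def numeric_key(path):
    # Hypothetical helper: drop the directory and the extension, then
    # compare the underscore-separated fields as integers.
    stem = os.path.splitext(os.path.basename(path))[0]
    return [int(part) for part in stem.split("_")]

# Made-up paths following the {videoNo}_{frameNo}_{patchNo}.jpg pattern.
names = ["frames/1_1_1.jpg", "frames/10_1_3.jpg", "frames/1_1_2.jpg",
         "frames/1_20_1.jpg", "frames/1_2_1.jpg", "frames/1_14_1.jpg"]
ordered = sorted(names, key=numeric_key)
print(ordered)
```

A name that does not fit the pattern (for example one with letters in a field) would raise ValueError here, so you may want to filter such names out first.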

You could use a regular expression to extract the numbers from the file names and sort by those:
>>> import re
>>> files = ["10_1_3.jpg", "1_10_2.jpg", "3_1_1.jpg", "30_1_2.jpg"]
>>> sorted(files, key=lambda f: tuple(map(int, re.findall(r"\d+", f))))
['1_10_2.jpg', '3_1_1.jpg', '10_1_3.jpg', '30_1_2.jpg']

Related

how to sort a list containing filenames?

I have a list of ~1000+ values in it. The values are the names of files in a folder which is given by os.listdir(folder_path)
code looks like this:
import os
folder_path = "some path here"
filelist = os.listdir(folder_path)
print(filelist)
Now when I look at the printed list, I see that the list isn't sorted by name. The filenames are something like ["text-1-1.txt","txt-1-23.txt","txt-1-32.txt","txt-1-10.txt","txt-2-1.txt","txt-2-32.txt"...]
Also, I know that there are filenames that increment by one, like: text-1-1.txt, text-1-2.txt, text-1-3.txt,.... text-2-1.txt, text-2-2.txt,...
I have tried these two methods to try and sort the list: new_list = sorted(filelist) & filelist.sort()
Neither worked: the list came out the same as the original. How can I sort this list? Do I have to write a sorting algorithm by hand (like bubble or selection sort)?
You can run it this way:
import os
folder_path = "some path here"
filelist = os.listdir(folder_path)
filelist.sort() #Added this line
print(filelist)
By default, Python already sorts strings in lexicographical order, but uppercase letters are all sorted before lowercase letters. If you want to sort strings and ignore case, then you can do
new_filelist = sorted(filelist, key=str.lower)
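A small sketch of the difference, using made-up names rather than the ones from the question:

```python
# Made-up file names that differ only in case.
filelist = ["b.txt", "A.txt", "a.txt", "B.txt"]

# Default ordering compares code points, so all uppercase names come first.
default_order = sorted(filelist)

# key=str.lower compares lowercased copies; the original strings are kept.
ignore_case = sorted(filelist, key=str.lower)

print(default_order)
print(ignore_case)
```

Because sorted is stable, names that compare equal after lowercasing keep their original relative order.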
You can create a custom function for this, that creates a tuple of ints from the filenames:
>>> def sl_no(s):
...     return tuple(map(int,s.split('.')[0].rsplit('-', 2)[-2:]))
...
>>> sl_no("text-1-1.txt")
(1, 1)
>>> sorted(filelist, key=sl_no)
['text-1-1.txt',
'txt-1-10.txt',
'txt-1-23.txt',
'txt-1-32.txt',
'txt-2-1.txt',
'txt-2-32.txt']
Or, you can use re:
>>> import re
>>> sorted(filelist, key=lambda x: tuple(map(int, re.findall(r'\d+', x))))
['text-1-1.txt',
'txt-1-10.txt',
'txt-1-23.txt',
'txt-1-32.txt',
'txt-2-1.txt',
'txt-2-32.txt']
In order to support all kinds of file names that contain numbers, you can define a sortKey function that isolates the numeric parts of the names and right-justifies them (with leading zeros) for the purpose of sorting:
import re
def sortKey(n):
    return "".join([s,f"{s:>010}"][s.isdigit()] for s in re.split(r"(\d+)",n))
output:
names = ["text-1-1.txt","txt-1-23.txt","txt-1-32.txt","txt-1-10.txt",
"txt-2-1.txt","txt-2-32.txt"]
print(sorted(names,key=sortKey))
# ['text-1-1.txt', 'txt-1-10.txt', 'txt-1-23.txt', 'txt-1-32.txt',
# 'txt-2-1.txt', 'txt-2-32.txt']
names = ["log2020/12/23.txt","log2021/1/3.txt","log2021/02/1.txt",
"log2021/1/1.txt","log2021/1/13.txt"]
print(sorted(names,key=sortKey))
# ['log2020/12/23.txt', 'log2021/1/1.txt', 'log2021/1/3.txt',
# 'log2021/1/13.txt', 'log2021/02/1.txt']

Extracting numbers from a filename string in python

I have a number of html files in a directory. I am trying to store the filenames in a list so that I can use it later to compare with another list.
Eg: Prod224_0055_00007464_20170930.html is one of the filenames. From the filename, I want to extract '00007464' and store this value in a list and repeat the same for all the other files in the directory. How do I go about doing this? I am new to Python and any help would be greatly appreciated!
Please let me know if you need more information to answer the question.
Split the filename on underscores and select the third element (index 2).
>>> 'Prod224_0055_00007464_20170930.html'.split('_')[2]
'00007464'
In context that might look like this:
nums = [f.split('_')[2] for f in os.listdir(dir) if f.endswith('.html')]
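Since the goal was to compare the extracted values with another list, a rough sketch of the whole flow could look like this. The file names stand in for the result of os.listdir, and other_ids is a hypothetical second list; both are assumptions for illustration:

```python
def extract_id(filename):
    # Assumes names shaped like Prod224_0055_00007464_20170930.html,
    # i.e. the wanted value is the third underscore-separated field.
    return filename.split("_")[2]

# Stand-in for os.listdir(some_directory); these names are made up.
filenames = [
    "Prod224_0055_00007464_20170930.html",
    "Prod224_0055_00007465_20170930.html",
    "notes.txt",
]
ids = [extract_id(f) for f in filenames if f.endswith(".html")]

# Hypothetical second list to compare against.
other_ids = ["00007465", "00009999"]
in_both = sorted(set(ids) & set(other_ids))
print(ids, in_both)
```

The values are kept as strings so the leading zeros survive; convert with int() only if you need numeric comparison.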
You may try this (assuming you are in the folder with the files):
import os
num_list = []
r, d, files = next(os.walk('.'))
for f in files:
    parts = f.split('_') # now `parts` contains ['Prod224', '0055', '00007464', '20170930.html']
    print(parts[2]) # this outputs '00007464'
    num_list.append(parts[2])
Assuming you have a certain pattern for your files, you can use a regex:
>>> import re
>>> s = 'Prod224_0055_00007464_20170930.html'
>>> desired_number = re.findall(r"\d+", s)[2]
>>> desired_number
'00007464'
Using a regex will help you getting not only that specific number you want, but also other numbers in the file name.
This will work if the names of your files follow the pattern "[some text][number]_[number]_[desired_number]_[a date].html". After getting the number, I think it will be very simple to use the append method to add that number to any list you want.
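If you want the pattern itself checked rather than just picking the third run of digits, one possible sketch is an anchored regex. The exact pattern below is my assumption about the layout (letters, a number, two more numbers, an 8-digit date), and wanted_number is a hypothetical helper name:

```python
import re

# Assumed layout: <text><number>_<number>_<wanted number>_<8-digit date>.html
PATTERN = re.compile(r"^[A-Za-z]+\d+_\d+_(\d+)_\d{8}\.html$")

def wanted_number(filename):
    # Return the third number as a string, or None if the name does not fit.
    match = PATTERN.match(filename)
    return match.group(1) if match else None

print(wanted_number("Prod224_0055_00007464_20170930.html"))
print(wanted_number("readme.html"))
```

Names that do not fit give None instead of an IndexError, which makes it easier to skip unrelated files in the same directory.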

Find file in directory with the highest number in the filename

My question is closely related to Python identify file with largest number as part of filename
I want to append files to a certain directory. The names of the files are: file1, file2......file^n. This works if I do it in one go, but when I want to add files again and find the last file added (in this case the file with the highest number), it treats 'file6' as higher than 'file100'.
How can I solve this?
import glob
import os
latest_file = max(sorted(list_of_files, key=os.path.getctime))
print latest_file
As you can see, I tried looking at the creation time and I also tried the modification time, but these can be the same, so that doesn't help.
EDIT: my filenames have the extension ".txt" after the number.
I'll try to solve it only using filenames, not dates.
You have to convert to an integer before applying the criterion; otherwise alphanumeric sorting applies to the whole filename.
Proof of concept:
import re
list_of_files = ["file1","file100","file4","file7"]
def extract_number(f):
    s = re.findall(r"\d+$", f)
    return (int(s[0]) if s else -1, f)
print(max(list_of_files,key=extract_number))
Result: file100
The key function extracts the digits found at the end of the filename and converts them to an integer; if none are found, it returns -1.
You don't need to sort to find the max; just pass the key to max directly.
If two files have the same index, the full filename breaks the tie (which explains the tuple key).
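The question's EDIT says the real names end in ".txt", in which case a pattern anchored on trailing digits finds nothing. A hedged variation that tolerates an optional extension might look like this (trailing_number and next_name are names I made up, and the "file{}.txt" naming scheme is assumed from the question):

```python
import re

def trailing_number(name):
    # Digits immediately before an optional extension, e.g. "file100.txt".
    # Falls back to -1 when the name carries no such number.
    match = re.search(r"(\d+)(?:\.[^.]*)?$", name)
    return (int(match.group(1)) if match else -1, name)

list_of_files = ["file1.txt", "file100.txt", "file6.txt"]
latest = max(list_of_files, key=trailing_number)

# Assumed naming scheme for the next file to append.
next_name = "file{}.txt".format(trailing_number(latest)[0] + 1)
print(latest, next_name)
```

As in the answer, the tuple key means the full name only matters when two files share the same number.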
Using the following regular expression you can get the number of each file:
import re
maxn = 0
for file in list_of_files:
    num = int(re.search(r'file(\d*)', file).group(1)) # assuming filename is "filexxx.txt"
    # compare num to previous max, e.g.
    maxn = num if num > maxn else maxn
At the end of the loop, maxn will be your highest filename number.

Reading files in a particular order in python

Let's say I have three files in a folder: file9.txt, file10.txt and file11.txt, and I want to read them in this particular order. Can anyone help me with this?
Right now I am using the code
import glob, os
for infile in glob.glob(os.path.join( '*.txt')):
    print("Current File Being Processed is: " + infile)
and it reads first file10.txt then file11.txt and then file9.txt.
Can someone help me how to get the right order?
Files on the filesystem are not sorted. You can sort the resulting filenames yourself using the sorted() function:
for infile in sorted(glob.glob('*.txt')):
    print("Current File Being Processed is: " + infile)
Note that the os.path.join call in your code is a no-op; with only one argument it doesn't do anything but return that argument unaltered.
Note that your files will sort in alphabetical ordering, which puts 10 before 9. You can use a custom key function to improve the sorting:
import re
numbers = re.compile(r'(\d+)')
def numericalSort(value):
    parts = numbers.split(value)
    parts[1::2] = map(int, parts[1::2])
    return parts
for infile in sorted(glob.glob('*.txt'), key=numericalSort):
    print("Current File Being Processed is: " + infile)
The numericalSort function splits out any digits in a filename, turns it into an actual number, and returns the result for sorting:
>>> files = ['file9.txt', 'file10.txt', 'file11.txt', '32foo9.txt', '32foo10.txt']
>>> sorted(files)
['32foo10.txt', '32foo9.txt', 'file10.txt', 'file11.txt', 'file9.txt']
>>> sorted(files, key=numericalSort)
['32foo9.txt', '32foo10.txt', 'file9.txt', 'file10.txt', 'file11.txt']
You can wrap your glob.glob( ... ) expression inside a sorted( ... ) statement and sort the resulting list of files. Example:
for infile in sorted(glob.glob('*.txt')):
You can give sorted a comparison function or, better, use the key= ... argument to give it a custom key that is used for sorting.
Example:
There are the following files:
x/blub01.txt
x/blub02.txt
x/blub10.txt
x/blub03.txt
y/blub05.txt
The following code will produce the following output:
for filename in sorted(glob.glob('[xy]/*.txt')):
    print(filename)
# x/blub01.txt
# x/blub02.txt
# x/blub03.txt
# x/blub10.txt
# y/blub05.txt
Now with key function:
def key_func(x):
    return os.path.split(x)[-1]
for filename in sorted(glob.glob('[xy]/*.txt'), key=key_func):
    print(filename)
# x/blub01.txt
# x/blub02.txt
# x/blub03.txt
# y/blub05.txt
# x/blub10.txt
EDIT:
Possibly this key function can sort your files:
pat = re.compile(r"(\d+)\D*$")
...
def key_func(x):
    mat = pat.search(os.path.split(x)[-1])  # match last group of digits
    if mat is None:
        return x
    return "{:>10}".format(mat.group(1))  # right-align to a width of 10
It sure can be improved, but I think you get the point. Paths without numbers will be left alone, paths with numbers will be converted to a string that is 10 digits wide and contains the number.
You need to change the sort from 'ASCIIBetical' to numeric by isolating the number in the filename. You can do that like so:
import re
def keyFunc(afilename):
    nondigits = re.compile(r"\D")
    return int(nondigits.sub("", afilename))
filenames = ["file10.txt", "file11.txt", "file9.txt"]
for x in sorted(filenames, key=keyFunc):
    print(x)
Where you can set filenames with the result of glob.glob("*.txt");
Additionally, the keyFunc function assumes the filename will have a number in it, and that the number is only in the filename. You can change that function to be as complex as you need to isolate the number you need to sort on.
glob.glob(os.path.join( '*.txt'))
returns a list of strings, so you can easily sort the list using Python's sorted() function.
sorted(glob.glob(os.path.join( '*.txt')))
for fname in ['file9.txt','file10.txt','file11.txt']:
    with open(fname) as f: # default open mode is for reading
        for line in f:
            pass  # do something with line
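Putting the pieces from this thread together, a self-contained sketch could look like the following. It creates its own throwaway files in a temporary directory so it can be run anywhere; natural_key is a hypothetical name, and pathlib is my substitution for the glob module used in the answers:

```python
import re
import tempfile
from pathlib import Path

def natural_key(path):
    # re.split with a capture group alternates text, digits, text, ...
    # so the odd positions are the digit runs; compare those as integers.
    parts = re.split(r"(\d+)", path.name)
    return [int(p) if i % 2 else p for i, p in enumerate(parts)]

with tempfile.TemporaryDirectory() as tmp:
    # Create file10.txt, file9.txt, file11.txt with some dummy content.
    for n in (10, 9, 11):
        Path(tmp, "file{}.txt".format(n)).write_text("contents of {}\n".format(n))

    order = []
    for path in sorted(Path(tmp).glob("*.txt"), key=natural_key):
        with path.open() as f:
            order.append((path.name, f.read().strip()))

print(order)
```

The key is built from the file name only, so digits in the directory part of the path do not affect the order.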

Extract and sort numbers from filenames in python

I have a very basic question. I have files named like Dipole_E0=1.2625E-01.dat and I want to extract the 1.2625E-01 part and finally sort them in ascending order. How can this be done? I tried first to split the filename with .split() but it does not do what I expect. Thanks for your help.
Best
Roland
Best way is to use regexp. To obtain value from file name:
m = re.search(r'^Dipole_E0=(.*)\.dat$', filename)
val = m.group(1)
Walk through all filenames and append each value to a list. After that, sort the list and that's all.
You want to look into regular expressions. In python they live in the re module. Depending on exact format, something like:
import re
ematch = re.compile(r"=([0-9]*\.[0-9]*[eE][+-][0-9]+)")
val = ematch.search(filename).group(1)
Sorting a list can be done with the .sort() method on lists, or the sorted(list) builtin, which give you a new list.
This is a good situation to use a generator expression and the sorted builtin:
sorted(float(filename.split("=", 1)[1].rsplit(".", 1)[0]) for filename in filenames)
Where filenames is your list of filenames.
>>> filenames = ["Dipole_E0=1.2625E-01.dat", "Dipole_E0=1.3625E-01.dat", "Dipole_E0=0.2625E-01.dat"]
>>> sorted(float(filename.split("=", 1)[1].rsplit(".", 1)[0]) for filename in filenames)
[0.02625, 0.12625, 0.13625]
You can get the filenames with the glob module.
from glob import glob
file_names = glob("yourpath/*.dat")
vals = []
for name in file_names:
    vals.append(float(name[:-4].rpartition("=")[2]))
vals.sort()
name[:-4] throws away the ".dat". rpartition is a string method. It returns a tuple where entry 0 is the string left of the string used to split, entry 1 is the string used to split (here: "=") and entry 2 is the string right of this string (here: your float). Then it is converted to a float and appended to the list of values.
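The answers above sort the extracted values. If you would rather keep the filenames and order them by that value, a sketch using a key function might look like this (e0_value is a made-up helper name, and it assumes every name has the Dipole_E0=<float>.dat shape from the question):

```python
def e0_value(filename):
    # Assumes names like "Dipole_E0=1.2625E-01.dat": the value sits between
    # the last "=" and the final extension.
    stem = filename.rsplit(".", 1)[0]
    return float(stem.rpartition("=")[2])

filenames = ["Dipole_E0=1.2625E-01.dat", "Dipole_E0=1.3625E-01.dat",
             "Dipole_E0=0.2625E-01.dat"]

# Sort the names themselves, ascending by the embedded value.
by_value = sorted(filenames, key=e0_value)
print(by_value)
```

A name without a parsable float after the "=" would raise ValueError in float(), so filter the list first if the directory holds other files.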
