My question is closely related to Python identify file with largest number as part of filename
I want to add files to a certain directory. The files are named file1, file2, ..., fileN. This works if I do it in one go, but when I want to add files again and need to find the last file added (in this case the file with the highest number), it treats 'file6' as higher than 'file100'.
How can I solve this?
import glob
import os
latest_file = max(sorted(list_of_files, key=os.path.getctime))
print latest_file
As you can see, I tried looking at the creation time, and I also tried the modification time, but these can be identical, so that doesn't help.
EDIT: my filenames have the extension ".txt" after the number.
I'll try to solve it only using filenames, not dates.
You have to convert the number to an integer before comparing; otherwise the comparison is alphanumeric and applies to the whole filename.
Proof of concept:
import re
list_of_files = ["file1","file100","file4","file7"]
def extract_number(f):
    s = re.findall(r"\d+$", f)
    return (int(s[0]) if s else -1, f)
print(max(list_of_files, key=extract_number))
result: file100
the key function extracts the digits found at the end of the file and converts to integer, and if nothing is found returns -1
you don't need to sort to find the max, just pass the key to max directly
if 2 files have the same index, use full filename to break tie (which explains the tuple key)
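Since the question's EDIT says the names end in ".txt", the `\d+$` pattern would not match them as-is. A minimal sketch of the same idea that looks for the digits right before the extension (the sample names and the fixed ".txt" suffix are assumptions for illustration):

```python
import re

def extract_number(name):
    # Assumption: the number sits directly before a ".txt" extension.
    match = re.search(r"(\d+)\.txt$", name)
    # Fall back to -1 for names without a number; the name breaks ties.
    return (int(match.group(1)) if match else -1, name)

list_of_files = ["file1.txt", "file100.txt", "file6.txt", "notes.txt"]
latest_file = max(list_of_files, key=extract_number)
print(latest_file)
```

The tuple key is kept from the proof of concept, so two names with the same number are still ordered by their full name.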
Using the following regular expression you can get the number of each file:
import re
maxn = 0
for file in list_of_files:
    num = int(re.search(r'file(\d+)', file).group(1))  # assuming filename is "filexxx.txt"
    # compare num to previous max
    maxn = max(maxn, num)
At the end of the loop, maxn will be your highest filename number.
I have various tar files in a Desktop folder (Ubuntu).
The filename is like this:
esarchive--James-AB-Test226-8037affd-06d1-4c61-a91f-816ec9cb825f-**05222017**-4.tar
The boldfaced part is the date. I want to sort the files in date order, most recent first.
Is there a simple python solution for this?
import glob
import datetime
import re
timeformat = "%m%d%Y"
regex = re.compile(r"^esarchive--James-AB-Test226-8037affd-06d1-4c61-a91f-816ec9cb825f-(\d*\d*)")
def gettimestamp(thestring):
    m = regex.search(thestring)
    return datetime.datetime.strptime(m.groups()[0], timeformat)
list_of_filenames = ['esarchive--James-AB-Test226-8037affd-06d1-4c61-a91f-816ec9cb825f-05212017-4','esarchive--James-AB-Test226-8037affd-06d1-4c61-a91f-816ec9cb825f-05202017-4']
for fn in sorted(list_of_filenames, key=gettimestamp):
    print(fn)
No, there is not a simple Python function for this. However, there are reasonably simple building blocks from which you can make a readable solution.
Write a function to extract the date and rearrange it to be useful as a sort key. Find the last two hyphens in the file name, grab the string between them, and then rearrange the digits into the format yyyymmdd (year-month-day). Return that string or integer (either will work) as the function's return value.
For your main routine, collect all the file names in a list (or make a generator) and sort them, using the value of that function as the sort key.
See the sorting wiki for some implementation details.
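The steps described above could be sketched roughly like this. It assumes every name has the form `<prefix>-<mmddyyyy>-<suffix>`; the function name `date_key` and the second sample name (with the date 12312016) are made up for illustration:

```python
def date_key(filename):
    # Split from the right on the last two hyphens:
    # '<prefix>-<mmddyyyy>-<suffix>' -> [prefix, mmddyyyy, suffix]
    parts = filename.rsplit("-", 2)
    mmddyyyy = parts[1]
    # Rearrange to yyyymmdd so plain string comparison follows date order.
    return mmddyyyy[4:] + mmddyyyy[:2] + mmddyyyy[2:4]

names = [
    "esarchive--James-AB-Test226-8037affd-06d1-4c61-a91f-816ec9cb825f-05212017-4.tar",
    "esarchive--James-AB-Test226-8037affd-06d1-4c61-a91f-816ec9cb825f-12312016-4.tar",
    "esarchive--James-AB-Test226-8037affd-06d1-4c61-a91f-816ec9cb825f-05222017-4.tar",
]

# reverse=True puts the most recent date first, as the question asks.
newest_first = sorted(names, key=date_key, reverse=True)
for name in newest_first:
    print(name)
```

Names that do not follow the assumed layout would need a guard before indexing `parts[1]`.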
As Adam Smith has pointed out, you need the list of files to work with.
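import os
import datetime
import re
timeformat = "%m%d%Y"
# eight date digits, a hyphen, the trailing counter, then ".tar"
regex = re.compile(r"(\d{8})-\d+\.tar$")
def gettimestamp(thestring):
    # only look at the tail of the name, e.g. "05222017-4.tar"
    m = regex.search(thestring[-14:])
    if m:
        return datetime.datetime.strptime(m.group(1), timeformat)
    # names without a date sort first (None cannot be compared to a datetime in Python 3)
    return datetime.datetime.min
list_of_filenames = os.listdir('/home/james/Desktop/tarfolder')
for fn in sorted(list_of_filenames, key=gettimestamp):
    print(fn)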
Edit: As Martineu noticed, the hash might differ from the one you indicated, so it is easier to discard the beginning of the name in advance.
You don't need to parse the date, or even use regex for that matter. If the file names are structured as you say, it's sufficient to do just:
filenames = ['esarchive--James-AB-Test226-8037affd-06d1-4c61-a91f-816ec9cb825f-05212017-4',
'esarchive--James-AB-Test226-8037affd-06d1-4c61-a91f-816ec9cb825f-05202017-4',
'esarchive--James-AB-Test226-8037affd-06d1-4c61-a91f-816ec9cb825f-bad_date-4',]
def parse_date(name, offset=-10):
    try:
        date_str = name[offset:offset+8]
        return int(date_str[-4:] + date_str[:2] + date_str[2:4])
    except (IndexError, TypeError, ValueError):  # invalid file name
        return -1
sorted_list = [x[1] for x in sorted((parse_date(l), l) for l in filenames) if x[0] != -1]
# ['esarchive--James-AB-Test226-8037affd-06d1-4c61-a91f-816ec9cb825f-05202017-4',
# 'esarchive--James-AB-Test226-8037affd-06d1-4c61-a91f-816ec9cb825f-05212017-4']
UPDATE - I've added the offset argument to specify where in the file name the date begins. In the list you've posted it begins 10 characters from the end (the default), but if the name has a .tar extension, as in your initial example, you need to account for those 4 characters as well and use an offset of -14:
names = ['James-AB-Test226-8037affd-06d1-4c61-a91f-816ec9cb825f-05212017-4.tar',
'James-AB-Test226-8037affd-06d1-4c61-a91f-816ec9cb825f-05202017-4.tar',
'James-AB-Test226-8037affd-06d1-4c61-a91f-816ec9cb825f-bad_date-4.tar']
sorted_list = [x[1] for x in sorted((parse_date(l, -14), l) for l in names) if x[0] != -1]
# ['James-AB-Test226-8037affd-06d1-4c61-a91f-816ec9cb825f-05202017-4.tar',
# 'James-AB-Test226-8037affd-06d1-4c61-a91f-816ec9cb825f-05212017-4.tar']
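If the names may or may not carry an extension, one option is to strip it first so the default offset always applies. This is only a sketch: `parse_date_any_ext` is a hypothetical helper, `parse_date` is repeated from above so the block runs on its own, and `reverse=True` is added because the question asked for the most recent file first.

```python
import os

def parse_date(name, offset=-10):
    # Same idea as the function above: slice out mmddyyyy, reorder to yyyymmdd.
    try:
        date_str = name[offset:offset + 8]
        return int(date_str[-4:] + date_str[:2] + date_str[2:4])
    except (IndexError, TypeError, ValueError):  # invalid file name
        return -1

def parse_date_any_ext(name):
    # Hypothetical helper: drop the extension (if any) before slicing.
    stem, _ext = os.path.splitext(name)
    return parse_date(stem)

names = [
    "James-AB-Test226-8037affd-06d1-4c61-a91f-816ec9cb825f-05212017-4.tar",
    "James-AB-Test226-8037affd-06d1-4c61-a91f-816ec9cb825f-05202017-4",
    "James-AB-Test226-8037affd-06d1-4c61-a91f-816ec9cb825f-bad_date-4.tar",
]

# Drop names without a valid date, then sort with the newest date first.
dated = [n for n in names if parse_date_any_ext(n) != -1]
newest_first = sorted(dated, key=parse_date_any_ext, reverse=True)
print(newest_first)
```

This still assumes a single-character counter after the date, exactly as the fixed offsets above do.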
I am currently writing a script that cycles through all the files in a folder and renames them according to a naming convention.
What I would like to achieve is the following; if the script finds 2 files that have the same number in the filename (e.g. '101 test' and '101 real') it will move those two files to a different folder named 'duplicates'.
My original plan was to use glob to cycle through all the files in the folder and add every file containing a certain number to a list. The length of the list would then be checked, and if it exceeded 1 (i.e. there are 2 files with the same number), the files would be moved to this 'duplicates' folder. However, for some reason this does not work.
Here is my code, I was hoping someone with more experience than me can give me some insight into how to achieve my goal, Thanks!:
app = askdirectory(parent=root)
for x in range(804):
    listofnames = []
    real = os.path.join(app, '*{}*').format(x)
    for name in glob.glob(real):
        listofnames.append(name)
    y = len(listofnames)
    if y > 1:
        for names in listofnames:
            path = os.path.join(app, names)
            shutil.move(path,app + "/Duplicates")
A simple way is to collect filenames with numbers in a structure like this:
numbers = {
101: ['101 test', '101 real'],
93: ['hugo, 93']
}
and if a list in this dict is longer than one do the move.
import os
import re
import shutil
from collections import defaultdict
app = askdirectory(parent=root)
# make sure the target folder exists; otherwise the first move would rename a file to "Duplicates"
duplicates_dir = os.path.join(app, "Duplicates")
if not os.path.isdir(duplicates_dir):
    os.makedirs(duplicates_dir)
# a magic dict
numbers = defaultdict(list)
# list all files in this dir
for filename in os.listdir(app):
    # \d+ means a decimal number of any length
    match = re.search(r'\d+', filename)
    if match is None:
        # no digits found
        continue
    # extract the number
    number = int(match.group())
    # defaultdict magic
    numbers[number].append(filename)
for number, filenames in numbers.items():
    if len(filenames) < 2:
        # not a dupe
        continue
    for filename in filenames:
        shutil.move(os.path.join(app, filename), duplicates_dir)
defaultdict magic is just shorthand for the following code:
if number not in numbers:
    numbers[number] = list()
numbers[number].append(filename)
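To try the grouping step without touching the filesystem, the same logic can be run on a plain list of names. A small sketch (the function name `find_duplicate_numbers` and the sample names are illustrative only):

```python
import re
from collections import defaultdict

def find_duplicate_numbers(filenames):
    """Group names by the first number they contain; keep groups of 2 or more."""
    groups = defaultdict(list)
    for name in filenames:
        match = re.search(r"\d+", name)
        if match:
            groups[int(match.group())].append(name)
    return {num: names for num, names in groups.items() if len(names) > 1}

sample = ["101 test", "101 real", "hugo, 93", "no number"]
duplicates = find_duplicate_numbers(sample)
print(duplicates)
```

Only the names returned here would then be passed to shutil.move.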
I have a directory of files, some of them image files. Some of those image files are a sequence of images. They could be named image-000001.png, image-000002.png and so on, or perhaps 001_sequence.png, 002_sequence.png etc.
How can we identify images that would, to a human, appear by their names to be fairly obviously in a sequence? This would mean identifying only those image filenames that have increasing numbers and all have a similar form of filename.
The similar part of the filename would not be pre-defined.
You can use a regular expression to get files adhering to a certain pattern, e.g. .*\d+.*\.(jpg|png) for anything, then a number, then more anything, and an image extension.
files = ["image-000001.png", "image-000002.png", "001_sequence.png",
"002_sequence.png", "not an image 1.doc", "not an image 2.doc",
"other stuff.txt", "singular image.jpg"]
import re
image_files = [f for f in files if re.match(r".*\d+.*\.(jpg|png)", f)]
Now, group those image files by replacing the number with some generic string, e.g. XXX:
import collections
patterns = collections.defaultdict(list)
for f in image_files:
    p = re.sub(r"\d+", "XXX", f)
    patterns[p].append(f)
As a result, patterns is
{'image-XXX.png': ['image-000001.png', 'image-000002.png'],
'XXX_sequence.png': ['001_sequence.png', '002_sequence.png']}
Similarly, it should not be too hard to check whether all those numbers are consecutive, but maybe that's not really necessary after all. Note, however, that this will have problems discriminating numbered series such as "series1_001.jpg", and "series2_001.jpg".
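A rough sketch of the consecutive check mentioned above. Unlike the re.sub call, it replaces only the last run of digits in each name, which is an assumption made here so that "series1_..." and "series2_..." end up in different groups; it would misbehave for extensions that contain digits.

```python
import re
from collections import defaultdict

def find_sequences(files):
    """Return {pattern: sorted numbers} for groups with consecutive numbers."""
    groups = defaultdict(list)
    for name in files:
        # Last run of digits: digits followed only by non-digits up to the end.
        match = re.search(r"\d+(?=\D*$)", name)
        if match:
            pattern = name[:match.start()] + "XXX" + name[match.end():]
            groups[pattern].append(int(match.group()))
    sequences = {}
    for pattern, nums in groups.items():
        nums.sort()
        # A sequence needs at least two members and no gaps or repeats.
        if len(nums) > 1 and nums == list(range(nums[0], nums[-1] + 1)):
            sequences[pattern] = nums
    return sequences

files = ["image-000001.png", "image-000002.png", "image-000003.png",
         "001_sequence.png", "003_sequence.png", "singular image.jpg",
         "series1_001.jpg", "series1_002.jpg", "series2_001.jpg"]
sequences = find_sequences(files)
print(sequences)
```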
What I would suggest is to run a regex over the file names and group each matching pattern with the list of associated numbers from the file name.
Once this is done, just loop through the dictionary's keys and check that the number of elements equals the size of the range of matched numbers.
import re
from collections import defaultdict
from os import listdir
files = listdir("/the/path/")
found_patterns = defaultdict(list)
p = re.compile(r"(.*?)(\d+)(.*)\.png")
for f in files:
    s = p.match(f)
    if s:
        pattern = s.group(1) + "___" + s.group(3)
        num = int(s.group(2))
        found_patterns[pattern].append(num)
for pattern, found in found_patterns.items():
    mini, maxi = min(found), max(found)
    if len(found) == maxi - mini + 1:
        print("Pattern correct: %s" % pattern)
Of course, this will not work if some values are missing, but you can allow for some tolerance.
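One way such a tolerance might look, as a sketch only: count how many numbers are missing from the min..max range and accept the group if that count stays under a limit. The name `is_sequence` and the `max_missing` parameter are made up for this example.

```python
def is_sequence(numbers, max_missing=0):
    """True if the numbers cover their min..max range with at most max_missing gaps."""
    unique = set(numbers)
    if len(unique) < 2:
        # A single number is not a sequence.
        return False
    expected = max(unique) - min(unique) + 1
    return expected - len(unique) <= max_missing

print(is_sequence([1, 2, 3]))                 # complete range
print(is_sequence([1, 2, 4]))                 # 3 is missing, no tolerance
print(is_sequence([1, 2, 4], max_missing=1))  # one gap allowed
```

Using a set also keeps repeated numbers from being counted twice.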
Let's say I have three files in a folder: file9.txt, file10.txt and file11.txt, and I want to read them in this particular order. Can anyone help me with this?
Right now I am using the code
import glob, os
for infile in glob.glob(os.path.join( '*.txt')):
    print("Current File Being Processed is: " + infile)
and it reads first file10.txt then file11.txt and then file9.txt.
Can someone help me how to get the right order?
Files on the filesystem are not sorted. You can sort the resulting filenames yourself using the sorted() function:
for infile in sorted(glob.glob('*.txt')):
    print("Current File Being Processed is: " + infile)
Note that the os.path.join call in your code is a no-op; with only one argument it doesn't do anything but return that argument unaltered.
Note that your files will sort in alphabetical ordering, which puts 10 before 9. You can use a custom key function to improve the sorting:
import re
numbers = re.compile(r'(\d+)')
def numericalSort(value):
    parts = numbers.split(value)
    parts[1::2] = map(int, parts[1::2])
    return parts
for infile in sorted(glob.glob('*.txt'), key=numericalSort):
    print("Current File Being Processed is: " + infile)
The numericalSort function splits out any digits in a filename, turns it into an actual number, and returns the result for sorting:
>>> files = ['file9.txt', 'file10.txt', 'file11.txt', '32foo9.txt', '32foo10.txt']
>>> sorted(files)
['32foo10.txt', '32foo9.txt', 'file10.txt', 'file11.txt', 'file9.txt']
>>> sorted(files, key=numericalSort)
['32foo9.txt', '32foo10.txt', 'file9.txt', 'file10.txt', 'file11.txt']
You can wrap your glob.glob( ... ) expression inside a sorted( ... ) statement and sort the resulting list of files. Example:
for infile in sorted(glob.glob('*.txt')):
You can give sorted a comparison function or, better, use the key= ... argument to give it a custom key that is used for sorting.
Example:
There are the following files:
x/blub01.txt
x/blub02.txt
x/blub10.txt
x/blub03.txt
y/blub05.txt
The following code will produce the following output:
for filename in sorted(glob.glob('[xy]/*.txt')):
    print(filename)
# x/blub01.txt
# x/blub02.txt
# x/blub03.txt
# x/blub10.txt
# y/blub05.txt
Now with key function:
def key_func(x):
    return os.path.split(x)[-1]
for filename in sorted(glob.glob('[xy]/*.txt'), key=key_func):
    print(filename)
# x/blub01.txt
# x/blub02.txt
# x/blub03.txt
# y/blub05.txt
# x/blub10.txt
EDIT:
Possibly this key function can sort your files:
pat = re.compile(r"(\d+)\D*$")
...
def key_func(x):
    mat = pat.search(os.path.split(x)[-1])  # match last group of digits
    if mat is None:
        return x
    return "{:>10}".format(mat.group(1))  # right align to 10 characters
It sure can be improved, but I think you get the point. Paths without numbers will be left alone; paths with numbers will be converted to a string that is 10 characters wide, padded with spaces on the left, and contains the number.
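One possible improvement: the space-padded string compares character by character, so a zero-padded "05" ends up after an unpadded "7" (a space sorts before "0"). A sketch that compares real integers instead; the -1 fallback for names without digits and the sample paths are assumptions of this example:

```python
import os
import re

pat = re.compile(r"(\d+)\D*$")

def key_func(path):
    name = os.path.basename(path)
    mat = pat.search(name)  # last group of digits in the file name
    # Names without digits get -1 and therefore sort first.
    number = int(mat.group(1)) if mat else -1
    return (number, name)

paths = ["x/blub10.txt", "x/blub7.txt", "y/blub05.txt", "x/readme.txt"]
ordered = sorted(paths, key=key_func)
print(ordered)
```

Returning a tuple keeps every key the same type, which matters in Python 3 where strings and integers cannot be compared with each other.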
You need to change the sort from 'ASCIIBetical' to numeric by isolating the number in the filename. You can do that like so:
import re
nondigits = re.compile(r"\D")
def keyFunc(afilename):
    return int(nondigits.sub("", afilename))
filenames = ["file10.txt", "file11.txt", "file9.txt"]
for x in sorted(filenames, key=keyFunc):
    print(x)
Here you can set filenames to the result of glob.glob("*.txt").
Additionally, the keyFunc function assumes the filename will have a number in it, and that the number is only in the filename. You can change that function to be as complex as you need to isolate the number you need to sort on.
glob.glob(os.path.join( '*.txt'))
returns a list of strings, so you can easily sort the list using Python's sorted() function.
sorted(glob.glob(os.path.join( '*.txt')))
for fname in ['file9.txt','file10.txt','file11.txt']:
    with open(fname) as f:  # default open mode is for reading
        for line in f:
            pass  # do something with line
I would like to ask how to efficiently handle accessing of filenames in a folder in the right order (alphabetical and increasing in number).
For example, I have the following files in a folder: apple1.dat, apple2.dat, apple10.dat, banana1.dat, banana2.dat, banana10.dat. I would like to read the contents of the files such that apple1.dat will be read first and banana10.dat will be read last.
Thanks.
This is what I did so far.
from glob import glob
files = glob('*.dat')
for name in files:
    pass  # I read the files here in order
But as pointed out, apple10.dat comes before apple2.dat
from glob import glob
import os
files_list = glob(os.path.join(my_folder, '*.dat'))
for a_file in sorted(files_list):
    # do whatever with the file
    # 'open' or 'with' statements depending on your python version
    pass
Try this one.
import os
def get_sorted_files(Directory):
    filenamelist = []
    for root, dirs, files in os.walk(Directory):
        for name in files:
            fullname = os.path.join(root, name)
            filenamelist.append(fullname)
    return sorted(filenamelist)
You have to cast the numbers to an int first. Doing it the long way would require breaking the names into the strings and numbers, casting the numbers to an int and sorting. Perhaps someone else has a shorter or more efficient way.
def split_in_two(str_in):
    ## go from right to left until a letter is found
    ## assume first letter of name is not a digit
    for ctr in range(len(str_in)-1, 0, -1):
        if not str_in[ctr].isdigit():
            return str_in[:ctr+1], str_in[ctr+1:]  ## ctr+1 = first digit
    ## default for no letters found
    return str_in, "0"
files=['apple1.dat', 'apple2.dat', 'apple10.dat', 'apple11.dat',
       'banana1.dat', 'banana10.dat', 'banana2.dat']
print(sorted(files))  ## sorted as you say
sort_numbers = []
for f in files:
    ## split off '.dat'
    no_ending = f[:-4]
    str_1, str_2 = split_in_two(no_ending)
    sort_numbers.append([str_1, int(str_2), ".dat"])
sort_numbers.sort()
print(sort_numbers)
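A shorter way, sketched with a regular expression: split each name into text and digit chunks and turn the digit chunks into integers, then let sorted() compare the resulting lists. The function name `natural_key` is made up here.

```python
import re

def natural_key(name):
    # re.split with a capturing group keeps the digit runs in the result,
    # e.g. 'apple10.dat' -> ['apple', '10', '.dat'].
    return [int(chunk) if chunk.isdigit() else chunk
            for chunk in re.split(r"(\d+)", name)]

files = ["apple10.dat", "banana2.dat", "apple1.dat", "banana10.dat",
         "apple2.dat", "banana1.dat"]
ordered = sorted(files, key=natural_key)
print(ordered)
```

Because the split always alternates text, digits, text, the keys compare strings with strings and integers with integers, so this also runs on Python 3.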