Let's say I have three files in a folder: file9.txt, file10.txt and file11.txt, and I want to read them in this particular order. Can anyone help me with this?
Right now I am using the code
import glob, os
for infile in glob.glob(os.path.join( '*.txt')):
    print("Current File Being Processed is: " + infile)
and it reads first file10.txt then file11.txt and then file9.txt.
Can someone help me how to get the right order?
Files on the filesystem are not sorted. You can sort the resulting filenames yourself using the sorted() function:
for infile in sorted(glob.glob('*.txt')):
    print("Current File Being Processed is: " + infile)
Note that the os.path.join call in your code is a no-op; with only one argument it doesn't do anything but return that argument unaltered.
Also note that your files will sort in lexicographic (character-by-character) order, which puts file10.txt before file9.txt. You can use a custom key function to improve the sorting:
import re
numbers = re.compile(r'(\d+)')
def numericalSort(value):
    # split keeps the digit runs at the odd indices
    parts = numbers.split(value)
    parts[1::2] = map(int, parts[1::2])
    return parts
for infile in sorted(glob.glob('*.txt'), key=numericalSort):
    print("Current File Being Processed is: " + infile)
The numericalSort function splits out each run of digits in a filename, turns it into an actual number, and returns the resulting list for sorting:
>>> files = ['file9.txt', 'file10.txt', 'file11.txt', '32foo9.txt', '32foo10.txt']
>>> sorted(files)
['32foo10.txt', '32foo9.txt', 'file10.txt', 'file11.txt', 'file9.txt']
>>> sorted(files, key=numericalSort)
['32foo9.txt', '32foo10.txt', 'file9.txt', 'file10.txt', 'file11.txt']
You can wrap your glob.glob( ... ) expression inside a sorted( ... ) statement and sort the resulting list of files. Example:
for infile in sorted(glob.glob('*.txt')):
You can give sorted a comparison function or, better, use the key= ... argument to give it a custom key that is used for sorting.
Example:
There are the following files:
x/blub01.txt
x/blub02.txt
x/blub10.txt
x/blub03.txt
y/blub05.txt
The following code will produce the following output:
for filename in sorted(glob.glob('[xy]/*.txt')):
    print(filename)
# x/blub01.txt
# x/blub02.txt
# x/blub03.txt
# x/blub10.txt
# y/blub05.txt
Now with key function:
def key_func(x):
    return os.path.split(x)[-1]
for filename in sorted(glob.glob('[xy]/*.txt'), key=key_func):
    print(filename)
# x/blub01.txt
# x/blub02.txt
# x/blub03.txt
# y/blub05.txt
# x/blub10.txt
EDIT:
Possibly this key function can sort your files:
pat = re.compile(r"(\d+)\D*$")
...
def key_func(x):
    mat = pat.search(os.path.split(x)[-1])  # match last group of digits
    if mat is None:
        return x
    return "{:>10}".format(mat.group(1))  # right align to 10 characters
It sure can be improved, but I think you get the point. Paths without numbers will be left alone, paths with numbers will be converted to a string that is 10 digits wide and contains the number.
You need to change the sort from 'ASCIIBetical' to numeric by isolating the number in the filename. You can do that like so:
import re
nondigits = re.compile(r"\D")
def keyFunc(afilename):
    # drop every non-digit character and compare what is left as an integer
    return int(nondigits.sub("", afilename))
filenames = ["file10.txt", "file11.txt", "file9.txt"]
for x in sorted(filenames, key=keyFunc):
    print(x)
Here you can set filenames to the result of glob.glob("*.txt").
Additionally, the keyFunc function assumes the filename will have a number in it, and that the number is only in the filename. You can change that function to be as complex as you need to isolate the number you need to sort on.
glob.glob(os.path.join( '*.txt'))
returns a list of strings, so you can easily sort the list using Python's sorted() function.
sorted(glob.glob(os.path.join( '*.txt')))
for fname in ['file9.txt','file10.txt','file11.txt']:
    with open(fname) as f:  # default open mode is for reading
        for line in f:
            pass  # do something with line
I have a list with 1000+ values in it. The values are the names of files in a folder, as returned by os.listdir(folder_path).
code looks like this:
import os
folder_path = "some path here"
filelist = os.listdir(folder_path)
print(filelist)
Now when I look at the printed list, I see that the list isn't sorted by name. The filenames are something like ["text-1-1.txt","txt-1-23.txt","txt-1-32.txt","txt-1-10.txt","txt-2-1.txt","txt-2-32.txt"...]
Also, I know that there are filenames that increment by one, like: text-1-1.txt, text-1-2.txt, text-1-3.txt,.... text-2-1.txt, text-2-2.txt,...
I have tried these two methods to try and sort the list: new_list = sorted(filelist) & filelist.sort()
Both did not work and the list came out to be the same as the original, how can I sort this list? Do I have to manually write sorting algorithms(like Bubble, or Selection)?
You can run it this way:
import os
folder_path = "some path here"
filelist = os.listdir(folder_path)
filelist.sort() #Added this line
print(filelist)
By default, Python already sorts strings in lexicographical order, but uppercase letters are all sorted before lowercase letters. If you want to sort strings and ignore case, you can do:
new_filelist = sorted(filelist, key=str.lower)
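To make the difference concrete, here is a small sketch. The file names are made up for illustration and are not from the question:

```python
# Hypothetical file names, chosen only to show the effect of letter case.
filelist = ["beta.txt", "Alpha.txt", "alpha-2.txt", "Zeta.txt"]

# Default ordering compares code points, so "A".."Z" come before "a".."z".
default_order = sorted(filelist)

# key=str.lower compares lowercased copies; the original names are returned.
ignore_case = sorted(filelist, key=str.lower)

print(default_order)  # ['Alpha.txt', 'Zeta.txt', 'alpha-2.txt', 'beta.txt']
print(ignore_case)    # ['alpha-2.txt', 'Alpha.txt', 'beta.txt', 'Zeta.txt']
```

Note that neither variant compares the numbers in the names numerically; that needs a key that converts the digits to integers.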
You can create a custom function for this, that creates a tuple of ints from the filenames:
>>> def sl_no(s):
...     return tuple(map(int,s.split('.')[0].rsplit('-', 2)[-2:]))
...
>>> sl_no("text-1-1.txt")
(1, 1)
>>> sorted(filelist, key=sl_no)
['text-1-1.txt',
'txt-1-10.txt',
'txt-1-23.txt',
'txt-1-32.txt',
'txt-2-1.txt',
'txt-2-32.txt']
Or, you can use re:
>>> import re
>>> sorted(filelist, key=lambda x: tuple(map(int, re.findall(r'\d+', x))))
['text-1-1.txt',
'txt-1-10.txt',
'txt-1-23.txt',
'txt-1-32.txt',
'txt-2-1.txt',
'txt-2-32.txt']
In order to support all kinds of file names that contain numbers, you can define a sortKey function that isolates the numeric parts of the names and right-justifies them (with leading zeros) for the purpose of sorting:
import re
def sortKey(n):
    # pad every digit run to 10 characters so string comparison orders it numerically
    return "".join([s,f"{s:>010}"][s.isdigit()] for s in re.split(r"(\d+)",n))
output:
names = ["text-1-1.txt","txt-1-23.txt","txt-1-32.txt","txt-1-10.txt",
"txt-2-1.txt","txt-2-32.txt"]
print(sorted(names,key=sortKey))
# ['text-1-1.txt', 'txt-1-10.txt', 'txt-1-23.txt', 'txt-1-32.txt',
# 'txt-2-1.txt', 'txt-2-32.txt']
names = ["log2020/12/23.txt","log2021/1/3.txt","log2021/02/1.txt",
"log2021/1/1.txt","log2021/1/13.txt"]
print(sorted(names,key=sortKey))
# ['log2020/12/23.txt', 'log2021/1/1.txt', 'log2021/1/3.txt',
# 'log2021/1/13.txt', 'log2021/02/1.txt']
I have various tar files in a Desktop folder (Ubuntu).
The filename is like this:
esarchive--James-AB-Test226-8037affd-06d1-4c61-a91f-816ec9cb825f-**05222017**-4.tar
The boldfaced part is the date. I want to sort the files in date order, most recent first.
Is there a simple python solution for this?
import glob
import datetime
import re
timeformat = "%m%d%Y"
regex = re.compile(r"^esarchive--James-AB-Test226-8037affd-06d1-4c61-a91f-816ec9cb825f-(\d{8})")
def gettimestamp(thestring):
    m = regex.search(thestring)
    return datetime.datetime.strptime(m.group(1), timeformat)
list_of_filenames = ['esarchive--James-AB-Test226-8037affd-06d1-4c61-a91f-816ec9cb825f-05212017-4','esarchive--James-AB-Test226-8037affd-06d1-4c61-a91f-816ec9cb825f-05202017-4']
# reverse=True puts the most recent date first
for fn in sorted(list_of_filenames, key=gettimestamp, reverse=True):
    print(fn)
No, there is not a simple Python function for this. However, there are reasonably simple building blocks from which you can make a readable solution.
Write a function to extract the date and rearrange it to be useful as a sort key. Find the last two hyphens in the file name, grab the string between them, and rearrange the digits into the format yyyymmdd (year, month, day). Return that string or integer (either will work) as the function's return value.
For your main routine, collect all the file names in a list (or make a generator) and sort them, using the value of that function as the sort key.
See the sorting wiki for some implementation details.
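The steps above could be sketched roughly as follows. The helper name date_key and the sample names are my own, and the empty-string fallback for names without a valid date is an assumption, not something the question specifies:

```python
def date_key(filename):
    """Return the MMDDYYYY field between the last two hyphens as 'YYYYMMDD'."""
    # Split from the right: [prefix, date, trailing part such as '4.tar']
    parts = filename.rsplit("-", 2)
    if len(parts) < 3 or len(parts[1]) != 8 or not parts[1].isdigit():
        return ""  # assumption: names without a valid date count as oldest
    mmddyyyy = parts[1]
    return mmddyyyy[4:] + mmddyyyy[:2] + mmddyyyy[2:4]

# Hypothetical names in the question's format
filenames = [
    "esarchive--James-AB-Test226-8037affd-06d1-4c61-a91f-816ec9cb825f-05222017-4.tar",
    "esarchive--James-AB-Test226-8037affd-06d1-4c61-a91f-816ec9cb825f-12312016-1.tar",
    "esarchive--James-AB-Test226-8037affd-06d1-4c61-a91f-816ec9cb825f-05232017-2.tar",
]

# reverse=True gives the most recent date first
newest_first = sorted(filenames, key=date_key, reverse=True)
for name in newest_first:
    print(name)
```

Because yyyymmdd strings have a fixed width, plain string comparison orders them chronologically, so no datetime parsing is needed here.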
As Adam Smith has pointed out, you need the list of files to work with.
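import glob, os
import datetime
import re
timeformat = "%m%d%Y"
regex = re.compile(r"(\d{8})-\d+\.tar$")
def gettimestamp(thestring):
    # keep only the tail of the name, e.g. '05222017-4.tar' (assumes a one-digit suffix)
    m = regex.search(thestring[-14:])
    if m:
        return datetime.datetime.strptime(m.group(1), timeformat)
    else:
        # None keys cannot be compared in Python 3, so undated names get the minimum date
        return datetime.datetime.min
list_of_filenames = os.listdir('/home/james/Desktop/tarfolder')
# reverse=True puts the most recent date first
for fn in sorted(list_of_filenames, key=gettimestamp, reverse=True):
    print(fn)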
Edit: As Martineu has noticed, the hash might be different from the one you indicated, so it is easier to discard the beginning of the name in advance.
You don't need to parse the date, or even use regex for that matter. If the file names are structured as you say, it's sufficient to do just:
filenames = ['esarchive--James-AB-Test226-8037affd-06d1-4c61-a91f-816ec9cb825f-05212017-4',
'esarchive--James-AB-Test226-8037affd-06d1-4c61-a91f-816ec9cb825f-05202017-4',
'esarchive--James-AB-Test226-8037affd-06d1-4c61-a91f-816ec9cb825f-bad_date-4',]
def parse_date(name, offset=-10):
    try:
        date_str = name[offset:offset+8]
        # MMDDYYYY -> YYYYMMDD as an integer
        return int(date_str[-4:] + date_str[:2] + date_str[2:4])
    except (IndexError, TypeError, ValueError):  # invalid file name
        return -1
sorted_list = [x[1] for x in sorted((parse_date(l), l) for l in filenames) if x[0] != -1]
# ['esarchive--James-AB-Test226-8037affd-06d1-4c61-a91f-816ec9cb825f-05202017-4',
# 'esarchive--James-AB-Test226-8037affd-06d1-4c61-a91f-816ec9cb825f-05212017-4']
UPDATE - I've added the offset argument to specify where in the file name the date begins. In the list you've posted it begins 10 characters from the end (the default), but if the names have a .tar extension, as in your initial example, you need to account for those 4 characters as well and use an offset of -14:
names = ['James-AB-Test226-8037affd-06d1-4c61-a91f-816ec9cb825f-05212017-4.tar',
'James-AB-Test226-8037affd-06d1-4c61-a91f-816ec9cb825f-05202017-4.tar',
'James-AB-Test226-8037affd-06d1-4c61-a91f-816ec9cb825f-bad_date-4.tar']
sorted_list = [x[1] for x in sorted((parse_date(l, -14), l) for l in names) if x[0] != -1]
# ['James-AB-Test226-8037affd-06d1-4c61-a91f-816ec9cb825f-05202017-4.tar',
# 'James-AB-Test226-8037affd-06d1-4c61-a91f-816ec9cb825f-05212017-4.tar']
My question is closely related to Python identify file with largest number as part of filename
I want to append files to a certain directory. The names of the files are: file1, file2......file^n. This works if I do it in one go, but when I want to add files again and need to find the last file added (in this case the file with the highest number), it recognises 'file6' to be higher than 'file100'.
How can I solve this?
import glob
import os
latest_file = max(sorted(list_of_files, key=os.path.getctime))
print latest_file
As you can see, I tried looking at the creation time, and I also tried looking at the modification time, but these can be the same, so that doesn't help.
EDIT: my filenames have the extension ".txt" after the number.
I'll try to solve it only using filenames, not dates.
You have to convert the number to an integer before comparing; otherwise an alphanumeric sort is applied to the whole filename.
Proof of concept:
import re
list_of_files = ["file1","file100","file4","file7"]
def extract_number(f):
    s = re.findall(r"\d+$",f)
    return (int(s[0]) if s else -1,f)
print(max(list_of_files,key=extract_number))
result: file100
The key function extracts the digits found at the end of the filename and converts them to an integer; if nothing is found, it returns -1.
You don't need to sort to find the max; just pass the key to max directly.
If two files have the same index, the full filename is used to break the tie (which explains the tuple key).
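Since the question's edit says the real names end in ".txt", the digits are no longer at the very end of the string. A possible adaptation is sketched below; the pattern and the sample names are my own assumptions, not part of the proof of concept above:

```python
import re

# Digits immediately before an optional extension such as ".txt"
TRAILING_NUMBER = re.compile(r"(\d+)(?:\.[^.]*)?$")

def extract_number(name):
    match = TRAILING_NUMBER.search(name)
    # Same tuple key as before: (number or -1, full name as tie-breaker)
    return (int(match.group(1)) if match else -1, name)

# Hypothetical directory listing
list_of_files = ["file1.txt", "file100.txt", "file6.txt", "notes.txt"]
latest = max(list_of_files, key=extract_number)
print(latest)  # file100.txt
```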
Using the following regular expression you can get the number of each file:
import re
maxn = 0
for file in list_of_files:
    num = int(re.search(r'file(\d*)', file).group(1))  # assuming filename is "filexxx.txt"
    # compare num to previous max, e.g.
    maxn = num if num > maxn else maxn
At the end of the loop, maxn will be your highest filename number.
I have a directory with files that follow the format: LnLnnnnLnnn.txt
where L = letters and n = numbers. E.g: p2c0789c001.txt
I would like to separate these files based on whether the second number (i.e. 0789) is odd or even.
I've only managed to get this to work if the second number ranges between 0001-0009 using the code:
odd_files = []
for root, dirs, filenames in os.walk('.'):
    for filename in fnmatch.filter(filenames, 'p2c000[13579]*.txt'):
        odd_files.append(os.path.join(root, filename))
This will return the files: ['./p2c0001c054.txt', './p2c0003c055.txt', './p2c0005c056.txt', './p2c0007c057.txt', './p2c0009c058.txt']
Any suggestion how could I get this to work for any given four digit number?
The easiest solution would be to expand your wildcard to match a wider array of things.
to that end I would probably do something like:
for filename in fnmatch.filter(filenames, '??????[13579]*.txt'):
This will match any characters before your values, it will match any of the odd values in your wildcard class and then it will accept anything to match afterwards.
This is a bit gross because as it is aaaaaaaa3alkjfdhalkjfshglkjzsdhfgs.txt would match and that is super gross. If you know that the data in the directories you are walking is well controlled that might be ok. A better solution might be to specify things a bit more. This could be done with the following expression:
'[a-z][0-9][a-z][0-9][0-9][0-9][13579][a-z][0-9][0-9][0-9].txt'
The fnmatch.filter function uses Unix-style wildcards. That means you can use the following:
? - match any single character
* - matches anything from nothing to everything
[] - this matches a class of things, use a - for a range and ! for exclusion
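A quick way to compare the loose and the stricter pattern is to run both over a few sample names. The names below are made up for illustration:

```python
import fnmatch

# Hypothetical names: two odd, two even, and one unrelated file
names = ["p2c0001c054.txt", "p2c0002c055.txt", "p2c0789c001.txt",
         "p2c0790c002.txt", "readme.txt"]

loose = "??????[13579]*.txt"
strict = "[a-z][0-9][a-z][0-9][0-9][0-9][13579][a-z][0-9][0-9][0-9].txt"

loose_hits = fnmatch.filter(names, loose)
strict_hits = fnmatch.filter(names, strict)
print(loose_hits)   # ['p2c0001c054.txt', 'p2c0789c001.txt']
print(strict_hits)  # ['p2c0001c054.txt', 'p2c0789c001.txt']

# The loose pattern also accepts names that only look right at position 7
stray = ["aaaaaa3zzz.txt"]
print(fnmatch.filter(stray, loose))   # ['aaaaaa3zzz.txt']
print(fnmatch.filter(stray, strict))  # []
```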
Would this do it?
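import re
regex = re.compile(r"[a-z][0-9][a-z]([0-9]{4})[a-z][0-9]{3}\.txt")
# filenames is the list of names to test (e.g. from os.walk); non-matching names are skipped
odd_files = [x for x in filenames if regex.match(x) and int(regex.match(x).group(1)) % 2 == 1]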
If it's getting a little hairy, you could always turn that into a generator and code the tests by hand:
def odd_files_generator():
    for root, dirs, filenames in os.walk('.'):
        for filename in filenames:
            if filename[6] in '13579':
                yield filename
odd_files = list(odd_files_generator())
If your test is growing exceedingly hard to express tersely, replace the if filename ... line with your explicit test code.
There's no particular magic to constructing this kind of filter. It just
requires carefully constructing the appropriate regular expression and testing
against it. When using complex patterns with a lot of repetitive components,
errors can easily creep in. I like to define helper functions that make the
specification more human-readable and easier to modify later if need be.
import re
import os
# helper functions for legible re construction
LETTER = lambda n='': '({0}{1})'.format('[A-Za-z]', n)
NUM = lambda n='': '({0}{1})'.format(r'\d', n)
FILENAME = LETTER() + NUM() + LETTER() + NUM('{4}') + LETTER() + NUM('{3}') + r'\.txt'
FILENAME_RE = re.compile(FILENAME)
is_odd = lambda n: int(n) % 2 > 0
def odd_nnnn(f):
    """
    Determine if the given filename `f` matches our desired LnLnnnnLnnn.txt pattern
    with the second group of numbers (nnnn) odd.
    """
    m = FILENAME_RE.search(f)
    return m is not None and is_odd(m.group(4))
if __name__ == '__main__':
    print("Search pattern: " + FILENAME)
    files = ['./p2c0001c054.txt', './p2c0001c055.txt', './p2c0003c055.txt', './p2c0005c056.txt', './p2c0022c056.txt', './p2c0004c056.txt', './p2c0007c057.txt', './p2c0009c058.txt', './p2c8888c056.txt', ]
    files = [ os.path.normpath(f) for f in files ]
    root = '/users/test/whatever'
    odd_paths = [ os.path.join(root, f) for f in files if odd_nnnn(f) ]
    print(odd_paths)
The only real downside to this is that it's a little more verbose, especially compared to a hyper-compact answer like Brad Beattie's.
[Update] It later occurred to me that a more compact way to define the regular expression might be:
FILENAME = r"LnL(nnnn)Lnnn\.txt"
FILENAME_PAT = FILENAME.replace('L', r'[A-Za-z]').replace('n', r'\d')
FILENAME_RE = re.compile(FILENAME_PAT)
This more closely follows the original 'LnLnnnnLnnn.txt' description. The match expression would have to change from m.group(4) to m.group(1), because just one group is captured this way.
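For completeness, here is a minimal self-contained sketch of how the compact definition might be used with m.group(1). The sample names in the comments and tests are made up:

```python
import re

# Template from the update: L = letter, n = digit, the group captures nnnn
FILENAME = r"LnL(nnnn)Lnnn\.txt"
FILENAME_PAT = FILENAME.replace("L", "[A-Za-z]").replace("n", r"\d")
FILENAME_RE = re.compile(FILENAME_PAT)

def odd_nnnn(f):
    """True if f matches LnLnnnnLnnn.txt and the nnnn part is odd."""
    m = FILENAME_RE.search(f)
    # Only one group is captured now, hence group(1) instead of group(4)
    return m is not None and int(m.group(1)) % 2 == 1

print(FILENAME_PAT)
print(odd_nnnn("p2c0789c001.txt"))  # True
print(odd_nnnn("p2c0790c001.txt"))  # False
```

The replace trick works here only because the rest of the template ("txt", the parentheses, the dot) contains no "L" or "n"; a template with those letters elsewhere would need a different placeholder.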
I would like to ask how to efficiently handle accessing of filenames in a folder in the right order (alphabetical and increasing in number).
For example, I have the following files in a folder: apple1.dat, apple2.dat, apple10.dat, banana1.dat, banana2.dat, banana10.dat. I would like to read the contents of the files such that apple1.dat will be read first and banana10.dat will be read last.
Thanks.
This is what I did so far.
from glob import glob
files = glob('*.dat')
for name in files:
    pass  # I read the files here in order
But as pointed out, apple10.dat comes before apple2.dat
from glob import glob
import os
# my_folder is the directory that holds the .dat files
files_list = glob(os.path.join(my_folder, '*.dat'))
for a_file in sorted(files_list):
    # do whatever with the file
    # 'open' or 'with' statements depending on your python version
    pass
Try this one.
import os
def get_sorted_files(Directory):
    filenamelist = []
    for root, dirs, files in os.walk(Directory):
        for name in files:
            fullname = os.path.join(root, name)
            filenamelist.append(fullname)
    return sorted(filenamelist)
You have to cast the numbers to an int first. Doing it the long way would require breaking the names into the strings and numbers, casting the numbers to an int and sorting. Perhaps someone else has a shorter or more efficient way.
def split_in_two(str_in):
    ## go from right to left until a letter is found
    ## assume first letter of name is not a digit
    for ctr in range(len(str_in)-1, 0, -1):
        if not str_in[ctr].isdigit():
            return str_in[:ctr+1], str_in[ctr+1:]  ## ctr+1 = first digit
    ## default for no letters found
    return str_in, "0"
files=['apple1.dat', 'apple2.dat', 'apple10.dat', 'apple11.dat',
       'banana1.dat', 'banana10.dat', 'banana2.dat']
print(sorted(files))  ## sorted as you say
sort_numbers = []
for f in files:
    ## split off '.dat'
    no_ending = f[:-4]
    str_1, str_2 = split_in_two(no_ending)
    sort_numbers.append([str_1, int(str_2), ".dat"])
sort_numbers.sort()
print(sort_numbers)
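A shorter variant of the same idea (split into text and numbers, cast the numbers to int, then sort) can be written as a key function. This is only a sketch; the name natural_key is my own, and it relies on re.split with a capturing group putting the digit runs at the odd indices:

```python
import re

def natural_key(name):
    # ['apple', '10', '.dat'] -> ['apple', 10, '.dat']; odd indices are the digit runs
    parts = re.split(r"(\d+)", name)
    return [int(p) if i % 2 else p for i, p in enumerate(parts)]

files = ["banana10.dat", "apple10.dat", "banana2.dat",
         "apple1.dat", "apple2.dat", "banana1.dat"]

in_order = sorted(files, key=natural_key)
print(in_order)
# ['apple1.dat', 'apple2.dat', 'apple10.dat',
#  'banana1.dat', 'banana2.dat', 'banana10.dat']
```

Unlike the version above, this keeps the original file names in the result, so they can be passed straight to open().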