I have a list with 1000+ values in it. The values are the names of the files in a folder, as returned by os.listdir(folder_path).
The code looks like this:
import os
folder_path = "some path here"
filelist = os.listdir(folder_path)
print(filelist)
Now when I look at the printed list, I see that the list isn't sorted by name. The filenames are something like ["text-1-1.txt","txt-1-23.txt","txt-1-32.txt","txt-1-10.txt","txt-2-1.txt","txt-2-32.txt"...]
Also, I know that there are filenames that increment by one, like: text-1-1.txt, text-1-2.txt, text-1-3.txt,.... text-2-1.txt, text-2-2.txt,...
I have tried two ways to sort the list: new_list = sorted(filelist) and filelist.sort()
Neither worked; the list came out the same as the original. How can I sort this list? Do I have to write a sorting algorithm (like bubble or selection sort) by hand?
You can run it this way:
import os
folder_path = "some path here"
filelist = os.listdir(folder_path)
filelist.sort() #Added this line
print(filelist)
By default, Python already sorts strings in lexicographical order, but all uppercase letters sort before lowercase letters. If you want to sort strings and ignore case, you can do:
new_filelist = sorted(filelist, key=str.lower)
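A small sketch of what "lexicographical" means in practice may help here. The sample names below are made up for illustration; the point is that strings are compared character by character, so "10" lands before "2" even when case is ignored.

```python
# Hypothetical sample names, chosen only to show the effect.
names = ["b-2.txt", "B-10.txt", "a-1.txt"]

# Plain sort: the uppercase "B" sorts before every lowercase letter.
plain = sorted(names)

# Case-insensitive sort: still character by character,
# so "B-10" comes before "b-2" because "1" < "2".
ignore_case = sorted(names, key=str.lower)

print(plain)
print(ignore_case)
```

Neither variant orders the numbers by value, which is why a numeric-aware key is needed for names like these.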
You can create a custom function for this, that creates a tuple of ints from the filenames:
>>> def sl_no(s):
...     return tuple(map(int, s.split('.')[0].rsplit('-', 2)[-2:]))
>>> sl_no("text-1-1.txt")
(1, 1)
>>> sorted(filelist, key=sl_no)
['text-1-1.txt',
'txt-1-10.txt',
'txt-1-23.txt',
'txt-1-32.txt',
'txt-2-1.txt',
'txt-2-32.txt']
Or, you can use re:
>>> import re
>>> sorted(filelist, key=lambda x: tuple(map(int, re.findall(r'\d+', x))))
['text-1-1.txt',
'txt-1-10.txt',
'txt-1-23.txt',
'txt-1-32.txt',
'txt-2-1.txt',
'txt-2-32.txt']
In order to support all kinds of file names that contain numbers, you can define a sortKey function that isolates the numeric parts of the names and right-justifies them (with leading zeros) for the purpose of sorting:
import re
def sortKey(n):
    return "".join([s,f"{s:>010}"][s.isdigit()] for s in re.split(r"(\d+)",n))
output:
names = ["text-1-1.txt","txt-1-23.txt","txt-1-32.txt","txt-1-10.txt",
"txt-2-1.txt","txt-2-32.txt"]
print(sorted(names,key=sortKey))
# ['text-1-1.txt', 'txt-1-10.txt', 'txt-1-23.txt', 'txt-1-32.txt',
# 'txt-2-1.txt', 'txt-2-32.txt']
names = ["log2020/12/23.txt","log2021/1/3.txt","log2021/02/1.txt",
"log2021/1/1.txt","log2021/1/13.txt"]
print(sorted(names,key=sortKey))
# ['log2020/12/23.txt', 'log2021/1/1.txt', 'log2021/1/3.txt',
# 'log2021/1/13.txt', 'log2021/02/1.txt']
I have a folder that contains some XML files. I'm trying to read those files and store them in a list in ascending order. I have written the following code, but I'm not sure how to do it. The folder contains files such as:
a.xml_1
a.xml_2
a.xml_3
...
When I run the following code, the created list is not ordered.
import os
path = 'mypath/folder/'
xml_files=[]
files = os.listdir(path)
for f in files:
    xml_files=[f]
print(xml_files)
import os
print (sorted(os.listdir('./')))
You can also use glob btw:
import glob
print (sorted(glob.glob('./*')))
You may run into trouble if you want to sort alphanumeric strings. There is a well-known function for that:
import re
def sorted_nicely( l ):
    """ Sort the given iterable in the way that humans expect."""
    convert = lambda text: int(text) if text.isdigit() else text
    alphanum_key = lambda key: [ convert(c) for c in re.split('([0-9]+)', key) ]
    return sorted(l, key = alphanum_key)
You can then use:
print (sorted_nicely(os.listdir('./')))
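As a quick check of the idea, here is a self-contained sketch using the same split-on-digits technique. The helper name natural_key and the sample listing are assumptions for illustration, not part of the answer above.

```python
import re

def natural_key(name):
    # Split into alternating non-digit and digit runs;
    # digit runs are converted to int so they compare by value.
    return [int(part) if part.isdigit() else part
            for part in re.split(r"(\d+)", name)]

# Hypothetical directory listing in arbitrary order.
files = ["a.xml_10", "a.xml_2", "a.xml_1"]

plain = sorted(files)                      # character-by-character order
natural = sorted(files, key=natural_key)   # numeric-aware order

print(plain)
print(natural)
```

Because re.split with a capturing group always yields a string first (possibly empty), the string and integer parts line up position by position when two keys are compared.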
Try this one; it will give you all the files sorted in ascending order in a single list:
import os
path = 'mypath/folder/'
xml_files=[]
files = os.listdir(path)
for f in files:
    xml_files.append(f)
xml_files.sort()
print(xml_files)
I want to create a list of all the directory paths that match a specific string, e.g. "04_DEM", so I can do further processing on the files inside those directories.
e.g.
INPUT
C:\directory\NewZealand\04DEM\DEM_CD23_1232.tif
C:\directory\Australia\04DEM\DEM_CD23_1233.tif
C:\directory\NewZealand\05DSM\DSM_CD23_1232.tif
C:\directory\Australia\05DSM\DSM_CD23_1232.tif
WANTED OUTPUT
C:\directory\NewZealand\04DEM\
C:\directory\Australia\04DEM\
This makes sure that only those files are processed, as some other files in the directories also have the same string "DEM" included in their filename, which I do not want to modify.
This is my bad attempt due to being a rookie with Py code
import os
for dirnames in os.walk('D:\Canterbury_2017Copy'):
print dirnames
if dirnames=='04_DEM' > listofdirectoriestoprocess.txt
print "DONE CHECK TEXT FILE"
You can use os.path for this:
import os
lst = [r'C:\directory\NewZealand\04DEM\DEM_CD23_1232.tif',
r'C:\directory\Australia\04DEM\DEM_CD23_1233.tif',
r'C:\directory\NewZealand\05DSM\DSM_CD23_1232.tif',
r'C:\directory\Australia\05DSM\DSM_CD23_1232.tif']
def filter_paths(lst, x):
    return [os.path.split(i)[0] for i in lst if os.path.normpath(i).split(os.sep)[3] == x]
res = list(filter_paths(lst, '04DEM'))
# ['C:\\directory\\NewZealand\\04DEM',
# 'C:\\directory\\Australia\\04DEM']
Use in to check if a required string is in another string.
This is one quick way:
new_list = []
for path in path_list:
    if '04DEM' in path:
        new_list.append(path)
Demo:
s = 'C:/directory/NewZealand/04DEM/DEM_CD23_1232.tif'
if '04DEM' in s:
    print(True)
# True
Make sure you use / or \\ as directory separator instead of \ because the latter escapes characters.
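The effect of the backslash can be seen in a short sketch. The path below is hypothetical and was picked because it contains \n and \t, which Python treats as escape sequences in an ordinary string literal; a raw string (r"...") is a further option alongside / and \\.

```python
# Hypothetical path containing "\n" and "\t".
plain = "C:\new\table.tif"       # "\n" becomes a newline, "\t" a tab
raw = r"C:\new\table.tif"        # raw string keeps every backslash
doubled = "C:\\new\\table.tif"   # escaped backslashes
forward = "C:/new/table.tif"     # forward slashes need no escaping

print(raw == doubled)            # the two spellings are the same string
print(len(plain), len(raw))      # the plain literal is two characters shorter
print("\n" in plain)             # the newline really is in there
```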
First, you select via regex using re, and then use pathlib:
import os
import re
import pathlib
pattern = re.compile('04DEM')
# Use pattern.search() to find the pattern anywhere in the string.
# Use pattern.match() to match only at the start of the string,
# or pattern.fullmatch() if the whole string must match.
# Apply the correct function to your use case.
files = [s for s in list_of_files if pattern.search(s)]
all_pruned_paths = set()
for p in files:
    total = ""
    # Rebuild the path component by component and stop at the matching directory
    for d in pathlib.Path(p).parts:
        total = os.path.join(total, d)
        if pattern.search(d):
            break
    all_pruned_paths.add(total)
result = list(all_pruned_paths)
This is more robust than using in because you might need to form more complicated queries in the future.
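For comparison, a minimal pathlib-only sketch of the same pruning, assuming the inputs are Windows-style paths as in the question. The function name dirs_named is hypothetical; PureWindowsPath is used so the backslash paths parse the same way on any operating system.

```python
from pathlib import PureWindowsPath

paths = [r"C:\directory\NewZealand\04DEM\DEM_CD23_1232.tif",
         r"C:\directory\Australia\04DEM\DEM_CD23_1233.tif",
         r"C:\directory\NewZealand\05DSM\DSM_CD23_1232.tif",
         r"C:\directory\Australia\05DSM\DSM_CD23_1232.tif"]

def dirs_named(paths, name):
    """Return the unique ancestor directories whose last component equals name."""
    found = []
    for p in map(PureWindowsPath, paths):
        for parent in p.parents:          # walks upward, directory by directory
            if parent.name == name and str(parent) not in found:
                found.append(str(parent))
    return found

result = dirs_named(paths, "04DEM")
print(result)
```

Comparing whole path components avoids the problem mentioned in the question, where a file name such as DEM_CD23_1232.tif also contains "DEM".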
I am trying to associate some filepaths from 2 list elements in Python. These files have a part of their name identical, while the extension and some extra words are different.
This means the extension of the file, extra characters and their location can differ. The files are in different folders, hence their filepath name differs. What is exactly equal: their Numbering index: 0033, 0061 for example.
Example code:
original_files = ['C:/0001.jpg',
'C:/0033.jpg',
'C:/0061.jpg',
'C:/0080.jpg',
'C:/0204.jpg',
'C:/0241.jpg']
related_files = ['C:/0001_PM.png',
'C:/0033_PMA.png',
'C:/0033_NM.png',
'C:/0061_PMLTS.png',
'C:/0080_PM.png',
'C:/0080_RS.png',
'C:/0204_PM.png']
for idx, filename in enumerate(original_files):
    related_filename = [s for s in (related_files) if filename.rsplit('/',1)[1][:-4] in s]
    print(related_filename)
At filename = 'C:/0241.jpg' it should return [], but instead it returns all the filenames from related_files.
For privacy reasons I didn't post the entire filepath, just the names of the files. In this example, the comparison works, but for the entire filepath it fails.
I suppose my comparison condition is not correct but I don't know how to write it.
Note: I am looking for something with as few code lines as possible to do this.
I suggest something along the lines of:
from collections import defaultdict
original_files = ['C:/0001.jpg',
'C:/0033.jpg',
'C:/0061.jpg',
'C:/0080.jpg',
'C:/0204.jpg',
'C:/0241.jpg']
related_files = ['C:/0001_PM.png',
'C:/0033_PMA.png',
'C:/0033_NM.png',
'C:/0061_PMLTS.png',
'C:/0080_PM.png',
'C:/0080_RS.png',
'C:/0204_PM.png']
def key1(filename):
    return filename.rsplit('/', 1)[-1].rsplit('.', 1)[0]
def key2(filename):
    return key1(filename).split('_', 1)[0]
d = defaultdict(list)
for x in related_files:
    d[key2(x)].append(x)
for x in original_files:
    related = d.get(key1(x), [])
    print(x, '->', related)
In key1() and key2() you could alternately use os.path functions or pathlib.Path methods.
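A minimal sketch of that pathlib variant, assuming forward-slash paths like the ones in the question. The shortened sample lists are for illustration only, and PurePosixPath is an assumption chosen so the example behaves the same on every platform.

```python
from collections import defaultdict
from pathlib import PurePosixPath

def key1(filename):
    # File name without directory and without the last extension.
    return PurePosixPath(filename).stem

def key2(filename):
    # Same, but cut at the first underscore: "0033_PMA" -> "0033".
    return key1(filename).split("_", 1)[0]

# Shortened, hypothetical versions of the lists in the question.
originals = ["C:/0033.jpg", "C:/0241.jpg"]
related = ["C:/0033_PMA.png", "C:/0033_NM.png", "C:/0080_RS.png"]

groups = defaultdict(list)
for name in related:
    groups[key2(name)].append(name)

matches = {name: groups.get(key1(name), []) for name in originals}
print(matches)
```

Since the whole index is compared for equality rather than tested with in, an original without any related file maps to an empty list.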
Here's a solution that returns only the matched related_files.
import os, re
def get_index(filename):
    m = re.match('([0-9]+)', os.path.split(filename)[1])
    return m.group(1) if m else False
# Build a set so membership can be tested repeatedly (a bare filter() is a one-shot iterator in Python 3)
indexes = set(filter(None, map(get_index, original_files)))
[f for f in related_files if get_index(f) in indexes]
Make use of defaultdict.
import os, re
from collections import defaultdict
stragglers = []
grouped_files = defaultdict(list)
file_index = re.compile('([0-9]+)')
for f in original_files + related_files:
    m = file_index.match(os.path.split(f)[1])
    if m:
        grouped_files[m.group(1)].append(f)
    else:
        stragglers.append(f)
You now have grouped_files, a dict (or dictionary-like object) of key-value pairs where the key is the regex matched part of the filename and the value is a list of matching filenames.
for x in grouped_files.items():
    print(x)
# ('0001', ['C:/0001.jpg', 'C:/0001_PM.png'])
# ('0033', ['C:/0033.jpg', 'C:/0033_PMA.png', 'C:/0033_NM.png'])
# ('0061', ['C:/0061.jpg', 'C:/0061_PMLTS.png'])
# ('0080', ['C:/0080.jpg', 'C:/0080_PM.png', 'C:/0080_RS.png'])
# ('0204', ['C:/0204.jpg', 'C:/0204_PM.png'])
# ('0241', ['C:/0241.jpg'])
In stragglers you have any filenames that didn't match your regex.
print(stragglers)
# []
For Python 3.x you can try this (the slice [3:7] picks out the four-digit index and assumes every path starts with 'C:/'):
for origfiles in original_files:
    for relfiles in related_files:
        if origfiles[3:7] == relfiles[3:7]:
            print(origfiles, '->', relfiles)
I have a very basic question. I have files named like Dipole_E0=1.2625E-01.dat and I want to extract the 1.2625E-01 part and finally sort them in ascending order. How can this be done? I first tried to split the filename with .split(), but it does not do what I expect. Thanks for your help.
Best
Roland
The best way is to use a regular expression. To obtain the value from a file name:
import re
m = re.search(r'^Dipole_E0=(.*)\.dat$', filename)
val = float(m.group(1))
Walk through all the filenames and append each value to a list. After that, sort the list and that's all.
You want to look into regular expressions. In Python they live in the re module. Depending on the exact format, something like:
import re
ematch = re.compile(r"=([0-9]*\.[0-9]*[eE][+-][0-9]+)")
val = ematch.search(filename).group(1)
Sorting a list can be done with the .sort() method on lists, or with the sorted(list) builtin, which gives you a new list.
This is a good situation to use a generator expression and the sorted builtin:
sorted(float(filename.split("=", 1)[1].rsplit(".", 1)[0]) for filename in filenames)
Where filenames is your list of filenames.
>>> filenames = ["Dipole_E0=1.2625E-01.dat", "Dipole_E0=1.3625E-01.dat", "Dipole_E0=0.2625E-01.dat"]
>>> sorted(float(filename.split("=", 1)[1].rsplit(".", 1)[0]) for filename in filenames)
[0.02625, 0.12625, 0.13625]
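If the goal is to keep the file names and only order them by the embedded value, the same extraction can serve as a sort key. This is a sketch under that assumption; the helper name field_value is made up for illustration.

```python
filenames = ["Dipole_E0=1.2625E-01.dat",
             "Dipole_E0=1.3625E-01.dat",
             "Dipole_E0=0.2625E-01.dat"]

def field_value(filename):
    # Text between the first "=" and the last "." parsed as a float.
    return float(filename.split("=", 1)[1].rsplit(".", 1)[0])

# Sort the names themselves, ordered by the numeric value they contain.
ordered = sorted(filenames, key=field_value)
print(ordered)
```

This way the sorted result can still be used to open the files in order.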
You can get the filenames with the glob module.
from glob import glob
file_names = glob("yourpath/*.dat")
vals = []
for name in file_names:
    vals.append(float(name[:-4].rpartition("=")[2]))
vals.sort()
name[:-4] throws away the ".dat". rpartition is a string method. It returns a tuple where entry 0 is the string left of the string used to split, entry 1 is the string used to split (here: "=") and entry 2 is the string right of this string (here: your float). Then it is converted to a float and appended to the list of values.
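The three parts returned by rpartition can be inspected directly. The path below is a hypothetical example in the shape glob would return.

```python
# Hypothetical path of the kind glob("yourpath/*.dat") returns.
name = "yourpath/Dipole_E0=1.2625E-01.dat"

stem = name[:-4]                        # drop the trailing ".dat"
left, sep, right = stem.rpartition("=") # split at the LAST "="
value = float(right)

print(left)    # everything before the last "="
print(sep)     # the separator itself
print(right)   # the text that is converted to a float
print(value)
```

Splitting at the last "=" means an "=" earlier in the directory name would not disturb the result.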
My code reads a directory and stores each filename with its extension in a list. What I am trying to do is get rid of the extension with replace. However, the change is not saved into the list.
print projectFilenames
for f in projectFilenames:
    print f
    f = f.replace('.txt','')
    f = f.replace('.mdown','')
    f = f.replace('.markdown','')
    print f
print projectFilenames
and this is my output
['2010-10-30-markdown-example.txt', '2010-12-29-hello-world.mdown', '2011-1-1-tester.markdown']
2010-10-30-markdown-example.txt
2010-10-30-markdown-example
2010-12-29-hello-world.mdown
2010-12-29-hello-world
2011-1-1-tester.markdown
2011-1-1-tester
['2010-10-30-markdown-example.txt', '2010-12-29-hello-world.mdown', '2011-1-1-tester.markdown']
What am I doing wrong?
The list doesn't change because you aren't updating it. You aren't touching the projectFilenames list in any way, nor the strings in the list: strings are immutable, and assigning to f only rebinds the name f (Python variables are not pointers). Here is one way to do it:
newlist = []
for f in projectFilenames:
    f = f.replace('.txt','')
    f = f.replace('.mdown','')
    f = f.replace('.markdown','')
    newlist.append(f)
projectFilenames = newlist
Also, look at the os.path module, there are functions there to cut off file extensions. os.path.splitext() specifically. So another way of doing it would be:
newlist = []
for f in projectFilenames:
    f = os.path.splitext(f)[0]
    newlist.append(f)
projectFilenames = newlist
That in turn can be simplified to (and made compliant with PEP 8):
>>> import os
>>> project_filenames = ['2010-10-30-markdown-example.txt', '2010-12-29-hello-world.mdown', '2011-1-1-tester.markdown']
>>> project_filenames = [os.path.splitext(f)[0] for f in project_filenames]
>>> project_filenames
['2010-10-30-markdown-example', '2010-12-29-hello-world', '2011-1-1-tester']
You are rebinding the name f, not changing projectFilenames. Do it like this:
>>> [x.split(".")[0] for x in projectFilenames]
['2010-10-30-markdown-example', '2010-12-29-hello-world', '2011-1-1-tester']
This removes the extension. Note that split(".")[0] cuts at the first dot, so it only works when the names contain no other dots.
list slice replacement using list comprehension (updates contents of original list):
l[:] = [f.replace('.txt','').replace('.mdown','').replace('.markdown','') for f in l]
splitext (suggested by @lennart, @sukhbir):
import os
l[:] = [os.path.splitext(f)[0] for f in l]
using enumerate:
import os
for idx, filename in enumerate(l):
    newfilename = os.path.splitext(filename)[0]
    if newfilename != filename:
        l[idx] = newfilename
Anecdotal: splitext()[0], str.replace('.txt', '') and str.split('.txt')[0] return the original string untouched if it has no extension or match (at least in Python 2.6).
As mentioned in other answers, the reason the filename list doesn't change is because your code doesn't change it. There's a number of ways of fixing that, including building a new list and replacing the original with it.
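A tiny sketch makes the difference visible. The two-element list is made up for illustration; the first loop mirrors the pattern from the question and the last line shows the "build a new list" fix.

```python
names = ["a.txt", "b.mdown"]   # hypothetical sample list

for f in names:
    # This creates a new string and rebinds the local name f;
    # the list element itself is never assigned to.
    f = f.replace(".txt", "")

after_loop = list(names)       # still the original contents

# Building a new list and rebinding the list name does change the result.
names = [f.replace(".txt", "").replace(".mdown", "") for f in names]

print(after_loop)
print(names)
```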
I think the simplest approach would be to just modify the list as you're iterating over its elements (but you have to be careful when doing this that you don't modify parts of the list not yet seen). Python has a built-in function called enumerate() which makes this kind of task easy to code. What enumerate() does is return an "iterator" object which counts out the items as it provides each one from the sequence -- so the count is also the item's index in that sequence, and when necessary that index can be used to update the corresponding list element.
Since your code is dealing with filenames, it could be improved further by making use of the built-in module named os.path which is available for dealing with paths and filenames. That module has a splitext() function for breaking filenames into two parts, the "root" and the extension. It's called root instead of filename in the documentation because it could have directory path information prefixed onto it.
If your code was rewritten using enumerate() and os.path.splitext() it might look something like this:
import os
projectFilenames = ['2010-10-30-markdown-example.txt',
'2010-12-29-hello-world.mdown',
'2011-1-1-tester.markdown']
for i, f in enumerate(projectFilenames):
    root, ext = os.path.splitext(f)
    if ext in ('.txt', '.mdown', '.markdown'):
        projectFilenames[i] = root  # update filename list leaving ext off
print(projectFilenames)
# ['2010-10-30-markdown-example', '2010-12-29-hello-world', '2011-1-1-tester']
If you wanted to remove all file extensions, not just specific ones, you could just change the if ext in ('.txt', '.mdown', '.markdown'): to just if ext:.
Such operations on filenames should always be done with os.path.splitext, because it is easier to maintain and portable, and you don't have to reinvent the wheel.
>>> os.path.splitext('/home/Desktop/foo.py')[0]
'/home/Desktop/foo'
So say that you have a list, x, which has all the files:
[os.path.splitext(files)[0] for files in x]
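It may be worth checking how splitext treats less regular names before relying on it. The sample names below are hypothetical; the sketch only shows that the last extension is the one removed and that names without an extension, including a leading-dot name, come back unchanged.

```python
import os.path

# Hypothetical names covering a few edge cases.
samples = ["notes.txt", "archive.tar.gz", ".bashrc", "README", "/home/Desktop/foo.py"]

# splitext returns (root, ext); keep only the root.
roots = [os.path.splitext(s)[0] for s in samples]

for original, root in zip(samples, roots):
    print(original, "->", root)
```

Only ".gz" is stripped from "archive.tar.gz", so names with stacked extensions need a second pass if both parts should go.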