Extract and sort numbers from filenames in python - python

I have a very basic question. I have files named like Dipole_E0=1.2625E-01.dat and I want to extract the 1.2625E-01 part and finally sort the values in ascending order. How can this be done? I first tried to split the filename with .split(), but it does not do what I expect. Thanks for your help.
Best
Roland

The best way is to use a regular expression. To obtain the value from a file name:
import re
m = re.search(r'^Dipole_E0=(.*)\.dat$', filename)
val = float(m.group(1))
Walk through all filenames and append each value to a list. After that, sort the list and you are done.

You want to look into regular expressions. In python they live in the re module. Depending on exact format, something like:
import re
ematch = re.compile(r"=([0-9]*\.[0-9]*[eE][+-][0-9]+)")
val = float(ematch.search(filename).group(1))
Sorting a list can be done with the .sort() method on lists, or the sorted(list) builtin, which gives you a new list.

This is a good situation to use a generator expression and the sorted builtin:
sorted(float(filename.split("=", 1)[1].rsplit(".", 1)[0]) for filename in filenames)
Where filenames is your list of filenames.
>>> filenames = ["Dipole_E0=1.2625E-01.dat", "Dipole_E0=1.3625E-01.dat", "Dipole_E0=0.2625E-01.dat"]
>>> sorted(float(filename.split("=", 1)[1].rsplit(".", 1)[0]) for filename in filenames)
[0.02625, 0.12625, 0.13625]

You can get the filenames with the glob module.
from glob import glob
file_names = glob("yourpath/*.dat")
vals = []
for name in file_names:
    vals.append(float(name[:-4].rpartition("=")[2]))
vals.sort()
name[:-4] throws away the ".dat". rpartition is a string method. It returns a tuple where entry 0 is the string left of the string used to split, entry 1 is the string used to split (here: "=") and entry 2 is the string right of this string (here: your float). Then it is converted to a float and appended to the list of values.
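The answers above produce the sorted values only. If the goal is to order the file names themselves by that value, one possible sketch is to pass an extraction function as the key to sorted. The helper name energy_from_name is hypothetical, and the sketch assumes every name looks like Dipole_E0=<float>.dat.

```python
import os

def energy_from_name(path):
    """Return the float between '=' and the extension (hypothetical helper)."""
    # Drop any directory part and the ".dat" extension.
    stem = os.path.splitext(os.path.basename(path))[0]
    # Keep the text after the last "=" and convert it to a float.
    return float(stem.rpartition("=")[2])

filenames = [
    "Dipole_E0=1.2625E-01.dat",
    "Dipole_E0=1.3625E-01.dat",
    "Dipole_E0=0.2625E-01.dat",
]
sorted_names = sorted(filenames, key=energy_from_name)
print(sorted_names)
```

Sorting with a key keeps the original strings intact, so the sorted names can be opened directly afterwards.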

Related

how to sort a list containing filenames?

I have a list of ~1000+ values in it. The values are the names of files in a folder which is given by os.listdir(folder_path)
code looks like this:
import os
folder_path = "some path here"
filelist = os.listdir(folder_path)
print(filelist)
Now when I look at the printed list, I see that the list isn't sorted by name. The filenames are something like ["text-1-1.txt","txt-1-23.txt","txt-1-32.txt","txt-1-10.txt","txt-2-1.txt","txt-2-32.txt"...]
Also, I know that there are filenames that increment by one, like: text-1-1.txt, text-1-2.txt, text-1-3.txt,.... text-2-1.txt, text-2-2.txt,...
I have tried these two methods to try and sort the list: new_list = sorted(filelist) & filelist.sort()
Both did not work and the list came out to be the same as the original, how can I sort this list? Do I have to manually write sorting algorithms(like Bubble, or Selection)?
You can run it this way:
import os
folder_path = "some path here"
filelist = os.listdir(folder_path)
filelist.sort() #Added this line
print(filelist)
By default, python already sorts strings in lexicographical order, but uppercase letters are all sorted before lowercase letters. If you want to sort strings and ignore case, then you can do
new_filelist = sorted(filelist, key=str.lower)
You can create a custom function for this, that creates a tuple of ints from the filenames:
>>> def sl_no(s):
...     return tuple(map(int, s.split('.')[0].rsplit('-', 2)[-2:]))
...
>>> sl_no("text-1-1.txt")
(1, 1)
>>> sorted(filelist, key=sl_no)
['text-1-1.txt',
'txt-1-10.txt',
'txt-1-23.txt',
'txt-1-32.txt',
'txt-2-1.txt',
'txt-2-32.txt']
Or, you can use re:
>>> import re
>>> sorted(filelist, key=lambda x: tuple(map(int, re.findall(r'\d+', x))))
['text-1-1.txt',
'txt-1-10.txt',
'txt-1-23.txt',
'txt-1-32.txt',
'txt-2-1.txt',
'txt-2-32.txt']
In order to support all kinds of file names that contain numbers, you can define a sortKey function that isolates the numeric parts of the names and right-justifies them (with leading zeros) for the purpose of sorting:
import re
def sortKey(n):
    return "".join([s, f"{s:>010}"][s.isdigit()] for s in re.split(r"(\d+)", n))
output:
names = ["text-1-1.txt","txt-1-23.txt","txt-1-32.txt","txt-1-10.txt",
"txt-2-1.txt","txt-2-32.txt"]
print(sorted(names,key=sortKey))
# ['text-1-1.txt', 'txt-1-10.txt', 'txt-1-23.txt', 'txt-1-32.txt',
# 'txt-2-1.txt', 'txt-2-32.txt']
names = ["log2020/12/23.txt","log2021/1/3.txt","log2021/02/1.txt",
"log2021/1/1.txt","log2021/1/13.txt"]
print(sorted(names,key=sortKey))
# ['log2020/12/23.txt', 'log2021/1/1.txt', 'log2021/1/3.txt',
# 'log2021/1/13.txt', 'log2021/02/1.txt']
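A variation on the same idea, offered as a sketch rather than a replacement: instead of zero-padding to ten characters (which assumes no number is longer than ten digits), convert each digit run to an int and compare lists. The name natural_key is hypothetical.

```python
import re

def natural_key(name):
    """Split a name into text and integer chunks for sorting (hypothetical helper)."""
    # With a capturing group, re.split alternates text and digit runs:
    # even positions hold text (possibly empty), odd positions hold digits.
    parts = re.split(r"(\d+)", name)
    return [int(p) if i % 2 else p for i, p in enumerate(parts)]

names = ["log2020/12/23.txt", "log2021/1/3.txt", "log2021/02/1.txt",
         "log2021/1/1.txt", "log2021/1/13.txt"]
result = sorted(names, key=natural_key)
print(result)
```

Because text and numbers always land at the same positions in every key, Python never has to compare a str with an int. Note that "02" and "2" compare as equal under this key.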

Extracting numbers from a filename string in python

I have a number of html files in a directory. I am trying to store the filenames in a list so that I can use it later to compare with another list.
Eg: Prod224_0055_00007464_20170930.html is one of the filenames. From the filename, I want to extract '00007464' and store this value in a list and repeat the same for all the other files in the directory. How do I go about doing this? I am new to Python and any help would be greatly appreciated!
Please let me know if you need more information to answer the question.
Split the filename on underscores and select the third element (index 2).
>>> 'Prod224_0055_00007464_20170930.html'.split('_')[2]
'00007464'
In context that might look like this:
nums = [f.split('_')[2] for f in os.listdir(dir) if f.endswith('.html')]
You may try this (assuming you are in the folder with the files):
import os
num_list = []
r, d, files = next(os.walk('.'))
for f in files:
    parts = f.split('_')  # now `parts` contains ['Prod224', '0055', '00007464', '20170930.html']
    print(parts[2])  # this outputs '00007464'
    num_list.append(parts[2])
Assuming you have a certain pattern for your files, you can use a regex:
>>> import re
>>> s = 'Prod224_0055_00007464_20170930.html'
>>> desired_number = re.findall(r"\d+", s)[2]
>>> desired_number
'00007464'
Using a regex will help you get not only the specific number you want, but also the other numbers in the file name.
This will work if the names of your files follow the pattern "[some text][number]_[number]_[desired_number]_[a date].html". After getting the number, you can use the append method to add it to any list you want.
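If the directory may also contain files that do not follow this pattern, a stricter regex can skip them instead of raising an IndexError. This is only a sketch: the pattern below assumes the last field is always an 8-digit date, which is a guess based on the single example name, and extract_ids is a hypothetical helper.

```python
import re

# Assumed layout: <text>_<digits>_<wanted digits>_<8-digit date>.html
PATTERN = re.compile(r"[^_]+_\d+_(\d+)_\d{8}\.html")

def extract_ids(names):
    """Collect the third field of every name that fits the assumed layout."""
    ids = []
    for name in names:
        m = PATTERN.fullmatch(name)
        if m:  # silently skip names that do not fit
            ids.append(m.group(1))
    return ids

ids = extract_ids(["Prod224_0055_00007464_20170930.html", "index.html", "notes.txt"])
print(ids)
```

In practice the list of names would come from os.listdir, as in the answers above.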

Extract list of words from filenames

I need to get a list of words contained in some file names. Here are the files:
sub-Dzh_task-FmriPictures_space-MNI152NLin2009cAsym_desc-preproc_bold_mask-Language_sub01_component_ica_s1_.nii
sub-Dzh_task-FmriVernike_space-MNI152NLin2009cAsym_desc-preproc_bold_mask-Language_sub01_component_ica_s1_.nii
sub-Dzh_task-FmriWgWords_space-MNI152NLin2009cAsym_desc-preproc_bold_mask-Language_sub01_component_ica_s1_.nii
sub-Dzh_task-RestingState_space-MNI152NLin2009cAsym_desc-preproc_bold_mask-Language_sub01_component_ica_s1_.nii
I need to get the part that comes after task- and before the next _, so my list should look like this:
['FmriPictures','FmriVernike','FmriWgWords','RestingState']
how can I implement it in python3?
Here's a Python Solution for this which uses Regex.
>>> import re
>>> test_str = 'sub-Dzh_task-FmriPictures_space-MNI152NLin2009cAsym_desc-preproc_bold_mask-Language_sub01_component_ica_s1_.nii'
>>> re.search('task-(.*?)_', test_str).group(1)
'FmriPictures'
I think you can do the same for every string.
l=["sub-Dzh_task-FmriPictures_space-MNI152NLin2009cAsym_desc-preproc_bold_mask-Language_sub01_component_ica_s1_.nii",
"sub-Dzh_task-FmriVernike_space-MNI152NLin2009cAsym_desc-preproc_bold_mask-Language_sub01_component_ica_s1_.nii",
"sub-Dzh_task-FmriWgWords_space-MNI152NLin2009cAsym_desc-preproc_bold_mask-Language_sub01_component_ica_s1_.nii",
"sub-Dzh_task-RestingState_space-MNI152NLin2009cAsym_desc-preproc_bold_mask-Language_sub01_component_ica_s1_.nii"]
k=[]
for i in l:
    k.append(i.split('-')[2].replace("_space", ""))
print(k)
That's just one approach.
You can loop over your list and use regex to get the names from the strings like this example:
import re
a = ['sub-Dzh_task-FmriPictures_space-MNI152NLin2009cAsym_desc-preproc_bold_mask-Language_sub01_component_ica_s1_.nii',
'sub-Dzh_task-FmriVernike_space-MNI152NLin2009cAsym_desc-preproc_bold_mask-Language_sub01_component_ica_s1_.nii',
'sub-Dzh_task-FmriWgWords_space-MNI152NLin2009cAsym_desc-preproc_bold_mask-Language_sub01_component_ica_s1_.nii',
'sub-Dzh_task-RestingState_space-MNI152NLin2009cAsym_desc-preproc_bold_mask-Language_sub01_component_ica_s1_.nii']
out = []
for elm in a:
    condition = re.search(r'_task-(.*?)_', elm)
    if condition:
        out.append(condition.group(1))
print(out)
Output:
['FmriPictures', 'FmriVernike', 'FmriWgWords', 'RestingState']
I would just simply replace
sub-Dzh_task-
and
_space-MNI152NLin2009cAsym_desc-preproc_bold_mask-Language_sub01_component_ica_s1_.nii
with an empty string. Strip both parts out and you are left with the task names.
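The replace-based suggestion can be sketched as follows. It only works under the assumption that every file shares exactly the same prefix and suffix, as the four names in the question do; the variable names are illustrative.

```python
prefix = "sub-Dzh_task-"
suffix = ("_space-MNI152NLin2009cAsym_desc-preproc_bold_mask-"
          "Language_sub01_component_ica_s1_.nii")

names = [
    prefix + "FmriPictures" + suffix,
    prefix + "FmriVernike" + suffix,
    prefix + "FmriWgWords" + suffix,
    prefix + "RestingState" + suffix,
]

# Remove the fixed prefix and suffix, leaving only the task name.
tasks = [n.replace(prefix, "").replace(suffix, "") for n in names]
print(tasks)
```

If the subject ID or any other field varies between files, the regex answers above are the safer choice.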

Get file names into list using glob [duplicate]

I want to load file names into an array to load images from their paths. This job is done by the solution provided here. My code is something like this
fileNames = glob.glob(os.path.join(directory, "*.jpg"))
My filenames are something similar to this pattern
{videoNo}_{frameNo}_{patchNo}.jpg
For example their names are like this
1_1_1.jpg
1_1_2.jpg
1_1_3.jpg
.
.
.
10_1_1.jpg
10_1_2.jpg
When I load filenames into fileNames array, they are like this
10_1_1.jpg
10_1_2.jpg
.
.
.
1_1_1.jpg
1_1_2.jpg
1_1_3.jpg
As far as I know, this is because the ASCII code for _ is bigger than the code for 0, so the list of names is not sorted in numeric order. I must work with the sorted list. Can anyone give me a hand here?
EDIT
Please note that for these file names
["1_1_1.jpg", "10_1_3.jpg", "1_1_2.jpg", "10_1_2.jpg", "1_1_3.jpg", "1_20_1", "1_2_1", "1_14_1"]
the sorted result I want looks like this list
["1_1_1.jpg", "1_1_2.jpg", "1_1_3.jpg", "1_2_1.jpg", "1_14_1", "1_20_1", "10_1_2.jpg", "10_1_3"]
The sorted builtin and list.sort method both take a key parameter that specifies how to do the sorting. If you want to sort by the numbers in the name (i.e. videoNo, then frameNo, then patchNo) you can split each name into these numbers:
fileNames = sorted(
    glob.glob(os.path.join(directory, "*.jpg")),
    key=lambda item: [
        int(part)
        for part in os.path.splitext(os.path.basename(item))[0].split('_')
    ],
)
The key takes the base name (glob returns each path with the directory included), strips off the .jpg extension, then cuts the name on each _. Conversion to int is needed because strings use lexicographic sorting, e.g. "2" > "10".
You could use a regular expression to extract the numbers from the file names and sort by those:
>>> import re
>>> files = ["10_1_3.jpg", "1_10_2.jpg", "3_1_1.jpg", "30_1_2.jpg"]
>>> sorted(files, key=lambda f: tuple(map(int, re.findall(r"\d+", f))))
['1_10_2.jpg', '3_1_1.jpg', '10_1_3.jpg', '30_1_2.jpg']

split based on multiple numbers in python

Can you help me figure out how to split based on multiple/group of number as delimiter?
I have content in a file in below format:
data_file_10572_2018-02-15-12-57-29.file
header_file_13238_2018-02-15-12-57-48.file
sig_file1_17678_2018-02-15-12-57-14.file
Expected output:
data_file
header_file
sig_file1
I'm new to python and I'm not sure how to cut based on group of number. Thanks for the reply!!
I hope this will help you. The function finds the first element that consists only of digits and returns the string up to that element.
data = ['data_file_10572_2018-02-15-12-57-29.file', 'header_file_13238_2018-02-15-12-57-48.file', 'sig_file1_17678_2018-02-15-12-57-14.file']
def split_before_int(elem):
    parts = elem.split('_')
    for index, part in enumerate(parts):
        if part.isdigit():
            return '_'.join(parts[:index])
    return elem  # no purely numeric part found
for elem in data:
print(split_before_int(elem))
Output:
data_file
header_file
sig_file1
Split the name on the _ symbol, use Python list slicing (i.e. list[0:2]) to keep the first two pieces, and join them back together to get the substring up to the second _.
files = ['data_file_10572_2018-02-15-12-57-29.file', 'header_file_13238_2018-02-15-12-57-48.file','sig_file1_17678_2018-02-15-12-57-14.file']
cleaned_files = list(map(lambda file: '_'.join(file.split('_')[0:2]), files))
This results in:
['data_file', 'header_file', 'sig_file1']
You can use a regex to match up to the first _ that is followed by a digit, then split the match on "_" and join the elements, excluding the last:
Ex:
import re
a = "data_file_10572_2018-02-15-12-57-29.file"
print("_".join(re.match(r"(.*?)_\d", a).group().split("_")[:-1]))
output:
data_file
This code will work if all your filenames follow the pattern you described.
filename = 'data_file_10572_2018-02-15-12-57-29.file'
parts = filename.split('_')
new_filename = '_'.join(parts[:2])
If the alphabetical part of the file name has a variable number of underscores, it's better to use a regex.
import re
pattern = re.compile(r'_[0-9_-]{3,}\.file$')
re.sub(pattern, '', filename)
Output:
data_file
Essentially, it first creates a pattern that starts with _, followed by 3 or more characters that are digits, _ or -, and ends with .file.
Then you replace the largest substring of your string that follows this pattern with an empty string.
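To see how that pattern behaves when the name part has more than two underscore-separated pieces, here is a small sketch applied to the three names from the question plus one made-up name (my_long_name_... is hypothetical, added only for illustration). The dot before file is escaped so it matches a literal dot.

```python
import re

# "_" followed by 3+ digits, "_" or "-", then a literal ".file" at the end.
TRAILER = re.compile(r"_[0-9_-]{3,}\.file$")

names = [
    "data_file_10572_2018-02-15-12-57-29.file",
    "header_file_13238_2018-02-15-12-57-48.file",
    "sig_file1_17678_2018-02-15-12-57-14.file",
    "my_long_name_10572_2018-02-15-12-57-29.file",  # hypothetical extra case
]

prefixes = [TRAILER.sub("", n) for n in names]
print(prefixes)
```

The match starts at the first _ from which only digits, _ and - follow up to .file, which is why underscores inside the name part are left alone.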
