Extracting numbers from a filename string in python

I have a number of html files in a directory. I am trying to store the filenames in a list so that I can use it later to compare with another list.
Eg: Prod224_0055_00007464_20170930.html is one of the filenames. From the filename, I want to extract '00007464' and store this value in a list and repeat the same for all the other files in the directory. How do I go about doing this? I am new to Python and any help would be greatly appreciated!
Please let me know if you need more information to answer the question.

Split the filename on underscores and select the third element (index 2).
>>> 'Prod224_0055_00007464_20170930.html'.split('_')[2]
'00007464'
In context that might look like this:
nums = [f.split('_')[2] for f in os.listdir(directory) if f.endswith('.html')]  # requires "import os"; directory is the folder path
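A slightly more defensive variant of the list comprehension above, as a sketch. The helper name `extract_ids` and the choice to skip names with fewer than three underscore-separated parts are assumptions, not part of the original answer.

```python
def extract_ids(filenames, suffix=".html"):
    """Return the third underscore-separated field of each matching filename."""
    ids = []
    for name in filenames:
        if not name.endswith(suffix):
            continue  # ignore files with other extensions
        parts = name.split("_")
        if len(parts) < 3:
            continue  # assumption: silently skip names that do not fit the pattern
        ids.append(parts[2])
    return ids


names = ["Prod224_0055_00007464_20170930.html", "notes.txt", "index.html"]
print(extract_ids(names))  # ['00007464']
```

In practice the list of names would come from something like os.listdir on the target directory.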

You may try this (assuming you are in the folder with the files):
import os
num_list = []
r, d, files = next(os.walk('.'))  # first entry of the walk: the current folder
for f in files:
    parts = f.split('_')  # now `parts` contains ['Prod224', '0055', '00007464', '20170930.html']
    print(parts[2])  # this outputs '00007464'
    num_list.append(parts[2])

Assuming you have a certain pattern for your files, you can use a regex:
>>> import re
>>> s = 'Prod224_0055_00007464_20170930.html'
>>> desired_number = re.findall(r"\d+", s)[2]
>>> desired_number
'00007464'
Using a regex will help you get not only the specific number you want, but also the other numbers in the file name.
This will work if the names of your files follow the pattern "[some text][number]_[number]_[desired_number]_[a date].html". After getting the number, you can use the append method to add it to any list you want.
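If the files are expected to follow that pattern strictly, the pattern itself can be written into the regex so that unexpected names are rejected instead of yielding a wrong number. A sketch under that assumption; the names `FILENAME_RE` and `wanted_number` are made up for illustration, and the 8-digit date is inferred from the single example filename.

```python
import re

# Assumed layout: <text><digits>_<digits>_<wanted digits>_<8-digit date>.html
FILENAME_RE = re.compile(r"[A-Za-z]+\d+_\d+_(\d+)_\d{8}\.html")


def wanted_number(filename):
    """Return the third number as a string, or None if the name does not fit."""
    m = FILENAME_RE.fullmatch(filename)
    return m.group(1) if m else None


print(wanted_number("Prod224_0055_00007464_20170930.html"))  # 00007464
print(wanted_number("readme_2017.html"))  # None
```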

Related

searching specific string in list

How to search for every string in a list that starts with a specific string like:
path = (r"C:\Users\Example\Desktop")
desktop = os.listdir(path)
print(desktop)
#['faf.docx', 'faf.txt', 'faad.txt', 'gas.docx']
So my question is: how do I filter for every file that starts with "fa"?

For this specific case, involving filenames in one directory, you can use globbing:
import glob
import os
path = (r"C:\Users\Example\Desktop")
pattern = os.path.join(path, 'fa*')
files = glob.glob(pattern)
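Note that glob.glob returns paths that include the directory part. If the names are already in a list, such as the result of os.listdir, the same wildcard can be applied with the standard fnmatch module. A small sketch, reusing the sample list from the question:

```python
import fnmatch

names = ['faf.docx', 'faf.txt', 'faad.txt', 'gas.docx']

# fnmatchcase applies the same wildcard syntax as glob, case-sensitively
# on every platform, to names that are already in memory.
matches = [n for n in names if fnmatch.fnmatchcase(n, 'fa*')]
print(matches)  # ['faf.docx', 'faf.txt', 'faad.txt']
```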

This code selects all items that start with "fa" and stores them in a separate list:
filtered = [item for item in os.listdir(path) if item.startswith("fa")]

All strings have a .startswith() method!
results = []
for value in os.listdir(path):
    if value.startswith("fa"):
        results.append(value)

python glob to match a wider range

I am trying to match files on disk that end with .asm or .ASM, optionally followed by a one-, two-, or three-digit extension such as .asm.1 or .asm.11.
My Python code is:
asmFiles = glob.glob('*.asm') + glob.glob('*.ASM') + glob.glob('*.asm.[0-9]') + glob.glob('*.ASM.[0-9]')
How do I match a file such as '.asm.11'? My code only matches the single-digit case.
Thanks

Here is a solution using a Python regex and a list comprehension:
import re
files = ['foobar.asm', 'foobar.ASM', 'foobar.asm.1', 'foobar.ASM.11', 'foobarasm.csv']
# .asm or .ASM, optionally followed by a dot and one to three digits
asm_pattern = r'\.(asm|ASM)(\.[0-9]{1,3})?$'
asmFiles = [f for f in files if re.search(asm_pattern, f)]
for asmFile in asmFiles:
    print(asmFile)
The last element from list files is an edge case I thought about to test the search pattern. It does not appear in the result, as expected.
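To stay with glob-style wildcards, as in the question, one option is to generate a pattern per digit count, since that syntax has no "one to three digits" quantifier. The sketch below checks names in memory with fnmatch so it runs without touching the disk; the helper name `is_asm_file` and the extra sample names are illustrative.

```python
import fnmatch

# Build one pattern per extension and digit count:
# *.asm, *.asm.[0-9], *.asm.[0-9][0-9], *.asm.[0-9][0-9][0-9], same for ASM
patterns = []
for ext in ("asm", "ASM"):
    patterns.append("*." + ext)
    for digits in (1, 2, 3):
        patterns.append("*." + ext + "." + "[0-9]" * digits)


def is_asm_file(name):
    """Return True if the name matches any of the generated wildcard patterns."""
    return any(fnmatch.fnmatchcase(name, p) for p in patterns)


files = ['foobar.asm', 'foobar.ASM', 'foobar.asm.1', 'foobar.ASM.11',
         'foobar.asm.123', 'foobar.asm.1234', 'foobarasm.csv']
print([f for f in files if is_asm_file(f)])
```

The same pattern strings can be passed to glob.glob one at a time and the results concatenated.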

Python: how to search for specific "string" in directory name (not individual file names)

I want to create a list of all the filepath names that match a specific string e.g. "04_DEM" so I can do further processing on the files inside those directories?
e.g.
INPUT
C:\directory\NewZealand\04DEM\DEM_CD23_1232.tif
C:\directory\Australia\04DEM\DEM_CD23_1233.tif
C:\directory\NewZealand\05DSM\DSM_CD23_1232.tif
C:\directory\Australia\05DSM\DSM_CD23_1232.tif
WANTED OUTPUT
C:\directory\NewZealand\04DEM\
C:\directory\Australia\04DEM\
This makes sure that only those files are processed, as some other files in the directories also have the same string "DEM" included in their filename, which I do not want to modify.
This is my bad attempt due to being a rookie with Py code
import os
for dirnames in os.walk('D:\Canterbury_2017Copy'):
print dirnames
if dirnames=='04_DEM' > listofdirectoriestoprocess.txt
print "DONE CHECK TEXT FILE"

You can use os.path for this:
import os
lst = [r'C:\directory\NewZealand\04DEM\DEM_CD23_1232.tif',
       r'C:\directory\Australia\04DEM\DEM_CD23_1233.tif',
       r'C:\directory\NewZealand\05DSM\DSM_CD23_1232.tif',
       r'C:\directory\Australia\05DSM\DSM_CD23_1232.tif']
def filter_paths(lst, x):
    # index 3 assumes the folder of interest is always the fourth path component;
    # splitting on os.sep assumes this runs on Windows, where os.sep is '\\'
    return [os.path.split(i)[0] for i in lst if os.path.normpath(i).split(os.sep)[3] == x]
res = filter_paths(lst, '04DEM')
# ['C:\\directory\\NewZealand\\04DEM',
#  'C:\\directory\\Australia\\04DEM']

Use in to check if a required string is in another string.
This is one quick way:
new_list = []
for path in path_list:
    if '04DEM' in path:
        new_list.append(path)
Demo:
s = 'C:/directory/NewZealand/04DEM/DEM_CD23_1232.tif'
if '04DEM' in s:
    print(True)
# True
Make sure you use / or \\ as directory separator instead of \ because the latter escapes characters.

First, select the matching paths with a regex using re, then use pathlib to walk each path's components:
import os
import re
import pathlib
pattern = re.compile('04DEM')
# Use pattern.search() to find the pattern anywhere in the string.
# Use pattern.fullmatch() if the pattern must match the whole string
# (pattern.match() only anchors at the start).
# Apply the correct function to your use case.
files = [s for s in list_of_files if pattern.search(s)]
all_pruned_paths = set()
for p in files:
    total = ""
    for d in pathlib.Path(p).parts:
        total = os.path.join(total, d)
        if pattern.search(d):
            break  # stop at the first component that matches
    all_pruned_paths.add(total)
result = list(all_pruned_paths)
This is more robust than using in because you might need to form more complicated queries in the future.
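To match a whole directory name rather than a substring, the path can be split into its components and compared exactly. A sketch using pathlib.PureWindowsPath, which parses Windows-style paths on any platform; the function name `dirs_named` is an assumption for illustration.

```python
from pathlib import PureWindowsPath


def dirs_named(paths, dirname):
    """Return unique parent directories that have a folder called dirname in their path."""
    found = []
    for p in paths:
        parent = PureWindowsPath(p).parent
        # parent.parts holds the drive and each directory name separately
        if dirname in parent.parts and str(parent) not in found:
            found.append(str(parent))
    return found


paths = [r'C:\directory\NewZealand\04DEM\DEM_CD23_1232.tif',
         r'C:\directory\Australia\04DEM\DEM_CD23_1233.tif',
         r'C:\directory\NewZealand\05DSM\DSM_CD23_1232.tif',
         r'C:\directory\Australia\05DSM\DSM_CD23_1232.tif']
for d in dirs_named(paths, '04DEM'):
    print(d)
```

Because the comparison is against whole components, a search for 'DEM' returns nothing here, even though that string occurs inside both folder names and filenames.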

How to identify files that have increasing numbers and a similar form of filename?

I have a directory of files, some of them image files. Some of those image files are a sequence of images. They could be named image-000001.png, image-000002.png and so on, or perhaps 001_sequence.png, 002_sequence.png etc.
How can we identify images that would, to a human, appear by their names to be fairly obviously in a sequence? This would mean identifying only those image filenames that have increasing numbers and all have a similar form of filename.
The similar part of the filename would not be pre-defined.

You can use a regular expression to get files adhering to a certain pattern, e.g. .*\d+.*\.(jpg|png) for anything, then a number, then more anything, and an image extension.
import re
files = ["image-000001.png", "image-000002.png", "001_sequence.png",
         "002_sequence.png", "not an image 1.doc", "not an image 2.doc",
         "other stuff.txt", "singular image.jpg"]
image_files = [f for f in files if re.match(r".*\d+.*\.(jpg|png)", f)]
Now, group those image files by replacing the number with some generic string, e.g. XXX:
import collections
patterns = collections.defaultdict(list)
for f in image_files:
    p = re.sub(r"\d+", "XXX", f)
    patterns[p].append(f)
As a result, patterns is
{'image-XXX.png': ['image-000001.png', 'image-000002.png'],
'XXX_sequence.png': ['001_sequence.png', '002_sequence.png']}
Similarly, it should not be too hard to check whether all those numbers are consecutive, but maybe that's not really necessary after all. Note, however, that this will have problems discriminating numbered series such as "series1_001.jpg", and "series2_001.jpg".
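The consecutiveness check mentioned above could look roughly like this. It is a sketch, not the answerer's code: the function name `consecutive_groups` is invented, and restricting the check to names containing exactly one number is an assumption made to keep the example simple.

```python
import re
from collections import defaultdict


def consecutive_groups(filenames):
    """Group names by their shape and keep groups whose numbers run without gaps."""
    groups = defaultdict(list)
    for name in filenames:
        numbers = re.findall(r"\d+", name)
        if len(numbers) != 1:
            continue  # assumption: only names with exactly one number are considered
        shape = re.sub(r"\d+", "XXX", name)
        groups[shape].append(int(numbers[0]))
    result = {}
    for shape, nums in groups.items():
        nums.sort()
        # consecutive means each number is one more than the previous
        if len(nums) > 1 and all(b - a == 1 for a, b in zip(nums, nums[1:])):
            result[shape] = nums
    return result


files = ["image-000001.png", "image-000002.png", "001_sequence.png",
         "002_sequence.png", "scan_7.png", "scan_9.png"]
print(consecutive_groups(files))
```

Here the two "scan" files are dropped because 7 and 9 are not consecutive.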

What I would suggest is to run a regex over the files and group each matching pattern with the list of associated numbers from the file names.
Once this is done, just loop through the dictionary's keys and check that the number of elements equals the size of the range of matched numbers.
import re
from collections import defaultdict
from os import listdir
files = listdir("/the/path/")
found_patterns = defaultdict(list)
p = re.compile(r"(.*?)(\d+)(.*)\.png")
for f in files:
    s = p.match(f)
    if s:
        # key is the filename with its first number replaced by "___"
        pattern = s.group(1) + "___" + s.group(3)
        num = int(s.group(2))
        found_patterns[pattern].append(num)
for pattern, found in found_patterns.items():
    mini, maxi = min(found), max(found)
    # consecutive numbers fill the whole range from mini to maxi
    if len(found) == maxi - mini + 1:
        print("Pattern correct: %s" % pattern)
Of course, this will not work if some values are missing, but you can build some tolerance into the check.
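One way to read that tolerance is to allow a limited share of the expected numbers to be absent. A sketch under that assumption; the function name `is_sequence` and the default of 10 percent are arbitrary choices, not something the answer specifies.

```python
def is_sequence(numbers, max_missing_ratio=0.1):
    """Return True if the numbers cover their min..max range with few gaps."""
    unique = set(numbers)
    if len(unique) < 2:
        return False  # a single file is not a sequence
    expected = max(unique) - min(unique) + 1
    missing = expected - len(unique)
    return missing / expected <= max_missing_ratio


print(is_sequence([1, 2, 3, 4, 5]))               # True, no gaps
print(is_sequence([1, 2, 3, 4, 5, 6, 7, 8, 10]))  # True, 1 of 10 missing
print(is_sequence([1, 5, 9]))                     # False, 6 of 9 missing
```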

Extract and sort numbers from filenames in python

I have a very basic question. I have files named like Dipole_E0=1.2625E-01.dat and I want to extract the 1.2625E-01 part and finally sort the values in ascending order. How can this be done? I first tried to split the filename with .split(), but it does not do what I expect. Thanks for your help.
Best
Roland

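The best way is to use a regexp. To obtain the value from a file name:
import re
m = re.search(r'^Dipole_E0=(.*)\.dat$', filename)  # pattern first, then the string to search
val = m.group(1)  # group(1) is the captured value; group(0) would be the whole match
Walk through all filenames and append each value (converted with float()) to a list. After that, sort the list and that's all.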

You want to look into regular expressions. In Python they live in the re module. Depending on the exact format, something like:
import re
ematch = re.compile(r"=([0-9]*\.[0-9]*[eE][+-][0-9]+)")
val = ematch.search(filename).group(1)  # group(1) is the number without the "="
Sorting a list can be done with the .sort() method on lists, or the sorted(list) builtin, which gives you a new list.

This is a good situation to use a generator expression and the sorted builtin:
sorted(float(filename.split("=", 1)[1].rsplit(".", 1)[0]) for filename in filenames)
Where filenames is your list of filenames.
>>> filenames = ["Dipole_E0=1.2625E-01.dat", "Dipole_E0=1.3625E-01.dat", "Dipole_E0=0.2625E-01.dat"]
>>> sorted(float(filename.split("=", 1)[1].rsplit(".", 1)[0]) for filename in filenames)
[0.02625, 0.12625, 0.13625]
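If the filenames themselves should stay available after sorting, the same extraction can serve as a sort key. A sketch; the helper name `field_value` is an assumption.

```python
def field_value(filename):
    """Parse the number between '=' and the final extension as a float."""
    return float(filename.split("=", 1)[1].rsplit(".", 1)[0])


filenames = ["Dipole_E0=1.2625E-01.dat", "Dipole_E0=1.3625E-01.dat",
             "Dipole_E0=0.2625E-01.dat"]
for name in sorted(filenames, key=field_value):
    print(name)
```

This prints the file with 0.2625E-01 first and the one with 1.3625E-01 last.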

You can get the filenames with the glob module.
from glob import glob
file_names = glob("yourpath/*.dat")
vals = []
for name in file_names:
    vals.append(float(name[:-4].rpartition("=")[2]))
vals.sort()
name[:-4] throws away the ".dat". rpartition is a string method. It returns a tuple where entry 0 is the string left of the string used to split, entry 1 is the string used to split (here: "=") and entry 2 is the string right of this string (here: your float). Then it is converted to a float and appended to the list of values.
