Find specific substring while iterating through multiple file names

I need to find the identification number of a large number of files while iterating through them.
The file names are loaded into a list and look like:
ID322198.nii
ID9828731.nii
ID23890.nii
FILEID988312.nii
So the best approach would be to find the number that sits between ID and .nii.
Because the number of digits varies, I can't simply select [-10:-4] of the file name. Any ideas?

You can use a regex:
import re
files = ['ID322198.nii','ID9828731.nii','ID23890.nii','FILEID988312.nii']
[re.findall(r'ID(\d+)\.nii', file)[0] for file in files]
Returns:
['322198', '9828731', '23890', '988312']
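One caveat with the list comprehension above: findall(...)[0] raises an IndexError for any name that does not follow the pattern. A minimal sketch that skips such names instead, assuming that is the behavior you want (the helper name extract_ids and the extra 'README.txt' entry are illustrative, not from the question):

```python
import re

# Digits between "ID" and a trailing ".nii"
ID_PATTERN = re.compile(r'ID(\d+)\.nii$')

def extract_ids(names):
    """Return the ID digits for every name that matches the pattern."""
    ids = []
    for name in names:
        match = ID_PATTERN.search(name)
        if match:  # names that do not match are skipped
            ids.append(match.group(1))
    return ids

files = ['ID322198.nii', 'ID9828731.nii', 'ID23890.nii',
         'FILEID988312.nii', 'README.txt']
print(extract_ids(files))
```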

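To find the positions of ID and .nii, you can use Python's str.index() method:
for line in file:
    idpos = line.index("ID") + 2    # first character after "ID"
    niipos = line.index(".nii")     # start of the extension
    data = line[idpos:niipos]
or as a list of ints:
[int(line[line.index("ID") + 2:line.index(".nii")]) for line in file]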

Using rindex:
s = 'ID322198.nii'
s = s[s.rindex('D')+1 : s.rindex('.')]
print(s)
Returns:
322198
Then apply this syntax to a list of strings.

It seems like you could filter the digits out, like this:
digits = ''.join(d for d in filename if d.isdigit())
That will work nicely as long as there are no other digits in the filename (e.g. backups with a .1 suffix or something).
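To make that caveat concrete, here is a small sketch comparing the digit filter with a search anchored on "ID". The helper names digits_only and id_digits and the '.1' backup name are assumptions for illustration:

```python
import re

def digits_only(filename):
    """Keep every digit in the name, wherever it appears."""
    return ''.join(ch for ch in filename if ch.isdigit())

def id_digits(filename):
    """Keep only the run of digits that directly follows 'ID'."""
    match = re.search(r'ID(\d+)', filename)
    return match.group(1) if match else None

print(digits_only('ID322198.nii'))    # 322198
print(digits_only('ID322198.nii.1'))  # 3221981 (the backup suffix leaks in)
print(id_digits('ID322198.nii.1'))    # 322198
```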

for name in files:
    name = name.replace('.nii', '')
    id_num = name.replace(name.rstrip('0123456789'), '')
How this works:
# example
name = 'ID322198.nii'
# remove '.nii'. -> name1 = 'ID322198'
name1 = name.replace('.nii', '')
# strip all digits from the end. -> name2 = 'ID'
name2 = name1.rstrip('0123456789')
# remove 'ID' from 'ID322198'. -> id_num = '322198'
id_num = name1.replace(name2, '')

Related

How to delete paths that contain the same names?

I have a list of paths that look like this (see below). As you can see, file-naming is inconsistent, but I would like to keep only one file per person. I already have a function that removes duplicates if they have the exact same file name but different file extensions, however, with this inconsistent file-naming case it seems trickier.
The list of files looks something like this (but assume there are thousands of paths and words that aren't part of the full names e.g. cv, curriculum vitae etc.):
all_files =
['cv_bob_johnson.pdf',
'bob_johnson_cv.pdf',
'curriculum_vitae_bob_johnson.pdf',
'cv_lara_kroft_cv.pdf',
'cv_lara_kroft.pdf' ]
Desired output:
unique_files = ['cv_bob_johnson.pdf', 'cv_lara_kroft.pdf']
Given that the names are somewhat in a written pattern most of the time (e.g. first name precedes last name), I assume there has to be a way of getting a unique set of the paths if the names are repeated?

If you want to keep your algorithm relatively simple (i.e., not using ML etc), you'll need to have some idea about the typical substrings that you want to remove. Let's make a list of such substrings, for example:
remove = ['cv_', '_cv', 'curriculum_vitae_', '_curriculum_vitae']
Then you can process your list of files this way:
import re
all_files = ['cv_bob_johnson.pdf', 'bob_johnson_cv.pdf', 'curriculum_vitae_bob_johnson.pdf', 'cv_lara_kroft_cv.pdf', 'cv_lara_kroft.pdf']
remove = ['cv_', '_cv', 'curriculum_vitae_', '_curriculum_vitae']
unique = []
for file in all_files:
    # split off the extension, if any:
    try:
        name, suffix = file.rsplit('.', 1)
    except ValueError:
        name, suffix = file, None
    # remove the excess parts:
    for rem in remove:
        name = re.sub(rem, '', name)
    # append the result to the list:
    unique.append(f'{name}.{suffix}' if suffix else name)
# remove duplicates:
unique = list(set(unique))
print(unique)
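Note that the code above returns the cleaned names rather than the original paths, and set() does not preserve order. If you want to keep one of the original paths per person, as in the desired output, a rough sketch could group paths by a normalized key. The filler-word set, the helper names, and the "keep the shortest path" tie-break are all assumptions, not something the question specifies:

```python
# Words assumed not to be part of a person's name; extend as needed.
FILLER_WORDS = {'cv', 'curriculum', 'vitae'}

def person_key(path):
    """Normalize a path to the name parts only, e.g. 'bob_johnson'."""
    stem = path.rsplit('.', 1)[0]
    words = stem.lower().split('_')
    return '_'.join(w for w in words if w not in FILLER_WORDS)

def one_file_per_person(paths):
    """Keep the shortest original path for each person (first one wins ties)."""
    best = {}
    for path in paths:
        key = person_key(path)
        if key not in best or len(path) < len(best[key]):
            best[key] = path
    return list(best.values())

all_files = ['cv_bob_johnson.pdf', 'bob_johnson_cv.pdf',
             'curriculum_vitae_bob_johnson.pdf',
             'cv_lara_kroft_cv.pdf', 'cv_lara_kroft.pdf']
print(one_file_per_person(all_files))
```

For the sample list this prints ['cv_bob_johnson.pdf', 'cv_lara_kroft.pdf'].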

How to extract first part of a file name?

Newbie to Python here. I've been trying to iterate through filenames in a loop and grab the first part of the file name with Python.
My file names are structured as such: "Pitt_0050003_rest.nii.gz". I only want the "Pitt_0050003" part (keep in mind, the file names are various lengths).
Here's the code I've been trying:
fileid = []
for f in dataset:
    #print(f)
    comp=f.split('/')
    fs = (comp[-1]) #get the file name without nii.gz extension
    res = re.findall("_rest.nii(\d-)", f) #get the file name without _rest?
    if not res: continue
    fileid.append(res)
print (fileid)
Any tips?

If all your files have '_rest' in the name, you can try this:
string = "Pitt_0050003_rest.nii.gz"
string = string[:string.index('_rest')]
# string is now 'Pitt_0050003'
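Since the question iterates over full paths, here is a hedged sketch that applies the same idea inside a loop. It uses str.partition so that a name without '_rest' is returned unchanged rather than raising ValueError; the function name file_id and the '/data/...' paths are made up for illustration:

```python
import os

def file_id(path, marker='_rest'):
    """Return the part of the file name before `marker`."""
    name = os.path.basename(path)         # drop any leading directories
    head, sep, _ = name.partition(marker)
    return head if sep else name          # marker missing: keep the name as is

dataset = ['/data/Pitt_0050003_rest.nii.gz', '/data/NYU_51_rest.nii.gz']
fileid = [file_id(f) for f in dataset]
print(fileid)
```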

You can split on the underscore and drop the last element, provided the naming convention is the same for all of your filenames.
> myfile = "Pitt_0050003_rest.nii.gz"
> first_name = myfile.split('_')
> first_name
['Pitt', '0050003', 'rest.nii.gz']
> first_name.pop()
'rest.nii.gz'
>
> first_name
['Pitt', '0050003']
>
> '_'.join(first_name)
'Pitt_0050003'
>

Get the full word(s) by knowing only just a part of it

I am searching through a text file line by line and I want to get back all strings that contain the prefix AAAXX1234. For example, my text file has these lines:
Hello my ID is [123423819::AAAXX1234_3412] #I want that(AAAXX1234_3412)
Hello my ID is [738281937::AAAXX1234_3413:AAAXX1234_4212] #I want both of them(AAAXX1234_3413, AAAXX1234_4212)
Hello my ID is [123423819::XXWWF1234_3098] #I don't care about that
The code I have so far just checks whether the line starts with "Hello my ID is":
with open(file_hrd,'r',encoding='utf-8') as hrd:
    hrd=hrd.readlines()
for line in hrd:
    if line.startswith("Hello my ID is"):
        #do something

Try this:
import re
with open(file_hrd,'r',encoding='utf-8') as hrd:
    res = []
    for line in hrd:
        res += re.findall(r'AAAXX1234_\d+', line)
print(res)
Output:
['AAAXX1234_3412', 'AAAXX1234_3413', 'AAAXX1234_4212']

I’d suggest parsing your lines and extracting the information into meaningful parts. That way, you can use a simple startswith on the ID part of the line. This also lets you control where you look for these prefixes, e.g. in case a line contains additional data that could also contain something that looks like an ID.
Something like this:
if line.startswith('Hello my ID is '):
    idx_start = line.index('[')
    idx_end = line.index(']', idx_start)
    idx_separator = line.index(':', idx_start, idx_end)
    num = line[idx_start + 1:idx_separator]
    ids = line[idx_separator + 2:idx_end].split(':')
    print(num, ids)
This would give you the following output for your three example lines:
123423819 ['AAAXX1234_3412']
738281937 ['AAAXX1234_3413', 'AAAXX1234_4212']
123423819 ['XXWWF1234_3098']
With that information, you can then check the ids for a prefix:
if any(x.startswith('AAAXX1234') for x in ids):
    print('do something')

Using regular expressions through the re module and its findall() function should be enough:
import re
with open('file.txt') as file:
    prefix = 'AAAXX1234'
    lines = file.read().splitlines()
    output = list()
    for line in lines:
        output.extend(re.findall(rf'{prefix}_\d+', line))

You can do it with findall and the regex r'AAAXX1234_[0-9]+'. It finds every part of the string that starts with AAAXX1234_ and then grabs all of the digits after it. Change + to * if you also want it to match 'AAAXX1234_' on its own.
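A short sketch of the difference between + and *, using a made-up line that contains a bare 'AAAXX1234_' with no digits after it:

```python
import re

# Hypothetical line: one full ID, one bare prefix, one unrelated ID
line = 'Hello my ID is [738281937::AAAXX1234_3413:AAAXX1234_:XXWWF1234_3098]'

plus_matches = re.findall(r'AAAXX1234_[0-9]+', line)  # at least one digit required
star_matches = re.findall(r'AAAXX1234_[0-9]*', line)  # digits are optional

print(plus_matches)  # ['AAAXX1234_3413']
print(star_matches)  # ['AAAXX1234_3413', 'AAAXX1234_']
```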

split based on multiple numbers in python

Can you help me figure out how to split based on multiple/group of number as delimiter?
I have content in a file in below format:
data_file_10572_2018-02-15-12-57-29.file
header_file_13238_2018-02-15-12-57-48.file
sig_file1_17678_2018-02-15-12-57-14.file
Expected output:
data_file
header_file
sig_file1
I'm new to python and I'm not sure how to cut based on group of number. Thanks for the reply!!

I hope this will help you. The function finds the first element that consists only of digits and returns the string up to that element.
data = ['data_file_10572_2018-02-15-12-57-29.file', 'header_file_13238_2018-02-15-12-57-48.file', 'sig_file1_17678_2018-02-15-12-57-14.file']
def split_before_int(elem):
    parts = elem.split('_')
    for i, part in enumerate(parts):
        # split() always yields strings, so test the content, not the type
        if part.isdigit():
            return '_'.join(parts[:i])
    return elem  # no purely numeric part found
for elem in data:
    print(split_before_int(elem))
Output:
data_file
header_file
sig_file1

First use index to get the location of the second _ symbol, then use Python slicing (i.e. s[0:5]) to get the substring up to that second _.
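A minimal sketch of that approach, assuming every name contains at least two underscores (str.index raises ValueError otherwise). The function name up_to_second_underscore is illustrative:

```python
def up_to_second_underscore(name):
    """Return everything before the second '_' in name."""
    first = name.index('_')               # position of the first underscore
    second = name.index('_', first + 1)   # search again, starting after it
    return name[:second]

files = ['data_file_10572_2018-02-15-12-57-29.file',
         'header_file_13238_2018-02-15-12-57-48.file',
         'sig_file1_17678_2018-02-15-12-57-14.file']
for f in files:
    print(up_to_second_underscore(f))
```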

files = ['data_file_10572_2018-02-15-12-57-29.file', 'header_file_13238_2018-02-15-12-57-48.file','sig_file1_17678_2018-02-15-12-57-14.file']
cleaned_files = list(map(lambda file: '_'.join(file.split('_')[0:2]), files))
This results in:
['data_file', 'header_file', 'sig_file1']

You can match up to the first "_" followed by a digit with a regex, split the match on "_", and then join the elements excluding the last.
Ex:
import re
a = "data_file_10572_2018-02-15-12-57-29.file"
print("_".join(re.match(r"(.*?)_\d", a).group().split("_")[:-1]))
output:
data_file

This code will work if all your filenames follow the pattern you described.
filename = 'data_file_10572_2018-02-15-12-57-29.file'
parts = filename.split('_')
new_filename = '_'.join(parts[:2])
If the alphabetical part of the file name has a variable number of underscores, it's better to use a regex.
import re
pattern = re.compile(r'_[0-9_-]{3,}\.file$')
re.sub(pattern, '', filename)
Output:
data_file
Essentially, it first creates a pattern that starts with _, followed by 3 or more digits, _ or - characters, and ends with .file.
Then it replaces the substring of your string that matches this pattern with an empty string.

stripping a pattern from the end of the string

I want to see if a file like test_100.webp exists and then look at the file test.yaml. Therefore, I need to strip the pattern "_100.webp" from the end. I tried to use the code below and it is giving me issues.
for i, image in enumerate(images_in_item):
    if image.endswith("_100.webp"):
        image_strip = image.rstrip(_100.webp)
        snapshot_markup = os.path.join(image_strip + 'yaml')

Do this:
suffix = '_100.webp'
if image.endswith(suffix):
    image_strip = image[:-len(suffix)]
    # 'test_100.webp' -> 'test', so add the dot back for 'test.yaml'
    snapshot_markup = os.path.join(image_strip + '.yaml')
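The reason rstrip is the wrong tool here is that its argument is treated as a set of characters to strip, not as a suffix. A small sketch of the pitfall, with a hypothetical helper strip_suffix and a made-up file name 'web_100.webp' (on Python 3.9 and later, the built-in str.removesuffix does the same job as the helper):

```python
def strip_suffix(text, suffix):
    """Remove suffix from the end of text if present, else return text unchanged."""
    if suffix and text.endswith(suffix):
        return text[:-len(suffix)]
    return text

name = 'web_100.webp'
# rstrip removes any trailing characters from the set {_, 1, 0, ., w, e, b, p},
# and every character of this name is in that set
print(repr(name.rstrip('_100.webp')))       # ''
print(strip_suffix(name, '_100.webp'))      # web
print(strip_suffix('test_100.webp', '_100.webp') + '.yaml')  # test.yaml
```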
