python regex: Parsing file name


I have a text file (filenames.txt) that contains the file name with its file extension.
filename.txt
[AW] One Piece - 629 [1080P][Dub].mkv
EP.585.1080p.mp4
EP609.m4v
EP 610.m4v
One Piece 0696 A Tearful Reunion! Rebecca and Kyros!.mp4
One_Piece_0745_Sons'_Cups!.mp4
One Piece - 591 (1080P Funi Web-Dl -Ks-)-1.m4v
One Piece - 621 1080P.mkv
One_Piece_S10E577_Zs_Ambition_A_Great_and_Desperate_Escape_Plan.mp4
These are example file names with their extensions. I need to rename each file to its episode number (without changing its extension).
Example:
Input:
EP609.m4v
EP 610.m4v
EP.585.1080p.mp4
One Piece - 621 1080P.mkv
[AW] One Piece - 629 [1080P][Dub].mkv
One_Piece_0745_Sons'_Cups!.mp4
One Piece 0696 A Tearful Reunion! Rebecca and Kyros!.mp4
One Piece - 591 (1080P Funi Web-Dl -Ks-)-1.m4v
One_Piece_S10E577_Zs_Ambition_A_Great_and_Desperate_Escape_Plan.mp4
Expected Output:
609.m4v
610.m4v
585.mp4
621.mkv
629.mkv
745.mp4 (or) 0745.mp4
696.mp4 (or) 0696.mp4
591.m4v
577.mp4
Hope someone will help me parse and rename these filenames. Thanks in advance!!!

Since you tagged the question python, I assume you are willing to use Python.
(Edit: I've realized a loop in my original code is unnecessary.)
import re
with open('filename.txt', 'r') as f:
    files = f.read().splitlines() # read filenames
# assume: an episode number consists of 3 digits, possibly preceded by 0
p = re.compile(r'0?(\d{3})')
for file in files:
    if m := p.search(file):
        print(m.group(1) + '.' + file.split('.')[-1])
    else:
        print(file)
This will output
609.m4v
610.m4v
585.mp4
621.mkv
629.mkv
745.mp4
696.mp4
591.m4v
577.mp4
Basically, it searches for the first 3-digit number, possibly preceded by 0.
I strongly advise you to check the output; in particular, you would want to run sort OUTPUTFILENAME | uniq -d to see whether there are duplicate target names.
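The duplicate check suggested above can also be done in Python before any file is renamed. The following is a minimal sketch, assuming the same 3-digit rule as this answer; `target_name` and `duplicate_targets` are hypothetical helper names, not part of the original code.

```python
import re
from collections import Counter

def target_name(file_name):
    """Return '<episode>.<ext>' from the first 3-digit run, or the name unchanged."""
    m = re.search(r'0?(\d{3})', file_name)
    if m is None:
        return file_name
    return m.group(1) + '.' + file_name.split('.')[-1]

def duplicate_targets(file_names):
    """Return the target names that more than one input file would map to."""
    counts = Counter(target_name(name) for name in file_names)
    return sorted(name for name, n in counts.items() if n > 1)

# Made-up sample: two inputs collide on the same target name.
sample = ['EP609.m4v', 'EP 609.m4v', 'One Piece - 621 1080P.mkv']
print(duplicate_targets(sample))  # ['609.m4v']
```

An empty list means every target name is unique under this rule.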
(Original answer:)
p = re.compile(r'\d{3,4}')
for file in files:
    for m in p.finditer(file):
        ep = m.group(0)
        if int(ep) < 1000:
            print(ep.lstrip('0') + '.' + file.split('.')[-1])
            break # go to next file if ep found (avoid the else clause)
    else: # if ep not found, just print the filename as is
        print(file)

A program that parses the episode number from each file name and renames the file.
Modules used:
re - to parse the file name
os - to rename the file
full/path/to/folder - the path to the folder where your files live
import re
import os
for file in os.listdir(path="full/path/to/folder/"):
    # search for the first 3- or 4-digit number less than 1000 in each file name
    for match_obj in re.finditer(r'\d{3,4}', file):
        episode = match_obj.group(0)
        if int(episode) < 1000:
            new_filename = episode.lstrip('0') + '.' + file.split('.')[-1]
            old_name = "full/path/to/folder/" + file
            new_name = "full/path/to/folder/" + new_filename
            os.rename(old_name, new_name)
            # go to next file if ep found (avoid the else clause)
            break
    else:
        # if episode not found, just leave the filename as it is
        pass
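Because `os.rename` can overwrite or fail when two files map to the same episode number, it may help to compute the renames first and apply them afterwards. The sketch below is one possible way to do that, not the answer's method: `plan_renames` is a hypothetical helper, it searches only the part before the extension, and it uses `str(int(...))` instead of `lstrip('0')` so that a run such as `000` still yields a non-empty name.

```python
import os
import re

def plan_renames(file_names):
    """Map old names to new names, skipping any target that is already taken."""
    plan = {}
    taken = set(file_names)  # names that exist or are already planned
    for name in file_names:
        stem, ext = os.path.splitext(name)
        for m in re.finditer(r'\d{3,4}', stem):
            if int(m.group(0)) < 1000:
                new_name = str(int(m.group(0))) + ext
                if new_name not in taken:
                    plan[name] = new_name
                    taken.add(new_name)
                break  # only the first qualifying number counts
    return plan

names = ['EP609.m4v', 'EP 609.m4v', 'EP.585.1080p.mp4']
print(plan_renames(names))
# {'EP609.m4v': '609.m4v', 'EP.585.1080p.mp4': '585.mp4'}
```

Each planned pair could then be applied with `os.rename(os.path.join(folder, old), os.path.join(folder, new))`, where `folder` is the directory that was listed.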

Related

Python - Check for exact string in file name

I have a folder where each file is named after a number (e.g. img 1, img 2, img-3, 4-img, etc.). I want to get files by exact number, so if I enter '4' as input, it should only return files containing '4' and not files containing '14' or '40', for example. My problem is that the program returns every file whose name contains the string. Note that the number isn't always in the same spot (for some files it's at the end, for others it's in the middle).
For instance, if my folder has the files ['ep 4', 'xxx 3 ', 'img4', '4xxx', 'ep-40', 'file.mp4', 'file 4.mp4', 'ep.4.', 'ep.4 ', 'ep. 4. ', 'ep4xxx', 'ep 4 ', '404ep'], and I want only files with the exact number 4 in them, then I would only want to return ['ep 4', 'img4', '4xxx', 'file 4.mp4', 'ep.4.', 'ep.4 ', 'ep. 4. ', 'ep4xxx', 'ep 4 ', '404ep'].
Here is what I have (in this case I only want to return .mp4 files):
for (root, dirs, file) in os.walk(source_folder):
    for f in file:
        if '.mp4' and ('4') in f:
            print(f)
I also tried == instead of in.

Judging by your inputs, your desired regular expression needs to meet the following criteria:
Match the number provided, exactly
Ignore number matches in the file extension, if present
Handle file names that include spaces
I think this will meet all these requirements:
import re
def generate(n):
    return re.compile(r'^[^.\d]*' + str(n) + r'[^.\d]*(\..*)?$')
def check_files(n, files):
    regex = generate(n)
    return [f for f in files if regex.fullmatch(f)]
Usage:
>>> check_files(4, ['ep 4', 'xxx 3 ', 'img4', '4xxx', 'ep-40', 'file.mp4', 'file 4.mp4'])
['ep 4', 'img4', '4xxx', 'file 4.mp4']
Note that this solution involves creating a Pattern object and using that object to check each file. This strategy offers a performance benefit over calling re.fullmatch with the pattern and filename directly, as the pattern does not have to be compiled for each call.
This solution does have one drawback: it assumes that filenames are formatted as name.extension and that the value you're searching for is in the name part. Because of the greedy nature of regular expressions, if you allow for file names with . then you won't be able to exclude extensions from the search. Ergo, modifying this to match ep.4 would also cause it to match file.mp4. That being said, there is a workaround for this, which is to strip extensions from the file name before doing the match:
def generate(n):
    return re.compile(r'^[^\d]*' + str(n) + r'[^\d]*$')
def strip_extension(f):
    return f.removesuffix('.mp4')
def check_files(n, files):
    regex = generate(n)
    return [f for f in files if regex.fullmatch(strip_extension(f))]
Note that this solution now includes the . in the match condition and does not exclude an extension. Instead, it relies on preprocessing (the strip_extension function) to remove any file extensions from the filename before matching.
As an addendum, occasionally you'll get files that have the number prefixed with zeroes (e.g. 004, 0001, etc.). You can modify the regular expression to handle this case as well:
def generate(n):
    return re.compile(r'^[^\d]*0*' + str(n) + r'[^\d]*$')
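A quick way to see what the zero-padded pattern accepts is to run it against a few names. The list below is made up for illustration and assumes the extensions have already been stripped.

```python
import re

def generate(n):
    # Non-digits, optional leading zeros, the number itself, then non-digits.
    return re.compile(r'^[^\d]*0*' + str(n) + r'[^\d]*$')

names = ['ep 004', 'img0004', 'ep 40', 'ep 104', 'ep 4']
regex = generate(4)
print([f for f in names if regex.fullmatch(f)])
# ['ep 004', 'img0004', 'ep 4']
```

Names such as `ep 40` and `ep 104` are rejected because a digit other than a leading zero sits next to the 4.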

We can use re.search along with a list comprehension for a regex option:
import re
files = ['ep 4', 'xxx 3 ', 'img4', '4xxx', 'ep-40', 'file.mp4', 'file 4.mp4']
num = 4
regex = r'(?<!\d)' + str(num) + r'(?!\d)'
output = [f for f in files if re.search(regex, f)]
print(output) # ['ep 4', 'img4', '4xxx', 'file.mp4', 'file 4.mp4']
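As the printed output shows, the lookaround pattern still matches the 4 in the `.mp4` extension. One way to avoid that, sketched here under the assumption that everything after the last dot is an extension, is to search only the stem returned by `os.path.splitext`; `has_exact_number` is a hypothetical helper name.

```python
import os
import re

def has_exact_number(file_name, num):
    """True if num appears in the stem without a digit on either side."""
    stem, _ext = os.path.splitext(file_name)
    pattern = r'(?<!\d)' + str(num) + r'(?!\d)'
    return re.search(pattern, stem) is not None

files = ['ep 4', 'xxx 3 ', 'img4', '4xxx', 'ep-40', 'file.mp4', 'file 4.mp4']
print([f for f in files if has_exact_number(f, 4)])
# ['ep 4', 'img4', '4xxx', 'file 4.mp4']
```

The trade-off is that a name like `ep.4` is split into `ep` and `.4`, so the number is treated as an extension and missed.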

This can be accomplished with the following function:
import os
files = ["ep 4", "xxx 3 ", "img4", "4xxx", "ep-40", "file.mp4", "file 4.mp4"]
desired_output = ["ep 4", "img4", "4xxx", "file 4.mp4"]
def number_filter(files, number):
    filtered_files = []
    for file_name in files:
        # if the number is not present, we can skip this file
        if file_name.count(str(number)) == 0:
            continue
        # if the number is present in the extension, but not in the file name, we can skip this file
        name, ext = os.path.splitext(file_name)
        if (
            isinstance(ext, str)
            and ext.count(str(number)) > 0
            and isinstance(name, str)
            and name.count(str(number)) == 0
        ):
            continue
        # if the number is present in the file name, we must determine if it's part of a different number
        num_index = file_name.index(str(number))
        # if the number is at the beginning of the file name
        if num_index == 0:
            # check if the next character is a digit
            if file_name[num_index + len(str(number))].isdigit():
                continue
        # if the number is at the end of the file name
        elif num_index == len(file_name) - len(str(number)):
            # check if the previous character is a digit
            if file_name[num_index - 1].isdigit():
                continue
        # if it's somewhere in the middle
        else:
            # check if the previous and next characters are digits
            if (
                file_name[num_index - 1].isdigit()
                or file_name[num_index + len(str(number))].isdigit()
            ):
                continue
        print(file_name)
        filtered_files.append(file_name)
    return filtered_files
output = number_filter(files, 4)
for file in output:
    assert file in desired_output
for file in desired_output:
    assert file in output

Add part of line found after #solution to file

I have a script that puts the line that starts with #Solution 1 in a new file together with the name of the input file. But I want to add the piece belonging to Major from the input file. Can someone please help me to figure out how to get the piece of text?
The script now:
#!/usr/bin/env python3
import os
dr = "/home/nwalraven/Result_pgx/Runfolder/Runres_Aldy"
outdr = "/home/nwalraven/Result_pgx/Runfolder/Aldy_res_txt"
tag = ".aldy"
for f in os.listdir(dr):
    if f.endswith(tag):
        print(f)
        new_file_name = f.split('_')[0]+'.txt' # get the name of the file before the '_' and add '.txt' to it
        with open(dr+"/"+f) as file:
            for line in file.readlines():
                f
                if line.startswith("#Solution 1"):
                    with open(outdr+"/"+new_file_name,"a",newline='\n') as new_file:
                        new_file.write(f.split('.')[0] + "\n")
                        new_file.write(line + "\n")
                if line.startswith("#Solution 2"):
                    with open(outdr+"/"+new_file_name,"a",newline='\n') as new_file:
                        new_file.write(line + "\n")
                        # Dutch for "Multiple solutions found! Check the Aldy file"
                        print("Meerdere oplossingen gevonden! Check Aldy bestand" )
The input:
file = EMQN3-S3_COMT.aldy
#Sample Gene SolutionID Major Minor Copy Allele Location Type Coverage Effect dbSNP Code Status
#Solution 1: *Met, *ValB
EMQN3-S3 COMT 1 *Met/*ValB Met;ValB 0 Met 19950234 C>T 530 H62= rs4633
EMQN3-S3 COMT 1 *Met/*ValB Met;ValB 0 Met 19951270 G>A 651 V158M rs4680
EMQN3-S3 COMT 1 *Met/*ValB Met;ValB 1 ValB
file = EMQN3-S3_CYP2B6.aldy
#Sample Gene SolutionID Major Minor Copy Allele Location Type Coverage Effect dbSNP Code Status
#Solution 1: *1.001, *1.001
EMQN3-S3 CYP2B6 1 *1/*1 1.001;1.001 0 1.001
EMQN3-S3 CYP2B6 1 *1/*1 1.001;1.001 1 1.001
The result it gives right now:
EMQN3-S3_COMT.aldy
#Solution 1: *Met, *ValB
EMQN3-S3_CYP2B6.aldy
#Solution 1: *1.001, *1.001
The result I need:
EMQN3-S3_COMT.aldy
#Solution 1: *Met/*ValB
EMQN3-S3_CYP2B6.aldy
#Solution 1: *1/*1

If you print out the line, you could use a regular expression to replace text before printing it.
On the other hand, if you know it always starts with a fixed number of characters, then it's easier and faster to edit the line manually.
With regex:
# Importing regular expressions
import re
# Setting up regex replacement to replace ", " with "/"
regex = r", "
replacement = "/"
...
# Format the line before writing it
line_formatted = re.sub(regex, replacement, line)
new_file.write(line_formatted + "\n") # edited
...

Try to replace this part of your script:
...
if line.startswith("#Solution 1"):
    with open(outdr+"/"+new_file_name,"a",newline='\n') as new_file:
        new_file.write(f.split('.')[0] + "\n")
        solution = "/".join([x.strip().split(".")[0] for x in line.split(",")])
        new_file.write(solution + "\n")
...
It will do the following:
split the string into two tokens, based on the comma
strip them
remove the decimal part (if any) from the token
rejoin the string using the slash.
Hope it helps.
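The steps listed in this answer can be tried on the two header lines from the question without touching any files. This is only a sketch: `format_solution` is a hypothetical wrapper around the same expression.

```python
def format_solution(line):
    """Split on commas, strip, drop anything after the first '.', rejoin with '/'."""
    return "/".join(token.strip().split(".")[0] for token in line.split(","))

print(format_solution("#Solution 1: *Met, *ValB\n"))     # #Solution 1: *Met/*ValB
print(format_solution("#Solution 1: *1.001, *1.001\n"))  # #Solution 1: *1/*1
```

Note that this works because the `#Solution 1:` prefix contains no period; a period earlier in the first token would cut that token short.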

Bulk rename txt files with different parts using Python

I have a list of files that I wish to rename as follows:
Receipt ABC-001 623572349-1.txt --> Receipt ABC-001A.txt
Receipt ABC-001 623572349-2.txt --> Receipt ABC-001B.txt
However, even at the first step, every time I get the following error: "Cannot create a file when that file already exists:". What would be the best option to achieve the above outcome, where files ending with 1.txt will become A.txt, files ending with 5.txt will become E.txt, and so forth?
Below is the code I have used:
import os, fnmatch
#Set directory of location; include double slash for each subfolder.
file_path = "C:\\Users\\Mr.Slowbro\\Desktop\\TBU\\"
#Set file extension accordingly
files_to_rename = fnmatch.filter(os.listdir(file_path), '*.txt')
for file_name in files_to_rename:
    file_name_new = file_name[-5:5]
    os.rename(file_path + file_name, file_path + file_name_new)

This should help you out. The ord() function returns the Unicode code point of a character, so 'a' would be 97, 'b' would be 98, etc. Likewise, chr() returns the character for a given code point. So, I think the code below will help you with your issue.
import os, fnmatch
#Set directory of location; include double slash for each subfolder.
file_path = "C:\\Users\\Mr.Slowbro\\Desktop\\TBU\\"
#Set file extension accordingly
files_to_rename = fnmatch.filter(os.listdir(file_path), '*.txt')
for file_name in files_to_rename:
    # the character before ".txt" is the digit: 1 -> 'A', 2 -> 'B', ...
    number = chr(int(file_name[-5]) - 1 + ord('A'))
    file_name_new = 'Receipt ABC-001' + number + '.txt'
    os.rename(file_path + file_name, file_path + file_name_new)
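The code above hard-codes the `Receipt ABC-001` prefix. If the prefix varies between files, a regular expression can keep whatever comes before the long number. The following is a minimal sketch assuming names of the form `<prefix> <digits>-<index>.txt`, as in the question; `receipt_new_name` is a hypothetical helper and returns `None` for names that do not fit.

```python
import re

def receipt_new_name(file_name):
    """'Receipt ABC-001 623572349-2.txt' -> 'Receipt ABC-001B.txt', else None."""
    m = re.fullmatch(r'(.+) \d+-(\d+)\.txt', file_name)
    if m is None:
        return None
    index = int(m.group(2))
    if not 1 <= index <= 26:  # only A..Z are available
        return None
    return m.group(1) + chr(ord('A') + index - 1) + '.txt'

print(receipt_new_name('Receipt ABC-001 623572349-1.txt'))  # Receipt ABC-001A.txt
print(receipt_new_name('Receipt ABC-001 623572349-5.txt'))  # Receipt ABC-001E.txt
```

Files for which the helper returns `None` can simply be skipped in the rename loop.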

fnmatch working inconsistently in different scripts

I am working on a python script that will write input files for an analysis program I use. One of the steps is to take a list of filenames and search the input directory for them, open them, and get some information out of them. I wrote the following using os.walk and fnmatch in a test-script that has the directory of interest hard-coded in, and it worked just fine:
for locus in loci_select: # for each locus we'll include
    print("Finding file " + locus)
    for root, dirnames, filenames in os.walk('../phylip_wigeon_mid1_names'):
        for filename in fnmatch.filter(filenames, locus): # look in the input directory
            print("Found file for locus " + locus + " in set")
            loci_file = open(os.path.join('../phylip_wigeon_mid1_names/', filename))
            with loci_file as f:
                for i, l in enumerate(f):
                    pass
            count = (i) * 0.5 # how many individuals present
            print(filename + "has sequences for " + str(count) + " individuals")
...and so on (the other bits all work, so I'll spare you).
As soon as I put this into the larger script and switch out the directory names for input arguments, though, it seems to stop working between the third and fourth lines, despite being nearly identical:
for locus in use_loci: # for each locus we'll include
    log.info("Finding file " + locus)
    for root, dirnames, filenames in os.walk(args.input_dir):
        for filename in fnmatch.filter(filenames, locus): # look in the input directory
            log.info("Found file for locus " + locus + " in set")
            loci_file = open(os.path.join(args.input_dir, filename))
            with loci_file as f:
                for i, l in enumerate(f):
                    pass
            count = (i) * 0.5 # how many individuals present
            log.info(filename + "has sequences for " + str(count) + " individuals")
I've tested it with temporary print statements between the suspected lines, and it seems like they are the culprits, since my screen output looks like:
2015-11-17 15:53:20,505 - write_ima2p_input_file - INFO - Getting selected loci for analysis
2015-11-17 15:53:20,505 - write_ima2p_input_file - INFO - Finding file uce-7999_wigeon_mid1_contigs.phy
2015-11-17 15:53:20,629 - write_ima2p_input_file - INFO - Finding file uce-4686_wigeon_mid1_contigs.phy
2015-11-17 15:53:20,647 - write_ima2p_input_file - INFO - Finding file uce-5012_wigeon_mid1_contigs.phy
...and so on.
I've tried switching out to glob, as well as simple things like rearranging where this section falls in my larger code, but nothing is working. Any insight would be much appreciated!

Automator/Applescript rename files if

I have a large list of images that have been misnamed by my artist. I was hoping to avoid giving him more work by using Automator, but I'm new to it. Right now they're named in order what001a and what002a, but that should be what001a and what001b. So basically the odd-numbered files are A and the even-numbered files are B. I need a script that changes the even-numbered images to B and renumbers them all sequentially. How would I go about writing that script?

A small Ruby script embedded in an AppleScript provides a very comfortable solution, allowing you to select the files to rename right in Finder and displaying an informative success or error message.
The algorithm renames files as follows:
number = first 3 digits in filename # e.g. "006"
letter = the letter following those digits # e.g. "a"
if number is even, change letter to its successor # e.g. "b"
number = (number + 1)/2 # 5 or 6 => 3
replace number and letter in filename
And here it is:
-- ask for files
set filesToRename to choose file with prompt "Select the files to rename" with multiple selections allowed
-- prepare ruby command
set ruby_script to "ruby -e \"s=ARGV[0]; m=s.match(/(\\d{3})(\\w)/); n=m[1].to_i; a=m[2]; a.succ! if n.even?; r=sprintf('%03d',(n+1)/2)+a; puts s.sub(/\\d{3}\\w/,r);\" "
tell application "Finder"
    -- process files, record errors
    set counter to 0
    set errors to {}
    repeat with f in filesToRename
        try
            do shell script ruby_script & (f's name as text)
            set f's name to result
            set counter to counter + 1
        on error
            copy (f's name as text) to the end of errors
        end try
    end repeat
    -- display report
    set msg to (counter as text) & " files renamed successfully!\n"
    if errors is not {} then
        set AppleScript's text item delimiters to "\n"
        set msg to msg & "The following files could NOT be renamed:\n" & (errors as text)
        set AppleScript's text item delimiters to ""
    end if
    display dialog msg
end tell
Note that it will fail when the filename contains spaces.
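The renaming rule described in this answer can also be written as a small pure function, which makes it easy to check before renaming anything. This Python sketch is not the answer's Ruby code: `renumber` is a hypothetical name, and it advances the letter with `chr(ord(...) + 1)`, which differs from Ruby's `succ` for the letter `z`.

```python
import re

def renumber(filename):
    """Pair up consecutive numbers: 001a, 002a -> 001a, 001b; 003a, 004a -> 002a, 002b."""
    match = re.search(r"(\d{3})(\w)", filename)
    if match is None:
        return filename  # nothing to rename
    number = int(match.group(1))
    letter = match.group(2)
    if number % 2 == 0:
        letter = chr(ord(letter) + 1)  # even numbers take the next letter
    new_number = (number + 1) // 2     # 5 or 6 -> 3
    replacement = f"{new_number:03d}{letter}"
    return filename[:match.start()] + replacement + filename[match.end():]

for name in ["what001a.png", "what002a.png", "what005a.png", "what006a.png"]:
    print(name, "->", renumber(name))
```

Since the function only works on strings, spaces in the file name are not a problem here.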

A friend of mine wrote a Python script to do what I needed. I figured I'd post it here as an answer for anyone who stumbles upon a similar problem. It is in Python, though, so if anyone wants to convert it to AppleScript for those who may need it, go for it.
import os
import re
import shutil
def toInt(text):
    try:
        return int(text)
    except ValueError:
        return 0
filePath = "./"
extension = "png"
dirList = os.listdir(filePath)
regx = re.compile("[0-9]+a")
for filename in dirList:
    ext = filename[-len(extension):]
    if ext != extension: continue
    rslts = regx.search(filename)
    if rslts is None: continue
    pieces = regx.split(filename)
    if len(pieces) < 2: pieces.append("")
    filenumber = toInt(rslts.group(0).rstrip("a"))
    # integer division, so that 1 and 2 map to 1, 3 and 4 map to 2, ...
    newFileNum = (filenumber + 1) // 2
    fileChar = "b"
    if filenumber % 2: fileChar = "a"
    newFileName = "%s%03d%s%s" % (pieces[0], newFileNum, fileChar, pieces[1])
    shutil.move("%s%s" % (filePath, filename), "%s%s" % (filePath, newFileName))
