Remove punctuation from file name while keeping file extension intact

I would like to remove all punctuation from a filename but keep its file extension intact.
For example, I want:
Flowers.Rose-Murree-[25.10.11].jpg
Time.Square.New-York-[20.7.09].png
to look like:
Flowers Rose Murree 25 10 11.jpg
Time Square New York 20 7 09.png
I'm trying this in Python:
re.sub(r'[^A-Za-z0-9]', ' ', filename)
But that produces:
Flowers Rose Murree 25 10 11 jpg
Time Square New York 20 7 09 png
How do I remove the punctuation but keep the file extension?

There's only one right way to do this:
1. Use os.path.splitext to get the filename and the extension.
2. Do whatever processing you want to the filename.
3. Concatenate the new filename with the extension.
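
The three steps above can be sketched as follows. This is a minimal illustration rather than the answerer's own code: the function name clean_filename and the choice to collapse each run of punctuation into a single space are assumptions, made so that the result matches the output requested in the question.

```python
import os
import re


def clean_filename(filename):
    """Replace punctuation in the name part with spaces; keep the extension."""
    # Step 1: split off the extension (the last dot and what follows it).
    name, ext = os.path.splitext(filename)
    # Step 2: turn each run of non-alphanumeric characters into one space.
    # Collapsing runs and stripping the ends is an assumption, chosen so
    # that "-[" or a trailing "]" does not leave doubled or dangling spaces.
    name = re.sub(r'[^A-Za-z0-9]+', ' ', name).strip()
    # Step 3: put the untouched extension back.
    return name + ext


print(clean_filename("Flowers.Rose-Murree-[25.10.11].jpg"))
print(clean_filename("Time.Square.New-York-[20.7.09].png"))
```

Because the extension never passes through the substitution, no special-casing of the final dot is needed.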

You could use a negative lookahead, that asserts that you are not dealing with a dot that is only followed by digits and letters:
re.sub(r'(?!\.[A-Za-z0-9]*$)[^A-Za-z0-9]', ' ', filename)
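
To see what the lookahead does in practice, here is a small sketch that wraps the pattern from this answer in a helper; the helper name is hypothetical. Each punctuation character becomes its own space, so adjacent punctuation such as "-[" leaves two spaces and the "]" leaves a space before the extension unless the result is normalized separately.

```python
import re

# Pattern from the answer above: match any non-alphanumeric character,
# unless it is a dot followed only by letters/digits up to the end.
PATTERN = re.compile(r'(?!\.[A-Za-z0-9]*$)[^A-Za-z0-9]')


def replace_punctuation(filename):
    """Hypothetical helper: apply the lookahead-based substitution."""
    return PATTERN.sub(' ', filename)


result = replace_punctuation("Flowers.Rose-Murree-[25.10.11].jpg")
print(repr(result))
```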

I suggest replacing each occurrence of [\W_](?=.*\.) with a space.

See if this works for you. You can actually do it without regex:
>>> import os, string
>>> fname = "Flowers.Rose-Murree-[25.10.11].jpg"
>>> name, ext = os.path.splitext(fname)
>>> # Python 3 form; in Python 2 this was name.translate(None, string.punctuation)
>>> name = name.translate(str.maketrans('', '', string.punctuation))
>>> name += ext
>>> name
'FlowersRoseMurree251011.jpg'

@katrielalex beat me to this type of answer, but anyway, here is a regex-free solution:
In [23]: f = "/etc/path/fred.apple.png"
In [24]: path, filename = os.path.split(f)
In [25]: main, suffix = os.path.splitext(filename)
In [26]: newname = os.path.join(path,''.join(c if c.isalnum() else ' ' for c in main) + suffix)
In [27]: newname
Out[27]: '/etc/path/fred apple.png'

Related

Python - Check for exact string in file name

I have a folder where each file is named after a number (e.g. img 1, img 2, img-3, 4-img, etc.). I want to get files by exact string, so if I enter '4' as an input, it should only return files with '4' and not files containing '14' or '40', for example. My problem is that the program returns every file that contains the string. Note that the numbers aren't always in the same spot (for some files it's at the end, for others it's in the middle).
For instance, if my folder has the files ['ep 4', 'xxx 3 ', 'img4', '4xxx', 'ep-40', 'file.mp4', 'file 4.mp4', 'ep.4.', 'ep.4 ', 'ep. 4. ', 'ep4xxx', 'ep 4 ', '404ep'], and I want only files with the exact number 4 in them, then I would want to return only ['ep 4', 'img4', '4xxx', 'file 4.mp4', 'ep.4.', 'ep.4 ', 'ep. 4. ', 'ep4xxx', 'ep 4 ', '404ep'].
Here is what I have (in this case I only want to return mp4 files):
for (root, dirs, file) in os.walk(source_folder):
    for f in file:
        if '.mp4' and ('4') in f:
            print(f)
I also tried == instead of in.

Judging by your inputs, your desired regular expression needs to meet the following criteria:
Match the number provided, exactly
Ignore number matches in the file extension, if present
Handle file names that include spaces
I think this will meet all these requirements:
import re
def generate(n):
    return re.compile(r'^[^.\d]*' + str(n) + r'[^.\d]*(\..*)?$')
def check_files(n, files):
    regex = generate(n)
    return [f for f in files if regex.fullmatch(f)]
Usage:
>>> check_files(4, ['ep 4', 'xxx 3 ', 'img4', '4xxx', 'ep-40', 'file.mp4', 'file 4.mp4'])
['ep 4', 'img4', '4xxx', 'file 4.mp4']
Note that this solution involves creating a Pattern object and using that object to check each file. This strategy offers a performance benefit over calling re.fullmatch with the pattern and filename directly, as the pattern does not have to be compiled for each call.
This solution does have one drawback: it assumes that filenames are formatted as name.extension and that the value you're searching for is in the name part. Because of the greedy nature of regular expressions, if you allow for file names with . then you won't be able to exclude extensions from the search. Ergo, modifying this to match ep.4 would also cause it to match file.mp4. That being said, there is a workaround for this, which is to strip extensions from the file name before doing the match:
def generate(n):
    return re.compile(r'^[^\d]*' + str(n) + r'[^\d]*$')
def strip_extension(f):
    return f.removesuffix('.mp4')
def check_files(n, files):
    regex = generate(n)
    return [f for f in files if regex.fullmatch(strip_extension(f))]
Note that this solution now includes the . in the match condition and does not exclude an extension. Instead, it relies on preprocessing (the strip_extension function) to remove any file extensions from the filename before matching.
As an addendum, occasionally you'll get files that have the number prefixed with zeroes (e.g. 004, 0001, etc.). You can modify the regular expression to handle this case as well:
def generate(n):
    return re.compile(r'^[^\d]*0*' + str(n) + r'[^\d]*$')

We can use re.search along with a list comprehension for a regex option:
files = ['ep 4', 'xxx 3 ', 'img4', '4xxx', 'ep-40', 'file.mp4', 'file 4.mp4']
num = 4
regex = r'(?<!\d)' + str(num) + r'(?!\d)'
output = [f for f in files if re.search(regex, f)]
print(output) # ['ep 4', 'img4', '4xxx', 'file.mp4', 'file 4.mp4']
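
The lookaround pattern above still reports 'file.mp4', because the 4 in the extension is not adjacent to another digit. One possible way to exclude such hits is to run the same search on the stem returned by os.path.splitext. This sketch assumes that everything after the last dot is an extension, and the function name is hypothetical.

```python
import os
import re


def files_with_number(files, num):
    """Return names whose stem contains num as a standalone number."""
    # Same lookarounds as above: no digit directly before or after num.
    pattern = re.compile(r'(?<!\d)' + re.escape(str(num)) + r'(?!\d)')
    matches = []
    for f in files:
        # Assumption: the text after the last dot is an extension, so it
        # is dropped before searching. Names such as 'ep.4 ' would lose
        # their number under this assumption.
        stem, _ext = os.path.splitext(f)
        if pattern.search(stem):
            matches.append(f)
    return matches


files = ['ep 4', 'xxx 3 ', 'img4', '4xxx', 'ep-40', 'file.mp4', 'file 4.mp4']
print(files_with_number(files, 4))
```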

This can be accomplished with the following function:
import os
files = ["ep 4", "xxx 3 ", "img4", "4xxx", "ep-40", "file.mp4", "file 4.mp4"]
desired_output = ["ep 4", "img4", "4xxx", "file 4.mp4"]
def number_filter(files, number):
    filtered_files = []
    for file_name in files:
        # if the number is not present, we can skip this file
        if file_name.count(str(number)) == 0:
            continue
        # if the number is present in the extension, but not in the file name, we can skip this file
        name, ext = os.path.splitext(file_name)
        if (
            isinstance(ext, str)
            and ext.count(str(number)) > 0
            and isinstance(name, str)
            and name.count(str(number)) == 0
        ):
            continue
        # if the number is present in the file name, we must determine if it's part of a different number
        num_index = file_name.index(str(number))
        # if the number is at the beginning of the file name
        if num_index == 0:
            # check if the next character is a digit
            if file_name[num_index + len(str(number))].isdigit():
                continue
        # if the number is at the end of the file name
        elif num_index == len(file_name) - len(str(number)):
            # check if the previous character is a digit
            if file_name[num_index - 1].isdigit():
                continue
        # if it's somewhere in the middle
        else:
            # check if the previous and next characters are digits
            if (
                file_name[num_index - 1].isdigit()
                or file_name[num_index + len(str(number))].isdigit()
            ):
                continue
        print(file_name)
        filtered_files.append(file_name)
    return filtered_files
output = number_filter(files, 4)
for file in output:
    assert file in desired_output
for file in desired_output:
    assert file in output

python regex: Parsing file name

I have a text file (filenames.txt) that contains file names with their extensions.
filename.txt
[AW] One Piece - 629 [1080P][Dub].mkv
EP.585.1080p.mp4
EP609.m4v
EP 610.m4v
One Piece 0696 A Tearful Reunion! Rebecca and Kyros!.mp4
One_Piece_0745_Sons'_Cups!.mp4
One Piece - 591 (1080P Funi Web-Dl -Ks-)-1.m4v
One Piece - 621 1080P.mkv
One_Piece_S10E577_Zs_Ambition_A_Great_and_Desperate_Escape_Plan.mp4
These are example file names with their extensions. I need to rename each file to its episode number (without changing its extension).
Example:
Input:
EP609.m4v
EP 610.m4v
EP.585.1080p.mp4
One Piece - 621 1080P.mkv
[AW] One Piece - 629 [1080P][Dub].mkv
One_Piece_0745_Sons'_Cups!.mp4
One Piece 0696 A Tearful Reunion! Rebecca and Kyros!.mp4
One Piece - 591 (1080P Funi Web-Dl -Ks-)-1.m4v
One_Piece_S10E577_Zs_Ambition_A_Great_and_Desperate_Escape_Plan.mp4
Expected Output:
609.m4v
610.m4v
585.mp4
621.mkv
629.mkv
745.mp4 (or) 0745.mp4
696.mp4 (or) 0696.mp4
591.m4v
577.mp4
Hope someone will help me parse and rename these filenames. Thanks in advance!!!

As you tagged python, I guess you are willing to use python.
(Edit: I've realized a loop in my original code is unnecessary.)
import re
with open('filename.txt', 'r') as f:
    files = f.read().splitlines()  # read filenames
# assume: an episode number consists of 3 digits, possibly preceded by 0
p = re.compile(r'0?(\d{3})')
for file in files:
    if m := p.search(file):
        print(m.group(1) + '.' + file.split('.')[-1])
    else:
        print(file)
This will output
609.m4v
610.m4v
585.mp4
621.mkv
629.mkv
745.mp4
696.mp4
591.m4v
577.mp4
Basically, it searches for the first 3-digit number, possibly preceded by 0.
I strongly advise you to check the output; in particular, you would want to run sort OUTPUTFILENAME | uniq -d to see whether there are duplicate target names.
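
The same duplicate check can be done in Python before anything is renamed. A minimal sketch, assuming the target names have already been collected into a list; the function name and the sample list are illustrative only.

```python
from collections import Counter


def duplicate_targets(target_names):
    """Return the target names that occur more than once, sorted."""
    counts = Counter(target_names)
    return sorted(name for name, count in counts.items() if count > 1)


targets = ["609.m4v", "610.m4v", "585.mp4", "609.m4v"]
print(duplicate_targets(targets))
```

An empty result means every file would get a distinct name.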
(Original answer:)
p = re.compile(r'\d{3,4}')
for file in files:
    for m in p.finditer(file):
        ep = m.group(0)
        if int(ep) < 1000:
            print(ep.lstrip('0') + '.' + file.split('.')[-1])
            break  # go to next file if ep found (avoid the else clause)
    else:  # if ep not found, just print the filename as is
        print(file)

A program to parse the episode number and rename the file.
Modules used:
re - to parse the file name
os - to rename the file
full/path/to/folder is the path to the folder where your files live.
import re
import os
for file in os.listdir(path="full/path/to/folder/"):
    # searches for the first 3 or 4 digit number less than 1000 in each file name.
    for match_obj in re.finditer(r'\d{3,4}', file):
        episode = match_obj.group(0)
        if int(episode) < 1000:
            new_filename = episode.lstrip('0') + '.' + file.split('.')[-1]
            old_name = "full/path/to/folder/" + file
            new_name = "full/path/to/folder/" + new_filename
            os.rename(old_name, new_name)
            # go to next file if ep found (avoid the else clause)
            break
    else:
        # if episode not found, just leave the filename as it is
        pass

Accessing a file path saved in a .txt file. (Python)

I have a text file that contains file paths of files that I wish to open.
The text file looks like this:
28.2 -1.0 46 14 10 .\temp_109.17621\voltage_28.200\power_-1.txt
28.2 -2.0 46 16 10 .\temp_109.17621\voltage_28.200\power_-2.txt
...
I would like to open the files at this filepath.
First step is to load each filepath from the text file.
I've tried this using:
path = np.loadtxt('NonLorentzianData.txt',usecols=[5],dtype='S16')
which generates a path[1] that looks like:
.\\temp_109.17621
...
rather than the entire file path.
Am I using the wrong dtype or is this not possible with loadtxt?

You specified S16 as the dtype, so the returned string was truncated to length 16: .\\temp_109.17621 (\\ is an escaped \).
Try np.genfromtxt with dtype=None, or use a sufficiently long dtype such as 'S45' in your case.
Inspired by post

If you change the data type to np.str_ it will work:
path = np.loadtxt('NonLorentzianData.txt',usecols=[5],dtype=np.str_)
print(path[1])
.\temp_109.17621\voltage_28.200\power_-2.txt
Using dtype=("S44") will also work, since 44 is the length of the longer of your two paths.
You are specifying a 16 character string so you only get the first 16 characters.
In [17]: s = ".\\temp_109.17621"
In [18]: len(s)
Out[18]: 16
# 43 character string
In [26]: path = np.loadtxt('words.txt',usecols=[5],dtype=("S43"))
In [27]: path[1]
Out[27]: '.\\temp_109.17621\\voltage_28.200\\power_-2.tx'
In [28]: len(path[1])
Out[28]: 43
# 38 character string
In [29]: path = np.loadtxt('words.txt',usecols=[5],dtype=("S38"))
In [30]: path[1]
Out[30]: '.\\temp_109.17621\\voltage_28.200\\power_'
In [31]: len(path[1])
Out[31]: 38
In [32]: path = np.loadtxt('words.txt',usecols=[5],dtype=np.str_)
In [33]: path[1]
Out[33]: '.\\temp_109.17621\\voltage_28.200\\power_-2.txt'
If you look at the docs you will see what every dtype does and how to use them.
If you just want all the file paths you can also use csv.reader:
import csv
with open("NonLorentzianData.txt") as f:
    reader = csv.reader(f,delimiter=" ")
    for row in reader:
        with open(row[-1]) as f:
            .....
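
If the paths contain no spaces, the path is simply the last whitespace-separated field of each line, so plain str.split is enough and also tolerates repeated spaces between columns. A sketch under that assumption; io.StringIO stands in for the real file so the example is self-contained, and the function name is hypothetical.

```python
import io


def read_paths(lines):
    """Yield the last whitespace-separated field of each non-empty line."""
    for line in lines:
        fields = line.split()
        if fields:  # skip blank lines
            yield fields[-1]


# Stand-in for open("NonLorentzianData.txt"); raw strings keep the backslashes.
sample = io.StringIO(
    r"28.2 -1.0 46 14 10 .\temp_109.17621\voltage_28.200\power_-1.txt" "\n"
    r"28.2 -2.0 46 16 10 .\temp_109.17621\voltage_28.200\power_-2.txt" "\n"
)
paths = list(read_paths(sample))
print(paths[1])
```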

Getting NoneType Error When Using Regex to Change Filenames in Python

I'm trying to change a bunch of filenames using regex groups but can't seem to get it to work (despite writing what regexr.com tells me should be a valid regex statement). The 93,000 files I currently have all look something like this:
Mr. McCONNELL.2012-07-31.2014sep19_at_182325.txt
Mrs. HAGAN.2012-12-06.2014sep19_at_182321.txt
Ms. MURRAY.2012-06-18.2014sep19_at_182246.txt
The PRESIDING OFFICER.2012-12-06.2014sep19_at_182320.txt
And I want them to look like this:
20120731McCONNELL2014sep19_at_182325.txt
And ignore any file that starts with anything other than Mr., Mrs., and Ms.
But every time I run the script below, I get the following error:
Traceback (most recent call last):
File "changefilenames.py", line 11, in <module>
date = m.group(2)
AttributeError: 'NoneType' object has no attribute 'group'
Thanks so much for your help. My apologies if this is a silly question. I'm just starting with RegEx and Python and can't seem to figure this one out.
import io
import os
import re
from dateutil.parser import parse
for filename in os.listdir("/Users/jolijttamanaha/Desktop/thesis2/Republicans/CRspeeches"):
    if filename.startswith("Mr."):
        m = re.search("Mr.\s(\w*).(\d\d\d\d\-\d\d\-\d\d).(\w*).txt", filename)
        date = m.group(2)
        name = m.group(1)
        timestamp = m.group(3)
        dt = parse(date)
        new_filename = "{dt.year}.{dt.month}.{dt.day}".format(dt=dt) + name + timestamp + ".txt"
        os.rename(filename, new_filename)
        print new_filename
        print "All done with the Mr"
    if filename.startswith("Mrs."):
        m = re.search("Mrs.\s(\w*).(\d\d\d\d\-\d\d\-\d\d).(\w*).txt", filename)
        date = m.group(2)
        name = m.group(1)
        timestamp = m.group(3)
        dt = parse(date)
        new_filename = "{dt.year}.{dt.month}.{dt.day}".format(dt=dt) + name + timestamp + ".txt"
        os.rename(filename, new_filename)
        print new_filename
        print "All done with the Mrs"
    if filename.startswith("Ms."):
        m = re.search("Ms.\s(\w*).(\d\d\d\d\-\d\d\-\d\d).(\w*).txt", filename)
        date = m.group(2)
        name = m.group(1)
        timestamp = m.group(3)
        dt = parse(date)
        new_filename = "{dt.year}.{dt.month}.{dt.day}".format(dt=dt) + name + timestamp + ".txt"
        os.rename(filename, new_filename)
        print new_filename
        print "All done with the Mrs"
I've made the adjustments suggested in Using Regex to Change Filenames with Python but still no luck.
EDIT: Made the following changes based on answer below:
for filename in os.listdir("/Users/jolijttamanaha/Desktop/thesis2/Republicans/CRspeeches"):
    if filename.startswith("Mr."):
        print filename
        m = re.search("^Mr.\s(\w*).(\d\d\d\d\-\d\d\-\d\d).(\w*).txt", filename)
        if m:
            date = m.group(2)
            name = m.group(1)
            timestamp = m.group(3)
            dt = parse(date)
            new_filename = "{dt.year}.{dt.month}.{dt.day}".format(dt=dt) + name + timestamp + ".txt"
            os.rename(filename, new_filename)
            print new_filename
            print "All done with the Mr"
And it spit out this:
Mr. Adams was right.2009-05-18.2014sep17_at_22240.txt
Mr. ADAMS.2009-12-16.2014sep18_at_223650.txt
Traceback (most recent call last):
File "changefilenames.py", line 19, in <module>
os.rename(filename, new_filename)
OSError: [Errno 2] No such file or directory

You are passing bare file names to os.rename, probably with missing paths.
Consider the following layout:
yourscript.py
subdir/
- one
- two
This is similar to your code:
import os
for fn in os.listdir('subdir'):
    print(fn)
    os.rename(fn, fn + '_moved')
and it throws an exception (somewhat nicer in Python 3):
FileNotFoundError: [Errno 2] No such file or directory: 'two' -> 'two_moved'
because in the current working directory, there is no file named two. But consider this:
import os
for fn in os.listdir('subdir'):
    print(fn)
    os.rename(os.path.join('subdir',fn), os.path.join('subdir', fn+'_moved'))
This works, because the full path is used. Instead of using 'subdir' again and again (or in a variable), you should perhaps change the working directory as a first step:
import os
os.chdir('subdir')
for fn in os.listdir():
    print(fn)
    os.rename(fn, fn + '_moved')

After you do a search, you'll always want to make sure you have a match before doing any processing. It looks like you may have a file that starts with 'Mr.' but doesn't match your expression in general.
if filename.startswith("Mr."):
    m = re.search("Mr.\s(\w*).(\d\d\d\d\-\d\d\-\d\d).(\w*).txt", filename)
    if m:  # Only look at groups if we have a match.
        date = m.group(2)
        name = m.group(1)
        ....
I would also suggest not using startswith('Mr.') and regex at the same time, since your regex should already only work on strings that start with 'Mr.', though you may want to add a '^' to the beginning of the regex to enforce this:
m = re.search("^Mr.\s(\w*).(\d\d\d\d\-\d\d\-\d\d).(\w*).txt", filename)
if m:  # ^ added caret to signify start of string.
    date = m.group(2)
    name = m.group(1)
    ...
Additionally, you may want to verify what files you are not matching, since with that much data, you will often run into problems like extra whitespace or improper case, so you may want to look into making your regex more robust.
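
As one possible way to make the pattern more robust, the sketch below uses a single anchored pattern for Mr., Mrs. and Ms., escapes the literal dots, and returns None for names that do not match instead of raising. The function name is hypothetical, and building the date by concatenating the captured digits (rather than going through dateutil) is an assumption based on the desired output shown in the question.

```python
import re

# Escaped dots match a literal '.', and ^...$ anchors the whole name.
PATTERN = re.compile(
    r'^(?:Mr|Mrs|Ms)\.\s(\w+)\.(\d{4})-(\d{2})-(\d{2})\.(\w+)\.txt$'
)


def build_new_name(filename):
    """Return the new file name, or None when the name does not match."""
    m = PATTERN.match(filename)
    if m is None:  # always check before touching the groups
        return None
    name, year, month, day, timestamp = m.groups()
    return year + month + day + name + timestamp + '.txt'


for fn in [
    "Mr. McCONNELL.2012-07-31.2014sep19_at_182325.txt",
    "Mr. Adams was right.2009-05-18.2014sep17_at_22240.txt",
    "The PRESIDING OFFICER.2012-12-06.2014sep19_at_182320.txt",
]:
    print(fn, '->', build_new_name(fn))
```

Collecting the names for which the function returns None gives a list of files to inspect by hand.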

Python Regex or Filename Function

I have a question about renaming files in a folder. My file name looks like this:
EPG CRO 24 Kitchen 09.2013.xsl
It has spaces in the name, and I used code like this:
#!/usr/bin/python
# -*- coding: utf-8 -*-
# Remove whitespace from files where EPG named with space " " replace with "_"
for filename in os.listdir("."):
    if filename.find("2013|09 ") > 0:
        newfilename = filename.replace(" ","_")
        os.rename(filename, newfilename)
With this code I replaced the whitespace, but how can I remove the date from the file name so that it looks like this: EPG_CRO_24_Kitchen.xsl? Can you suggest a solution?

Regex
As utdemir was alluding to, regular expressions can really help in situations like these. If you have never been exposed to them, they can be confusing at first. Check out https://www.debuggex.com/r/4RR6ZVrLC_nKYs8g for a useful tool that helps you construct regular expressions.
Solution
An updated solution would be:
import os
import re
def rename_file(filename):
    if filename.startswith('EPG') and ' ' in filename:
        # \s+ means 1 or more whitespace characters
        # [0-9]{2} means exactly 2 characters of 0 through 9
        # \. means find a '.' character
        # [0-9]{4} means exactly 4 characters of 0 through 9
        newfilename = re.sub(r"\s+[0-9]{2}\.[0-9]{4}", '', filename)
        newfilename = newfilename.replace(" ","_")
        os.rename(filename, newfilename)
Side Note
# Remove whitespace from files where EPG named with space " " replace with "_"
for filename in os.listdir("."):
    if filename.find("2013|09 ") > 0:
        newfilename = filename.replace(" ","_")
        os.rename(filename, newfilename)
Unless I'm mistaken, filename.find("2013|09 ") > 0 from the code you posted above won't work, because str.find looks for the literal text "2013|09 " rather than treating | as "or".
Given the following:
In [76]: filename = "EPG CRO 24 Kitchen 09.2013.xsl"
In [77]: filename.find("2013|09 ")
Out[77]: -1
Given the intent described in your comment, you might want something more like:
In [80]: if filename.startswith('EPG') and ' ' in filename:
....: print('process this')
....:
process this

If all file names have the same format (NAME XX.20XX.xsl), then you can use Python's string slicing instead of regex:
name.replace(' ','_')[:-12] + '.xsl'

If dates are always formatted the same:
>>> s = "EPG CRO 24 Kitchen 09.2013.xsl"
>>> re.sub("\s+\d{2}\.\d{4}\..{3}$", "", s)
'EPG CRO 24 Kitchen'
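
Note that the substitution above removes the extension along with the date. If the extension should stay, one option is a lookahead that requires the final extension to follow the date without consuming it. A sketch with a hypothetical function name, which also replaces the remaining spaces with underscores as the question asks:

```python
import re


def strip_date(filename):
    """Drop a trailing ' MM.YYYY' before the extension; spaces become '_'."""
    # (?=\.[^.]+$) requires the final extension to follow but leaves it in place.
    without_date = re.sub(r"\s+\d{2}\.\d{4}(?=\.[^.]+$)", "", filename)
    return without_date.replace(" ", "_")


print(strip_date("EPG CRO 24 Kitchen 09.2013.xsl"))
```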

How about a little slicing:
newfilename = input1[:input1.rfind(" ")].replace(" ","_")+input1[input1.rfind("."):]
