Rename all xml files within a given directory with Python


I have a lot of XML files which are named like:
First_ExampleXML_Only_This_Should_Be_Name_20211234567+1234565.xml
Second_ExampleXML_OnlyThisShouldBeName_202156789+55684894.xml
Third_ExampleXML_Only_This_Should_Be_Name1_2021445678+6963696.xml
Fourth_ExampleXML_Only_This_Should_Be_Name2_20214567+696656.xml
I have to make a script that will go through all of the files and rename them, so only this is left from the example:
Only_This_Should_Be_Name.xml
OnlyThisShouldBeName.xml
Only_This_Should_Be_Name1.xml
Only_This_Should_Be_Name2.xml
At the moment I have something like this, but I am really struggling to get exactly what I need. I guess I have to count from the second _ up to _202 and take everything in between.
from os import listdir
fnames = listdir('.')
for fname in fnames:
    # replace .xml with type of file you want this to have impact on
    if fname.endswith('.xml'):
        pass  # this is where I am stuck: how do I build the new name?
Does anyone have an idea what the best approach would be?

For all XML files, you can split the name on underscores, drop the first two fields and the last one, and rename the file to what is left joined back together, as below.
import os
fnames = os.listdir('.')
for fname in fnames:
    # replace .xml with type of file you want this to have impact on
    if fname.endswith('.xml'):
        newName = '_'.join(fname.split("_")[2:-1])
        os.rename(fname, newName + ".xml")
Here the slice [2:-1] discards the two fields before the wanted name and the numeric field after it.
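To check the slicing before touching any files, the same split-and-join can be pulled into a small helper and run on the sample names from the question. This is only a sketch: `strip_affixes` is a made-up name, and it assumes every file name has an extension, two leading underscore-separated fields and one trailing field to drop.

```python
def strip_affixes(filename):
    """Drop the first two '_'-separated fields and the last one, keeping the extension."""
    stem, dot, ext = filename.rpartition(".")
    middle = stem.split("_")[2:-1]
    return "_".join(middle) + dot + ext


samples = [
    "First_ExampleXML_Only_This_Should_Be_Name_20211234567+1234565.xml",
    "Second_ExampleXML_OnlyThisShouldBeName_202156789+55684894.xml",
    "Third_ExampleXML_Only_This_Should_Be_Name1_2021445678+6963696.xml",
]
for name in samples:
    print(name, "->", strip_affixes(name))
```

Once the printed names look right, the helper's result could be passed to os.rename in place of the inline expression.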

There are two problems here:
Finding files of one kind in the directory
Whilst listdir will work, you might as well glob them:
from pathlib import Path
for fn in Path("/path").glob("*.xml"):
    ...
Renaming files
In this case your files are named "file_name_NUMBERS.xml" and we want to strip the numbers out, so we'll use a regex. (Edit: this is not the best way in this case. Just split and join as in the other answer.)
import re
from pathlib import Path
for fn in Path("dir").glob("*.xml"):
    match = re.search(r"(.*?)_[0-9]+", fn.stem)
    if match:  # skip names that have no _NUMBERS part
        fn.rename(fn.with_name(match.group(1) + ".xml"))
Edit: I don't know why I overcomplicated things. I'll leave the re solution there for more difficult cases, but in this case you can just do:
new_name = "_".join(fn.stem.split("_")[:-1])
Which is greatly superior as it doesn't depend on the precise naming of the files. To also drop the two leading fields, as in the question's examples, slice with [2:-1] instead.
Note that you can do all this without pathlib, but you asked for the best way ;)
Lastly, to answer an implicit question, nothing stops you wrapping all this in a function and passing an argument to glob for different types of files.
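As a rough illustration of that last point, the loop could be wrapped like this. The function name `rename_stripped`, the `dry_run` flag and the choice to skip existing targets are assumptions made for this sketch, not part of the answer above.

```python
from pathlib import Path


def rename_stripped(directory, pattern="*.xml", dry_run=True):
    """Rename files matching `pattern`, dropping the last '_'-separated field of the stem.

    Returns a list of (old_name, new_name) pairs. With dry_run=True nothing is renamed.
    """
    changes = []
    for fn in sorted(Path(directory).glob(pattern)):
        new_stem = "_".join(fn.stem.split("_")[:-1])
        if not new_stem:
            continue  # no underscore in the name, nothing sensible to rename to
        target = fn.with_name(new_stem + fn.suffix)
        if target.exists():
            continue  # never overwrite an existing file
        changes.append((fn.name, target.name))
        if not dry_run:
            fn.rename(target)
    return changes
```

Calling `rename_stripped("dir")` first shows what would change; `rename_stripped("dir", "*.txt", dry_run=False)` would then apply the same rule to text files.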

I think regex will be the simplest approach here, which in python can be accomplished with the re module.
import os
import re
fnames = os.listdir('.')
for fname in fnames:
    result = re.sub(r"^.*?_ExampleXML_(.*?)_[\d+]+\.xml$", r"\1.xml", fname)
    if result != fname:
        os.rename(fname, result)
There are several pattern matching strategies you could employ, depending on your use case.
For instance you could try variants like the following, depending on how specific/general you need to be:
^.*?_ExampleXML_(.*?)_\d+\.xml$ (https://regex101.com/r/hYOLMF/1)
^.*?_ExampleXML_(.*?)_2021\d+\.xml$ (https://regex101.com/r/UzEsbO/1)
^.*?_ExampleXML_(.*?)_[^_]+\.xml$ (https://regex101.com/r/lKzYhq/1)
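Whichever variant you pick, it may be worth running it over a few names before renaming anything. The sketch below does that for the pattern used in the code above; `new_name` is a hypothetical helper name and the sample list is taken from the question.

```python
import re

# Pattern from the code above: the final field may contain digits and '+' characters.
PATTERN = re.compile(r"^.*?_ExampleXML_(.*?)_[\d+]+\.xml$")


def new_name(fname):
    """Return the shortened name, or the input unchanged if the pattern does not match."""
    return PATTERN.sub(r"\1.xml", fname)


for name in [
    "Third_ExampleXML_Only_This_Should_Be_Name1_2021445678+6963696.xml",
    "Fourth_ExampleXML_Only_This_Should_Be_Name2_20214567+696656.xml",
    "unrelated.xml",
]:
    print(name, "->", new_name(name))
```

Names that do not match come back unchanged, which is what the `result != fname` check in the loop relies on.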

Related

Deleting the useless output files using Python

After I execute a python script from a particular directory, I get many output files but apart from 5-6 files I want to delete the rest from that directory. What I have done is, I have taken those 5-6 useful files inside a list and deleted all the other files which are not there in that list. Below is my code:
list1 = ['prog_1.py', 'prog_2.py', 'prog_3.py']  # Extend
import os
dir = '/home/dev/codes'  # Change accordingly
for f in os.listdir(dir):
    if f not in list1:
        os.remove(os.path.join(dir, f))
Now here I just want to add one more thing, if the output files start with output_of_final, then I don't want them to be deleted. How can I do it? Should I use regex?

You could use Regex, but that's overkill here. Just use the str.startswith method.
Also, it's bad practice to use reserved keywords, built-in types and functions as variable names. I have renamed dir to directory. (https://docs.python.org/3/library/functions.html#dir)
list1 = ['prog_1.py', 'prog_2.py', 'prog_3.py']  # Extend
import os
directory = '/home/dev/codes'  # Change accordingly
for f in os.listdir(directory):
    if f not in list1 and not f.startswith('output_of_final'):
        os.remove(os.path.join(directory, f))
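Because deleting is hard to undo, one option is to compute the list of doomed files first and look at it before calling os.remove. The following is a minimal sketch of that idea; `files_to_delete` and its parameters are names invented here.

```python
def files_to_delete(names, keep, keep_prefix="output_of_final"):
    """Return the names that are neither in `keep` nor start with `keep_prefix`."""
    keep = set(keep)
    return [n for n in names if n not in keep and not n.startswith(keep_prefix)]


names = ["prog_1.py", "output_of_final_1.txt", "tmp.log", "output_1.txt"]
print(files_to_delete(names, keep=["prog_1.py", "prog_2.py", "prog_3.py"]))
# prints ['tmp.log', 'output_1.txt']
```

In the real script `names` would come from os.listdir(directory). Note that str.startswith also accepts a tuple of prefixes if more than one kind of output has to be protected.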

Yes, a regex would work here, but there are easier options, such as the startswith method of strings:
list1 = ['prog_1.py', 'prog_2.py', 'prog_3.py']  # Extend
import os
dir = '/home/dev/codes'  # Change accordingly
for f in os.listdir(dir):
    if (f not in list1) and (not f.startswith('output_of_final')):
        os.remove(os.path.join(dir, f))

How can I read files with similar names on python, rename them and then work with them?

I've already posted here with the same question, but sadly I couldn't come up with a solution (some of you gave me awesome answers, but most of them weren't what I was looking for), so I'll try again, this time giving more information about what I'm trying to do.
So, I'm using a program called GMAT to get some outputs (.txt files with numerical values). These outputs have different names, but because I'm using them for more than one thing I'm getting something like this:
GMATd_1.txt
GMATd_2.txt
GMATf_1.txt
GMATf_2.txt
Now, what I need to do is to use these outputs as inputs in my code. I need to work with them in other functions of my script, and since I will have a lot of these .txt files I want to rename them as I don't want to use them like './path/etc'.
So what I wanted was to write a loop that could get these files and rename them inside the script so I can use these files with the new name in other functions (outside the loop).
So instead of having to do this individually:
GMATds1= './path/GMATd_1.txt'
GMATds2= './path/GMATd_2.txt'
I wanted to write a loop that would do that for me.
I've already tried using a dictionary:
import os
import fnmatch
examples = {}
for filename in os.listdir('.'):
    if fnmatch.fnmatch(filename, 'thing*.txt'):
        examples[filename[:6]] = filename
This does work but I can't use the dictionary key outside the loop.

If I understand correctly, you are trying to fetch files with similar names (or at least a recurring pattern) and rename them. This can be accomplished with the following code:
import glob
import os
all_files = glob.glob('path/to/directory/with/files/GMAT*.txt')
for file in all_files:
    new_path = create_new_path(file)  # possibly split the file name, change directory and/or filename
    os.rename(file, new_path)
The glob library supports * wildcards, which makes it possible to search for files whose names follow a specific pattern. It lists all the matching files in a directory (or in several directories if you include a * wildcard in the directory part). When you iterate over the files, you can either work with their contents directly (as you apparently intend to do) or rename them as shown in this snippet. To rename them you need to generate a new path, so you would have to write the create_new_path function, which takes the old path and returns the new one.
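What create_new_path does depends entirely on the naming scheme you want. As one hypothetical example, assuming the goal is to turn GMATd_1.txt into GMATds1.txt in the same directory (a rule guessed from the variable names in the question), it might look like this:

```python
from pathlib import Path


def create_new_path(old_path):
    """Hypothetical naming rule: 'GMATd_1.txt' becomes 'GMATds1.txt' in the same directory."""
    old_path = Path(old_path)
    prefix, _, number = old_path.stem.rpartition("_")
    if not prefix:
        return old_path  # no underscore in the name, so leave the path unchanged
    return old_path.with_name(f"{prefix}s{number}{old_path.suffix}")


print(create_new_path("path/GMATd_1.txt"))
```

The function returns a pathlib.Path, which os.rename accepts directly.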

Since Python 3.4 you should be using the standard-library pathlib package instead of os or glob.
from pathlib import Path
import shutil
for file_src in Path("path/to/files").glob("GMAT*.txt"):
    # apply the replacement to the file name only, so directory names are left untouched
    file_dest = file_src.with_name(file_src.name.replace("ds", "d_"))
    shutil.move(str(file_src), str(file_dest))

You can use:
import os
path = '.....'   # path where these files are located
path1 = '.....'  # path where you want these files to be stored
i = 1
for file in os.listdir(path):
    if file.endswith('.txt'):
        os.rename(os.path.join(path, file), os.path.join(path1, str(i) + ".txt"))
        i += 1
This renames all the .txt files in the source folder to 1.txt, 2.txt, 3.txt, ..., n.txt in the destination folder.

In Python, How do I check whether a file exists starting or ending with a substring?

I know about os.path.isfile(fname), but now I need to search if a file exists that is named FILEnTEST.txt where n could be any positive integer (so it could be FILE1TEST.txt or FILE9876TEST.txt)
I guess a solution to this could involve substrings that the filename starts/ends with OR one that involves somehow calling os.path.isfile('FILE' + n + 'TEST.txt') and replacing n with any number, but I don't know how to approach either solution.

You would need to write your own filter: list all the files in the directory and test each name against a regular expression:
import os
import re
pattern = re.compile(r"FILE\d+TEST\.txt")
directory = "/test/"
for filename in os.listdir(directory):
    if pattern.fullmatch(filename):
        print(filename)  # do stuff with the matching file
I'm not near a machine with Python installed on it to test the code, but it should be something along those lines.

You can use a regular expression:
/FILE\d+TEST\.txt/
Example: regexr.com.
Then you can use said regular expression and iterate through all of the files in a directory.
import re
import os
filename_re = r'FILE\d+TEST\.txt'
directory = '.'  # placeholder: the directory to search
for filename in os.listdir(directory):
    if re.fullmatch(filename_re, filename):
        # this file has the form FILEnTEST.txt
        # do what you want with it now
        print(filename)

You can also do it as such:
import os
import re
directory = '.'  # placeholder: the directory to search
if any(re.search('regex', file) for file in os.listdir(directory)):  # replace 'regex' with your pattern
    # there's at least 1 such file
    print('found')
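Putting the pieces from these answers together, a small pair of helpers could look like the sketch below. The function names are invented for illustration, and the pattern uses fullmatch so that a name such as FILE1TEST.txt.bak is not counted.

```python
import os
import re

# FILE, one or more digits, then TEST.txt; the dot is escaped so it matches literally.
FILENAME_RE = re.compile(r"FILE\d+TEST\.txt")


def numbered_files(directory="."):
    """Return the names in `directory` that look like FILE<n>TEST.txt, sorted."""
    return sorted(f for f in os.listdir(directory) if FILENAME_RE.fullmatch(f))


def numbered_file_exists(directory="."):
    """True if at least one FILE<n>TEST.txt entry exists in `directory`."""
    return bool(numbered_files(directory))
```

Note that `\d+` also accepts 0 and numbers with leading zeros; if only positive integers without leading zeros should count, `[1-9]\d*` could be used in its place.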

Glob search files in date order?

I have this line of code in my Python script. It searches all the files in a particular directory for names matching *cycle*.log.
for searchedfile in glob.glob("*cycle*.log"):
This works perfectly; however, when I run my script against a network location it does not go through the files in order and instead picks them seemingly at random.
Is there a way to force the code to search by date order?
This question has been asked for php but I am not sure of the differences.
Thanks

To sort files by date:
import glob
import os
files = glob.glob("*cycle*.log")
files.sort(key=os.path.getmtime)
print("\n".join(files))
See also Sorting HOW TO.

Essentially the same as @jfs's answer, but in one line using sorted:
import os,glob
searchedfiles = sorted(glob.glob("*cycle*.log"), key=os.path.getmtime)

Well. The answer is nope. glob uses os.listdir which is described by:
"Return a list containing the names of the entries in the directory given by path. The list is in arbitrary order. It does not include the special entries '.' and '..' even if they are present in the directory."
So you are actually lucky that you got it sorted. You need to sort it yourself.
This works for me:
import glob
import os
import time
searchedfile = glob.glob("*.cpp")
files = sorted(searchedfile, key=lambda file: os.path.getctime(file))
for file in files:
    print("{} - {}".format(file, time.ctime(os.path.getctime(file))))
Also note that getctime returns the creation time on Windows but the time of the last metadata change on Unix; if you want the modification time, use getmtime instead.
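For reference, the same sort can be written with pathlib and wrapped in a function. This is a sketch only: `logs_by_mtime` is an invented name, and the demonstration sets modification times by hand in a temporary directory so that the ordering is visible.

```python
import os
import tempfile
from pathlib import Path


def logs_by_mtime(directory, pattern="*cycle*.log"):
    """Return matching paths ordered from oldest to newest modification time."""
    return sorted(Path(directory).glob(pattern), key=lambda p: p.stat().st_mtime)


# Small demonstration in a throwaway directory with explicitly set timestamps.
BASE = 1_600_000_000  # arbitrary epoch seconds used only for the demo
with tempfile.TemporaryDirectory() as tmp:
    for name, offset in [("b_cycle_2.log", 2000), ("a_cycle_3.log", 3000), ("c_cycle_1.log", 1000)]:
        path = Path(tmp) / name
        path.write_text("demo")
        os.utime(path, (BASE + offset, BASE + offset))  # set access and modification time
    print([p.name for p in logs_by_mtime(tmp)])
    # prints ['c_cycle_1.log', 'b_cycle_2.log', 'a_cycle_3.log']
```

Passing `reverse=True` to sorted would give newest first instead.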

If your paths are in sortable order then you can always sort them as strings (as others have already mentioned in their answers).
However, if your paths use a datetime format like %d.%m.%Y, it becomes a bit more involved. Since strptime does not support wildcards, we developed a module, datetime-glob, to parse the date/times from paths including wildcards.
Using datetime-glob, you could walk through the tree, list a directory, parse the date/times and sort them as tuples (date/time, path).
From the module's test cases:
import pathlib
import tempfile
import datetime_glob
def test_sort_listdir(self):
    with tempfile.TemporaryDirectory() as tempdir:
        pth = pathlib.Path(tempdir)
        (pth / 'some-description-20.3.2016.txt').write_text('tested')
        (pth / 'other-description-7.4.2016.txt').write_text('tested')
        (pth / 'yet-another-description-1.1.2016.txt').write_text('tested')
        matcher = datetime_glob.Matcher(pattern='*%-d.%-m.%Y.txt')
        subpths_matches = [(subpth, matcher.match(subpth.name)) for subpth in pth.iterdir()]
        dtimes_subpths = [(mtch.as_datetime(), subpth) for subpth, mtch in subpths_matches]
        subpths = [subpth for _, subpth in sorted(dtimes_subpths)]
        # yapf: disable
        expected = [
            pth / 'yet-another-description-1.1.2016.txt',
            pth / 'some-description-20.3.2016.txt',
            pth / 'other-description-7.4.2016.txt'
        ]
        # yapf: enable
        self.assertListEqual(subpths, expected)

Not with glob alone. As you're using it, glob.glob returns all the matching file names at once as a list and has no option for ordering them. If only the final result is important, you could add a second pass that checks each file's date and re-sorts the list based on that. If the order in which the files are found matters, glob is probably not the best way to do this.

You can sort the list of files that come back using os.path.getmtime or os.path.getctime. See this other SO answer and note the comments as well.

Problem reading text files without extensions in python

I have written a piece of code which is supposed to read the text inside several files located in a directory. These are basically text files, but they do not have any extension. My code is not able to read them:
import glob
import os
corpus_path = 'Reviews/'
for infile in glob.glob(os.path.join(corpus_path, '*.*')):
    with open(infile, 'r') as f:
        review_file = f.read()
    print(review_file)
To test whether this code works, I added a dummy text file, dummy.txt, which was read because it has an extension. But I don't know what should be done so that files without extensions can be read.
Can someone help me? Thanks

Glob patterns don't work the same way as wildcards on the Windows platform. Just use * instead of *.*. i.e. os.path.join(corpus_path,'*'). Note that * will match every file in the directory - if that's not what you want then you can revise the pattern accordingly.
See the glob module documentation for more details.

Just use * instead of *.*.
The latter requires an extension to be present (more precisely, there needs to be a dot in the filename), the former doesn't.
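The difference is easy to see without touching the file system, because the glob module matches names with fnmatch-style patterns. The names in this small sketch are made up for illustration.

```python
import fnmatch

names = ["dummy.txt", "review_001", "README", "notes.v2"]

# '*.*' needs a literal dot somewhere in the name; '*' matches any name.
print(fnmatch.filter(names, "*.*"))
print(fnmatch.filter(names, "*"))
```

The first call keeps only dummy.txt and notes.v2, while the second keeps all four names. One caveat: unlike fnmatch, glob itself does not let `*` match names that begin with a dot.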

You could search for * instead of *.*, but this will match every file in your directory.
Fundamentally, this means that you will have to handle cases where the file you are opening is not a text file.
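One way to handle that is to skip anything that is not a regular file and anything that fails to decode as text. The sketch below assumes the reviews are UTF-8 encoded; `read_text_files` is a name chosen here, and a clean decode is only a heuristic for "this is a text file".

```python
from pathlib import Path


def read_text_files(directory, encoding="utf-8"):
    """Return {file name: text} for every regular file that decodes cleanly."""
    texts = {}
    for path in sorted(Path(directory).glob("*")):
        if not path.is_file():
            continue  # skip sub-directories and other non-file entries
        try:
            texts[path.name] = path.read_text(encoding=encoding)
        except UnicodeDecodeError:
            continue  # not valid text in this encoding, so treat it as binary and skip
    return texts
```

With the layout from the question, `read_text_files('Reviews/')` would return the contents of the extensionless review files keyed by file name.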

It seems that you need
from os import listdir
for filename in (fn for fn in listdir(corpus_path) if '.' not in fn):
    pass  # do something
You could also write
from os import listdir
for fn in listdir(corpus_path):
    if '.' not in fn:
        pass  # do something
but the former, with a generator, spares one indentation level.
