Split filenames with python - python

I have file paths from which I want only 'foo' and 'bar' left after splitting.
dn = "C:\\X\\Data\\"
files:
f = "C:\\X\\Data\\foo.txt"
f = "C:\\X\\Data\\bar.txt"
I have tried f.split(".",1)[0]
I thought that since dn and .txt are pre-defined I could subtract them, but that does not work.
Split does not work for me.

How about using the proper path handling methods from os.path?
>>> f = 'C:\\X\\Data\\foo.txt'
>>> import os
>>> os.path.basename(f)
'foo.txt'
>>> os.path.dirname(f)
'C:\\X\\Data'
>>> os.path.splitext(f)
('C:\\X\\Data\\foo', '.txt')
>>> os.path.splitext(os.path.basename(f))
('foo', '.txt')

To deal with paths and file names, it is best to use Python's built-in os.path module. Take a look at the dirname, basename and split functions in that module.
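One thing to watch: os.path follows the conventions of the platform the script runs on, so the basename call shown earlier only splits on backslashes when run under Windows. A minimal sketch, assuming you may also need to process Windows-style paths on another OS, is to call ntpath (the Windows flavour of os.path in the standard library) directly. The helper name bare_name is made up for illustration.

```python
import ntpath  # Windows implementation of os.path, importable on any OS


def bare_name(path):
    """Return the file name without its directory or extension."""
    name, _extension = ntpath.splitext(ntpath.basename(path))
    return name


print(bare_name("C:\\X\\Data\\foo.txt"))  # foo
print(bare_name("C:\\X\\Data\\bar.txt"))  # bar
```

On Windows itself, plain os.path gives the same result, so this is only needed when the paths and the running OS may differ.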

A simple example:
import os
from os import path
path_to_directory = "C:\\X\\Data"
for f in os.listdir(path_to_directory):
    # splitext separates the name from the extension
    name, extension = path.splitext(f)
    print(name)
Output:
foo
bar

These two lines return a list of file names without extensions:
import os
[fname.rsplit('.', 1)[0] for fname in os.listdir("C:\\X\\Data\\")]
It seems you've left out some code. From what I can tell you're trying to split the contents of the file.
To fix your problem, you need to operate on a list of the files in the directory. That is what os.listdir does for you. I've also added a more sophisticated split: rsplit operates from the right, and with 1 as the second argument it splits only at the first . it finds from that side, i.e. the last . in the name.
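To see why the direction of the split matters, here is a small comparison of split, rsplit and os.path.splitext on a few made-up file names. The function name variants and the sample names are illustrative only.

```python
import os.path


def variants(name):
    """Return the name cut at the first dot, at the last dot, and via splitext."""
    first_dot = name.split(".", 1)[0]
    last_dot = name.rsplit(".", 1)[0]
    stem = os.path.splitext(name)[0]
    return first_dot, last_dot, stem


for name in ["foo.txt", "archive.tar.gz", "README", ".bashrc"]:
    print(name, variants(name))
```

The three agree for simple names like foo.txt. They differ for names with several dots, and for names that start with a dot, where splitext keeps the whole name while both split variants return an empty string.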

Another example (it assumes backslash separators and a single dot in the file name):
f.split('\\')[-1].split('.')[0]

Using python3 and pathlib:
import pathlib
f = 'C:\\X\\Data\\foo.txt'
print(pathlib.PureWindowsPath(f).stem)
This prints: foo

Related

Rename all xml files within a given directory with Python

I have a lot of xml files which are named like:
First_ExampleXML_Only_This_Should_Be_Name_20211234567+1234565.xml
Second_ExampleXML_OnlyThisShouldBeName_202156789+55684894.xml
Third_ExampleXML_Only_This_Should_Be_Name1_2021445678+6963696.xml
Fourth_ExampleXML_Only_This_Should_Be_Name2_20214567+696656.xml
I have to write a script that goes through all of the files and renames them, so that only this part of each name is left:
Only_This_Should_Be_Name.xml
OnlyThisShouldBeName.xml
Only_This_Should_Be_Name1.xml
Only_This_Should_Be_Name2.xml
At the moment I have something like this, but I am struggling to get exactly what I need. I guess I have to count from the second _ up to _202 and take everything in between.
from os import listdir
fnames = listdir('.')
for fname in fnames:
    # replace .xml with type of file you want this to have impact on
    if fname.endswith('.xml'):
        pass  # the renaming logic is the missing part
Does anyone have an idea what the best approach would be?
You can split each xml file name on underscores, drop the parts you do not need, and rename the file with the remaining parts joined back together, as below.
import os
fnames = os.listdir('.')
for fname in fnames:
    # replace .xml with type of file you want this to have impact on
    if fname.endswith('.xml'):
        # keep everything between the second underscore and the last one
        newName = '_'.join(fname.split("_")[2:-1])
        os.rename(fname, newName + ".xml")
Here you are eliminating the first two underscore-separated parts and the last one.
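Before renaming anything, it can help to check the name computation on its own. Below is a minimal sketch that applies the same split-and-join idea to the sample names from the question without touching the disk. The function name new_xml_name and the guard against names with too few parts are my own additions.

```python
def new_xml_name(fname):
    """Keep the parts between the second underscore and the last one."""
    parts = fname.split("_")
    if not fname.endswith(".xml") or len(parts) < 4:
        return None  # name does not follow the expected layout
    return "_".join(parts[2:-1]) + ".xml"


samples = [
    "First_ExampleXML_Only_This_Should_Be_Name_20211234567+1234565.xml",
    "Second_ExampleXML_OnlyThisShouldBeName_202156789+55684894.xml",
    "Third_ExampleXML_Only_This_Should_Be_Name1_2021445678+6963696.xml",
    "Fourth_ExampleXML_Only_This_Should_Be_Name2_20214567+696656.xml",
]

for old in samples:
    print(old, "->", new_xml_name(old))
```

Once the printed names look right, the same function can feed os.rename inside the listdir loop.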
There are two problems here:
Finding files of one kind in the directory
Whilst listdir will work, you might as well glob them:
from pathlib import Path
for fn in Path("/path").glob("*.xml"):
    ...
Renaming files
In this case your files are named "file_name_NUMBERS.xml" and we want to strip the numbers out, so we'll use a regex. (Edit: this is not the best way in this case. Just split and combine as in the other answer.)
import re
from pathlib import Path
for fn in Path("dir").glob("*.xml"):
    # capture everything before the first underscore that is followed by digits
    new_name = re.search(r"(.*?)_[0-9]+", fn.stem).group(1)
    fn.rename(fn.with_name(new_name + ".xml"))
Edit: I don't know why I overcomplicated things. I'll leave the re solution there for more difficult cases, but in this case you can just do:
new_name = "_".join(fn.stem.split("_")[:-1])
This is greatly superior, as it doesn't depend on the precise naming of the files.
Note that you can do all this without pathlib, but you asked for the best way ;)
Lastly, to answer an implicit question, nothing stops you from wrapping all this in a function and passing an argument to glob for different types of files.
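A rough sketch of that last idea, wrapping the split-and-join rename in a function that takes the glob pattern as an argument. The function name strip_last_field and the dry_run flag are assumptions of mine, not part of the answer above.

```python
from pathlib import Path


def strip_last_field(directory, pattern="*.xml", dry_run=True):
    """Drop the last underscore-separated field from each matching file name.

    Returns a list of (old_name, new_name) pairs. Nothing is renamed
    unless dry_run is False.
    """
    renames = []
    # sorted() materialises the matches before any file is renamed
    for fn in sorted(Path(directory).glob(pattern)):
        new_stem = "_".join(fn.stem.split("_")[:-1])
        if not new_stem:
            continue  # no underscore in the name, leave the file alone
        target = fn.with_name(new_stem + fn.suffix)
        renames.append((fn.name, target.name))
        if not dry_run:
            fn.rename(target)
    return renames
```

Calling strip_last_field("dir") first and reading the returned pairs gives a preview. Passing dry_run=False performs the renames, and a different pattern such as "*.json" targets other file types.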
I think regex will be the simplest approach here, which in Python can be accomplished with the re module.
import os
import re
fnames = os.listdir('.')
for fname in fnames:
    # names that do not match the pattern come back unchanged
    result = re.sub(r"^.*?_ExampleXML_(.*?)_[\d+]+\.xml$", r"\1.xml", fname)
    if result != fname:
        os.rename(fname, result)
There are several pattern matching strategies you could employ, depending on your use case.
For instance you could try variants like the following, depending on how specific/general you need to be:
^.*?_ExampleXML_(.*?)_\d+\.xml$ (https://regex101.com/r/hYOLMF/1)
^.*?_ExampleXML_(.*?)_2021\d+\.xml$ (https://regex101.com/r/UzEsbO/1)
^.*?_ExampleXML_(.*?)_[^_]+\.xml$ (https://regex101.com/r/lKzYhq/1)
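As a quick check of how these variants behave, the sketch below applies each pattern to one of the sample names from the question. Note that the question's names contain a + in the trailing number, which \d alone does not match. The regex101 links may have been tested against different sample names, so treat this only as an illustration of the difference. The labels and the helper name rewrite are mine.

```python
import re

sample = "First_ExampleXML_Only_This_Should_Be_Name_20211234567+1234565.xml"

patterns = {
    "digits and plus": r"^.*?_ExampleXML_(.*?)_[\d+]+\.xml$",
    "digits only": r"^.*?_ExampleXML_(.*?)_\d+\.xml$",
    "2021 prefix": r"^.*?_ExampleXML_(.*?)_2021\d+\.xml$",
    "anything but underscore": r"^.*?_ExampleXML_(.*?)_[^_]+\.xml$",
}


def rewrite(pattern, name):
    """Return the rewritten name, or the name unchanged if there is no match."""
    return re.sub(pattern, r"\1.xml", name)


for label, pattern in patterns.items():
    print(label, "->", rewrite(pattern, sample))
```

For this sample, the "digits and plus" and "anything but underscore" patterns produce Only_This_Should_Be_Name.xml, while the two \d-only patterns leave the name unchanged because of the + sign.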

renaming the filename with regex in python using re

I have a folder that contains multiple files with names like the one below, and there are many different ones:
_EGAZ00001018697_2014_ICGC_130906_D81P8DQ1_0153_C2704ACXX.nopd.AOCS_001_ICGCDBDE20130916001.rsem.bam
Now I want to rename each of them to just the last part, e.g. ICGCDBDE20130916001.rsem.bam, which changes according to the file in the path. The string in front of .rsem.bam should be the last one separated by "_". All the files in the directory should be renamed this way. I am thinking of using a regular expression, so I came up with the pattern below:
pat=r'_(.*)_(.*)_(.*)_(.*)_(.\w+)'
This separates my filename as desired, and I could rename the files by using a variable where I take only pat[4]. I want to use Python, since I am learning it now for small tasks such as file renaming and plan to convert my workflows to Python over time, but I am unable to get it working. How should I make this work in Python? I am also stuck on what the corresponding bash regex should be, since this is a pretty long filename and I am new to such patterns. Below is my code. It does not rename anything yet and is only meant to check whether the pattern works. How do I get it to rename the files?
import re
import os
_src = "path/bam/test/"
_ext = ".rsem.bam"
endsWithNumber = re.compile(r'_(.*)_(.*)_(.*)_(.*)_(.\w+)'+(re.escape(_ext))+'$')
print(endsWithNumber)
for filename in os.listdir(_src):
    m = endsWithNumber.search(filename)
    print(m)
I would appreciate answers in both Python and bash. However, I would prefer Python for my own understanding and future learning.
You can use rpartition, which will separate the part you want from the rest into a three-part tuple.
Given:
>>> fn
'_EGAZ00001018697_2014_ICGC_130906_D81P8DQ1_0153_C2704ACXX.nopd.AOCS_001_ICGCDBDE20130916001.rsem.bam'
You can do:
>>> fn.rpartition('_')
('_EGAZ00001018697_2014_ICGC_130906_D81P8DQ1_0153_C2704ACXX.nopd.AOCS_001', '_', 'ICGCDBDE20130916001.rsem.bam')
Then:
>>> _,sep,new_name=fn.rpartition('_')
>>> new_name
'ICGCDBDE20130916001.rsem.bam'
If you want to use a regex:
>>> re.search(r'_([^_]+$)', fn).group(1)
'ICGCDBDE20130916001.rsem.bam'
As a practical matter, you would test to see if there was a match before using group(1):
>>> m=re.search(r'_([^_]+$)', fn)
>>> new_name = m.group(1) if m else fn
For sed you can do:
$ echo "$fn" | sed -E 's/.*_([^_]*)$/\1/'
ICGCDBDE20130916001.rsem.bam
Or in Bash, same regex:
$ [[ $fn =~ _([^_]*)$ ]] && echo "${BASH_REMATCH[1]}"
ICGCDBDE20130916001.rsem.bam
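To turn the rpartition idea into an actual rename over a directory, something along these lines should work. It is a sketch under assumptions: the function name rename_to_last_field is invented here, and skipping a file when the target name already exists is my own choice of policy.

```python
import os


def rename_to_last_field(src_dir, ext=".rsem.bam"):
    """Rename each file ending in ext to the part after its last underscore."""
    renamed = []
    for filename in sorted(os.listdir(src_dir)):
        if not filename.endswith(ext):
            continue
        _, sep, new_name = filename.rpartition("_")
        if not sep:
            continue  # no underscore, nothing to strip
        old_path = os.path.join(src_dir, filename)
        new_path = os.path.join(src_dir, new_name)
        if os.path.exists(new_path):
            continue  # skip rather than overwrite an existing file
        os.rename(old_path, new_path)
        renamed.append((filename, new_name))
    return renamed
```

Joining each name with src_dir matters because os.listdir returns bare file names, so the rename would otherwise only work when the script runs from inside that directory.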
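You can use a list comprehension:
import re
import os
_src = "path/bam/test/"
filenames = os.listdir(_src)
# raw string so that \. reaches the regex engine unchanged
new_s = [re.search(r"[a-zA-Z0-9]+\.rsem\.bam", filename) for filename in filenames]
for first, second in zip(filenames, new_s):
    if second is not None:
        # join with _src so the rename works from any working directory
        os.rename(os.path.join(_src, first), os.path.join(_src, second.group(0)))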
Too much work.
newname = oldname.rsplit('_', 1)[1]
import os
fname = 'YOUR_FILENAME.avi'
fname1 = fname.split('.')
fname2 = str(fname1[0]) + '.mp4'
os.rename('path to your source file' + str(fname), 'path to your destination file' + str(fname2))
fname = fname2

How to insert strings and slashes in a path?

I'm trying to extract tar.gz files which are located in different folders named srm01, srm02 and srm03.
The folder name must be given as input (a string) to run my code.
I'm trying to do something like this:
import tarfile
import glob
thirdBloc = 'srm01' #Then, that must be 'srm02', or 'srm03'
for f in glob.glob('C://Users//asediri//Downloads/srm/'+thirdBloc+'/'+'*.tar.gz'):
    tar = tarfile.open(f)
    tar.extractall('C://Users//asediri//Downloads/srm/'+thirdBloc)
I get this error message:
IOError: CRC check failed 0x182518 != 0x7a1780e1L
First I want to be sure that my code finds the .tar.gz files, so I tried to just print the paths returned by glob:
thirdBloc = 'srm01' #Then, that must be 'srm02', or 'srm03'
for f in glob.glob('C://Users//asediri//Downloads/srm/'+thirdBloc+'/'+'*.tar.gz'):
    print f
That gives:
C://Users//asediri//Downloads/srm/srm01\20160707000001-server.log.1.tar.gz
C://Users//asediri//Downloads/srm/srm01\20160707003501-server.log.1.tar.gz
The os.path.exists method tells me that my files don't exist.
print os.path.exists('C://Users//asediri//Downloads/srm/srm01\20160707000001-server.log.1.tar.gz')
That gives: False
Is there a way to do this properly? What is the best way to get the right paths in the first place?
In order to join paths you have to use os.path.join as follows:
import os
import tarfile
import glob
thirdBloc = 'srm01' #Then, that must be 'srm02', or 'srm03'
for f in glob.glob(os.path.join('C://Users//asediri//Downloads/srm/', thirdBloc, '*.tar.gz')):
    tar = tarfile.open(f)
    tar.extractall(os.path.join('C://Users//asediri//Downloads/srm/', thirdBloc))
os.path.join will create the correct paths for your filesystem:
f = os.path.join('C://Users//asediri//Downloads/srm/', thirdBloc, '*.tar.gz')
C://Users//asediri//Downloads/srm/srm01\20160707000001-server.log.1.tar.gz
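To confirm that the files are found before trying to extract them, the pattern building can be isolated in a small function. This is only a sketch: find_archives is a hypothetical helper name, and base_dir and block stand in for the download folder and the srm01/srm02/srm03 value.

```python
import glob
import os


def find_archives(base_dir, block):
    """Return the .tar.gz paths inside base_dir/block, sorted by name."""
    pattern = os.path.join(base_dir, block, "*.tar.gz")
    return sorted(glob.glob(pattern))
```

For example, find_archives("C:/Users/asediri/Downloads/srm", "srm01") returns the list of paths exactly as glob found them, and each of those can be passed straight to os.path.exists or tarfile.open without retyping the path as a string literal.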
Never use an unescaped \ in Python file paths: \201 is an octal escape for the \x81 character. It results in this:
C://Users//asediri//Downloads/srm/srm01ü60707000001-server.log.1.tar.gz
This is why os.path.exists does not find it.
Or use a raw string (r"C:\...")
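The effect of the escape can be checked directly by comparing a normal string literal with a raw one. The variable names below are illustrative, and only the tail of the path from the question is used.

```python
# A normal string literal: \201 is read as an octal escape (one character, \x81).
plain = 'srm01\20160707000001-server.log.1.tar.gz'
# A raw string literal keeps the backslash and the digits exactly as typed.
raw = r'srm01\20160707000001-server.log.1.tar.gz'

print(repr(plain))
print(repr(raw))
print(len(raw) - len(plain))  # 3: four typed characters collapsed into one
```

The path printed by glob was correct all along. The problem only appears when that printed text is pasted back into source code as a normal string literal.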
I would suggest you do this:
import glob
import os
os.chdir("C:/Users/asediri/Downloads/srm/srm01")
# after chdir, a relative pattern is enough to find the archives
for f in glob.glob("*.tar.gz"):
    print(f)

Rename a folder with source folder name matched by wildcard ("*")

I have a local folder named "abcd-1" and I want to do something like this:
import os
os.rename("abcd*", "abcd")
I know there's only one such folder so it's a valid operation, but it doesn't look like os.rename supports *. How can I solve it?
See glob:
>>> import os, glob
>>> for f in glob.glob("abcd*"):
...     os.rename(f, "abcd")
...
>>>
Check if there is only one result or use glob.glob("abcd*")[0] for first result.
Use os.path.isdir() to check whether it is a directory
You can use a combination of glob, the os.path.isdir() function (to determine whether a match is a directory), and then os.rename() to rename the actual directory.
Example:
import glob
import os
import os.path
# the wildcard is needed here, otherwise only a folder named exactly "abcd" matches
lst = glob.glob("abcd*")
for element in lst:
    if os.path.isdir(element):
        os.rename(element, "abcd")
Use the glob module, e.g.
glob.glob("abcd*")
will return ["abcd-1"]
Then you can rename the folder.
You should probably use an assert statement to make sure there's only 1 result.
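Putting the suggestions from these answers together (glob, the isdir check, and making sure there is exactly one match), a minimal sketch could look like this. The function name rename_single_match is made up, and it raises ValueError instead of using assert so that the check also runs when Python is started with -O.

```python
import glob
import os


def rename_single_match(pattern, new_name):
    """Rename the one directory matching pattern to new_name (same parent)."""
    matches = [p for p in glob.glob(pattern) if os.path.isdir(p)]
    if len(matches) != 1:
        raise ValueError(
            "expected exactly one directory for %r, found %d" % (pattern, len(matches))
        )
    target = os.path.join(os.path.dirname(matches[0]), new_name)
    os.rename(matches[0], target)
    return target
```

Usage would be rename_single_match("abcd*", "abcd"). Keep in mind that if a folder called abcd already exists, the pattern abcd* matches it too, so the function will then refuse to rename anything.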

read filenames and write directly into a list

Is it possible to get Python to look in a folder and put all of the filenames (with a certain extension) into a list?
e.g.:
[filename1.txt, filename2.txt,...]
You can do this easily with the glob module:
import glob
filenames = glob.glob('<some_path>/*.<extension>')
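I always use the os module for this, and it works perfectly for me. Note that os.listdir returns every entry in the folder, so you still have to filter by extension yourself.
import os
path = "."  # directory to list
file_list = os.listdir(path)
print(file_list)
# e.g. ["file1.txt", "file2.txt", ...]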
Here's a quick answer I found.
import os
# wrap in list() because filter() returns a lazy iterator in Python 3
txt_files = list(filter(lambda x: x.endswith('.txt'), os.listdir('mydir')))
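For completeness, here is a small sketch that combines os.listdir with an extension filter and returns a real, sorted list. The function name list_by_extension is illustrative, and excluding sub-directories whose names happen to end in the extension is an extra check I added.

```python
import os


def list_by_extension(directory, extension):
    """Return the sorted names of regular files in directory ending with extension."""
    return sorted(
        name
        for name in os.listdir(directory)
        if name.endswith(extension)
        and os.path.isfile(os.path.join(directory, name))
    )
```

For example, list_by_extension("mydir", ".txt") gives only the .txt file names, without the directory prefix. The glob version shown earlier returns paths that include the directory part instead.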
