Python and regex - how to find anytext_NUMBER_svm.pkl - python

I have file names that are in this format:
anytext_NUMBER_svm.pkl
I need to loop through all files in a directory and find the ones that match this format. Given these files:
file1.txt
file2.txt
anytext_1_svm.pkl
anytext_2_svm.pkl
anytext_3_svm.pkl
The matched files will be this:
anytext_1_svm.pkl
anytext_2_svm.pkl
anytext_3_svm.pkl
How do I use a Python regex to do this?

An option that:
doesn't use re
makes sure the comparison is only on the filename part - not part of a path
restricts the number of filename patterns to validate further using iglob
Code:
from glob import iglob
import os.path
for fname in iglob('*_*_svm.pkl'):
    path, name = os.path.split(fname)
    anytext, digit, rest = name.split('_', 2)
    if digit.isdigit():  # add criteria for anytext if required...
        print(name)  # ...then do something with the match

This regex should solve your problem:
>>> import re
>>> regex = re.compile(r'.+_\d+_svm\.pkl')
>>> regex.search('anytext_1_svm.pkl') is not None
True
But you should definitely take a look at the documentation: http://docs.python.org/library/re.html

I would suggest a review of this page:
http://docs.python.org/py3k/library/re.html#module-re
It will help you understand how to write regular expressions and check that you are matching things properly. For the number, use [0-9]+ (one or more digits; [0-9]* would also accept no digits at all), use _ to separate your groups, and wrap the match in a simple conditional. This should be a quick project.

import glob
# [0-9] matches exactly one digit, so this finds anytext_1_svm.pkl but not anytext_10_svm.pkl
file_list = glob.glob('anytext_[0-9]_svm.pkl')

The regex to match "anytext_NUMBER_svm.pkl" is very simple:
r'.+_\d+_svm\.pkl'
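A minimal sketch that puts this pattern to work on a list of file names. The helper name find_svm_pickles is made up for illustration. It uses fullmatch rather than search, on the assumption that a name such as anytext_1_svm.pkl.bak should be rejected; an unanchored search would accept it.

```python
import os
import re

# Same pattern as in the answers above, compiled once.
SVM_PICKLE = re.compile(r'.+_\d+_svm\.pkl')


def find_svm_pickles(names):
    """Return the names that have the form anytext_NUMBER_svm.pkl.

    Hypothetical helper: fullmatch requires the whole name to fit the
    pattern, so trailing text after .pkl is not allowed.
    """
    return [name for name in names if SVM_PICKLE.fullmatch(name)]


def find_svm_pickles_in(directory):
    """Apply the filter to the bare file names that os.listdir returns."""
    return find_svm_pickles(os.listdir(directory))
```

Calling find_svm_pickles_in('.') would then list the matching files in the current directory.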

Related

Rename all xml files within a given directory with Python

I have lot of xml files which are named like:
First_ExampleXML_Only_This_Should_Be_Name_20211234567+1234565.xml
Second_ExampleXML_OnlyThisShouldBeName_202156789+55684894.xml
Third_ExampleXML_Only_This_Should_Be_Name1_2021445678+6963696.xml
Fourth_ExampleXML_Only_This_Should_Be_Name2_20214567+696656.xml
I have to make a script that will go through all of the files and rename them, so only this is left from the example:
Only_This_Should_Be_Name.xml
OnlyThisShouldBeName.xml
Only_This_Should_Be_Name1.xml
Only_This_Should_Be_Name2.xml
At the moment I have something like this, but I am struggling to get exactly what I need. I guess I have to count from the second _ up to _202 and take everything in between.
from os import listdir
fnames = listdir('.')
for fname in fnames:
    # replace .xml with type of file you want this to have impact on
    if fname.endswith('.xml'):
        pass  # the renaming step is what I am missing
Does anyone have an idea of the best approach?
You can split each XML file name on underscores, drop the first two parts and the last part, and rename the file to what is left, as below.
import os
fnames = os.listdir('.')
for fname in fnames:
    # replace .xml with type of file you want this to have impact on
    if fname.endswith('.xml'):
        newName = '_'.join(fname.split("_")[2:-1])
        os.rename(fname, newName+".xml")
    else:
        continue
Here you are dropping the two parts before the name and the number part after its last "_".
There are two problems here:
Finding files of one kind in the directory
Whilst listdir will work, you might as well glob them:
from pathlib import Path
for fn in Path("/path").glob("*.xml"):
    ...
Renaming files
In this case your files are named "file_name_NUMBERS.xml" and we want to strip the numbers out, so we'll use a regex (edit: this is not the best way in this case; just split and join as in the other answer):
import re
from pathlib import Path
for fn in Path("dir").glob("*.xml"):
    new_name = re.search(r"(.*?)_[0-9]+", fn.stem).group(1)
    fn.rename(fn.with_name(new_name + ".xml"))
Edit: I don't know why I overcomplicated things. I'll leave the re solution there for more difficult cases, but in this case you can just do:
    new_name = "_".join(fn.stem.split("_")[:-1])
This is greatly superior, as it doesn't depend on the precise naming of the files.
Note that you can do all this without pathlib, but you asked for the best way ;)
Lastly, to answer an implicit question, nothing stops you wrapping all this in a function and passing an argument to glob for different types of files.
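As a rough sketch of that last point, the split-and-join version could be wrapped like this. The function name strip_last_part and the dry_run flag are made up for illustration; the flag is there so you can inspect the planned renames before touching any file.

```python
from pathlib import Path


def strip_last_part(directory, pattern="*.xml", dry_run=True):
    """Drop the last underscore-separated part from each matching file name.

    Returns a list of (old_name, new_name) pairs. Files are only renamed
    when dry_run is False. Hypothetical helper, not from the answer above.
    """
    changes = []
    # sorted() materialises the glob results before any file is renamed.
    for fn in sorted(Path(directory).glob(pattern)):
        parts = fn.stem.split("_")
        if len(parts) < 2:
            continue  # no underscore, so there is nothing to strip
        target = fn.with_name("_".join(parts[:-1]) + fn.suffix)
        changes.append((fn.name, target.name))
        if not dry_run:
            fn.rename(target)
    return changes
```

Passing a different pattern, for example "*.json", applies the same renaming to another file type.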
I think regex will be the simplest approach here, which in python can be accomplished with the re module.
import os
import re
fnames = os.listdir('.')
for fname in fnames:
    result = re.sub(r"^.*?_ExampleXML_(.*?)_[\d+]+\.xml$", r"\1.xml", fname)
    if result != fname:
        os.rename(fname, result)
There are several pattern matching strategies you could employ, depending on your use case.
For instance you could try variants like the following, depending on how specific/general you need to be:
^.*?_ExampleXML_(.*?)_\d+\.xml$ (https://regex101.com/r/hYOLMF/1)
^.*?_ExampleXML_(.*?)_2021\d+\.xml$ (https://regex101.com/r/UzEsbO/1)
^.*?_ExampleXML_(.*?)_[^_]+\.xml$ (https://regex101.com/r/lKzYhq/1)
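To check a pattern against the example names before renaming anything, a small pure function helps. This is only a sketch: new_name is a made-up helper, and it uses the [\d+]+ variant because the trailing block in the example names contains a plus sign.

```python
import re

# Lazy prefix, the fixed _ExampleXML_ marker, the captured name, then a
# trailing block made only of digits and '+' before the extension.
PATTERN = re.compile(r"^.*?_ExampleXML_(.*?)_[\d+]+\.xml$")


def new_name(fname):
    """Return the shortened name, or None if fname does not fit the pattern."""
    match = PATTERN.match(fname)
    return match.group(1) + ".xml" if match else None
```

A None result means the file would be left alone, which is the same effect as the result != fname check in the loop above.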

How to fix pattern, that I use to get list of files in folder with standard library glob?

I have the following files:
/tmp/test_glob/client.log.71.gz
/tmp/test_glob/client.log.63.gz
/tmp/test_glob/client.log.11
/tmp/test_glob/core_dump.log
/tmp/test_glob/client.log.32
/tmp/test_glob/dm.log
/tmp/test_glob/client.log
/tmp/test_glob/client.log.1
/tmp/test_glob/client.log.64.gz
I want to get all .log files, EXCEPT the files, that end with .gz.
The desired result should be the following:
/tmp/test_glob/client.log.11
/tmp/test_glob/core_dump.log
/tmp/test_glob/client.log.32
/tmp/test_glob/dm.log
/tmp/test_glob/client.log
/tmp/test_glob/client.log.1
I have written this simple code:
import glob
import os
glob_pattern = u'*.log*'
for log_path in glob.glob(os.path.join('/tmp/test_glob', glob_pattern)):
    print('log_path: ', log_path)
but it returns all files from the folder /tmp/test_glob/.
I tried to modify this pattern like this:
glob_pattern = u'*.log.[0-9][0-9]'
but it returns only
/tmp/test_glob/client.log.11
/tmp/test_glob/client.log.32
How can I fix this pattern?
Using Pythex (a Python regex tester), this regular expression worked well for your goal. Note that it is a regex for the re module, not a pattern that glob.glob understands:
regex_pattern = r'.*(\.log)(?!.*(gz)).*'
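Try **/*.log!(*.gz) (this is extended glob syntax; Python's glob module does not support !(...), so it only applies to shells and tools that do).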
Test using globster.xyz
That isn't a glob pattern. You don't want glob. You want to use the re module functions to filter the results of os.listdir.
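A minimal sketch of that suggestion. It assumes rotated logs carry a purely numeric suffix (client.log.1, client.log.32), as in the listing in the question; the names select_logs and list_logs are made up for illustration.

```python
import os
import re

# Anything ending in .log, optionally followed by one numeric rotation
# suffix such as .1 or .32. Compressed files (.gz) do not fit this shape.
LOG_NAME = re.compile(r'.+\.log(\.\d+)?')


def select_logs(names):
    """Keep only the names that look like uncompressed log files."""
    return [name for name in names if LOG_NAME.fullmatch(name)]


def list_logs(directory):
    """Return full paths of the uncompressed log files in directory."""
    return [os.path.join(directory, name)
            for name in select_logs(os.listdir(directory))]
```

If your rotation scheme uses other suffixes, the optional group in the pattern is the part to adjust.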

In Python, How do I check whether a file exists starting or ending with a substring?

I know about os.path.isfile(fname), but now I need to search if a file exists that is named FILEnTEST.txt where n could be any positive integer (so it could be FILE1TEST.txt or FILE9876TEST.txt)
I guess a solution to this could involve substrings that the filename starts/ends with OR one that involves somehow calling os.path.isfile('FILE' + n + 'TEST.txt') and replacing n with any number, but I don't know how to approach either solution.
You would need to write your own filtering system, by getting all the files in a directory and then matching them to a regex string and seeing if they fail the test or not:
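import os
import re
# raw string, escaped dot, and $ so that nothing may follow .txt
pattern = re.compile(r"FILE\d+TEST\.txt$")
directory = "/test/"
for filename in os.listdir(directory):
    if pattern.match(filename):
        print(filename)  # do stuff with matching file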
I'm not near a machine with Python installed on it to test the code, but it should be something along those lines.
You can use a regular expression:
/FILE\d+TEST\.txt/
Example: regexr.com.
Then you can use said regular expression and iterate through all of the files in a directory.
import re
import os
filename_re = r'FILE\d+TEST\.txt'
directory = '.'  # assumed: the directory to scan
for filename in os.listdir(directory):
    if re.search(filename_re, filename):
        # this file has the form FILEnTEST.txt
        # do what you want with it now
        print(filename)
You can also do it as such:
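import os
import re
directory = '.'  # assumed: the directory to scan
# 'regex' stands for your own pattern
if len([file for file in os.listdir(directory) if re.search('regex', file)]):
    # there's at least 1 such file
    print('found')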
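Putting these answers together, here is a sketch of an existence check. The function names are made up for illustration. Note that \d+ also accepts 0 and leading zeros, which is slightly looser than "any positive integer".

```python
import os
import re

# FILE, one or more digits, then TEST.txt with a literal dot.
FILE_N_TEST = re.compile(r'FILE\d+TEST\.txt')


def has_numbered_file(names):
    """Return True if any name has the form FILEnTEST.txt."""
    # any() stops at the first hit instead of building a full list.
    return any(FILE_N_TEST.fullmatch(name) for name in names)


def numbered_file_exists(directory):
    """Run the check against the entries of a real directory."""
    return has_numbered_file(os.listdir(directory))
```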

How to get the substring from a String in python

I have a string path='/home/user/Desktop/My_file.xlsx'.
I want to extract the "My_file" substring. I am using Django framework for python.
I have tried to get it with:
re.search('/(.+?).xlsx', path).group(1)
but it returns the whole path again.
Can someone please help.
If you know that the file extension is always the same (e.g. ".xlsx"), I would suggest doing it this way:
import os
filename_full = os.path.basename(path)
filename = filename_full.split(".xlsx")[0]
Hope it helps
More generally:
import os
filename = os.path.basename(os.path.splitext(path)[0])
If you need to match the exact extension:
import re
# (?<=/) ensures that the match is preceded by /
# [^/]*\.xlsx matches anything but / followed by a literal .xlsx
mo1 = re.search(r'(?<=/)[^/]*\.xlsx', path).group(0)
print(mo1)
My_file.xlsx
otherwise:
path='/home/user/Desktop/My_file.xlsx'
with regex:
mo = re.search(r'(?<=/)([\w.]+$)',path)
print(mo.group(1))
My_file.xlsx
with rsplit:
my_file = path.rsplit('/', 1)[-1]
print(my_file)
My_file.xlsx
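Several of the snippets above return My_file.xlsx rather than the My_file the question asks for. A short pathlib sketch that returns the name without its extension follows; the helper name file_stem is made up, and PurePosixPath is used on the assumption that the path is a POSIX-style string like the one in the question.

```python
from pathlib import PurePosixPath


def file_stem(path):
    """Return the final path component without its last extension."""
    # .name would give 'My_file.xlsx'; .stem drops the '.xlsx' suffix.
    return PurePosixPath(path).stem
```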

remove part of path

I have the following data:
/share/Downloads/Videos/Movies/Big.Buck.Bunny.720p.Bluray.x264-BLA.torrent/Big.Buck.Bunny.720p.Bluray.x264-BLA
However, I don't want to have "Big.Buck.Bunny.720p.Bluray.x264-BLA.torrent/" in it. I want the path to be like:
/share/Downloads/Videos/Movies/Big.Buck.Bunny.720p.Bluray.x264-BLA
With regular expressions I basically want to match anything that holds *.torrent./. How can I accomplish this with a regex?
Thanks!
You don't even need regular expressions. You can use os.path.dirname and os.path.basename:
os.path.join(os.path.dirname(os.path.dirname(path)),
             os.path.basename(path))
where path is the original path to the file.
Alternatively, you can also use os.path.split as follows:
dirname, filename = os.path.split(path)
os.path.join(os.path.dirname(dirname), filename)
Note: this works under the assumption that what you want to remove is the name of the directory that contains the file, as in the example in the question.
You can do this without using regexp:
>>> x = '/share/Downloads/Videos/Movies/Big.Buck.Bunny.720p.Bluray.x264-BLA.torrent/Big.Buck.Bunny.720p.Bluray.x264-BLA'
>>> x.rfind('.torrent')
66
>>> x[:x.rfind('.torrent')]
'/share/Downloads/Videos/Movies/Big.Buck.Bunny.720p.Bluray.x264-BLA'
I basically want to match anything that holds *.torrent./, how can I accomplish this in regexp?
You can use:
[^/]*\.torrent/
Assuming the last . was a typo.
Given path='/share/Downloads/Videos/Movies/Big.Buck.Bunny.720p.Bluray.x264-BLA.torrent/Big.Buck.Bunny.720p.Bluray.x264-BLA'
You can do it with regular expression as
re.sub(r"/[^/]*\.torrent/", "/", path)
You can also do it without regex as
'/'.join(x for x in path.split("/") if x.find("torrent") == -1)
Your question is a bit vague, but here is one way to strip off what you want:
import re
s = "/share/Downloads/Videos/Movies/Big.Buck.Bunny.720p.Bluray.x264-BLA.torrent/Big.Buck.Bunny.720p.Bluray.x264-BLA"
c = re.compile(r"(/.*/).*?torrent/(.*)")
m = c.match(s)
path = m.group(1)
file = m.group(2)
print(path + file)
/share/Downloads/Videos/Movies/Big.Buck.Bunny.720p.Bluray.x264-BLA
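The same idea can also be written with pathlib. This is a sketch under the assumption that only the directory directly above the file should be dropped, and only when its name ends in .torrent; the helper name drop_torrent_dir is made up for illustration.

```python
from pathlib import PurePosixPath


def drop_torrent_dir(path):
    """Remove the parent directory from path if it is a *.torrent directory."""
    p = PurePosixPath(path)
    if p.parent.name.endswith('.torrent'):
        # Re-attach the file name to the grandparent directory.
        return str(p.parent.parent / p.name)
    return path  # leave other paths untouched
```

Unlike the substring-based versions, this leaves paths alone when "torrent" merely appears somewhere else in them.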
