Look for files not having a given extension - python

I've tried the following code:
import re
regobj = re.compile(r"^.+\.(oth|xyz)$")
for test in ["text.txt", "other.oth", "abc.xyz"]:
    if regobj.match(test):
        print("Method 1:", test)
regobj = re.compile(r"^.+\.[^txt]$")
for test in ["text.txt", "other.oth", "abc.xyz"]:
    if regobj.match(test):
        print("Method 2:", test)
I would like the second method to find any file that does not have the extension txt, but my approach does not work. What am I doing wrong?

Regular expressions are overkill here. Use the str.endswith() method:
if not test.endswith('.txt'):
    print(test)
Your regular expression uses a negated character class, [^txt], which matches exactly one character that is not a t or an x; it says nothing about the extension as a whole. You could instead have matched .txt explicitly and used not to exclude rather than include:
regobj = re.compile(r"^.+\.txt$")
if not regobj.match(test):
    print(test)
If all you can use is regular expressions, use a negative look-ahead assertion:
regobj = re.compile(r"^[^.]+\.(?!txt$)[^.]+$")
Here (?!txt$) only matches at positions that are not followed by a literal txt running up to the end of the string. The [^.]+ then matches one or more characters that are not a . character, up to the end of the string.
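To see the difference between the negated character class and the look-ahead, a minimal sketch along these lines can be run against the filenames from the question. The variable names (names, char_class, lookahead and the by_* lists) are illustrative and not part of the original posts.

```python
import re

names = ["text.txt", "other.oth", "abc.xyz"]

# [^txt] is a character class: exactly ONE character that is not 't' or 'x'.
char_class = re.compile(r"^.+\.[^txt]$")
# (?!txt$) is a look-ahead: the extension as a whole must not be "txt".
lookahead = re.compile(r"^[^.]+\.(?!txt$)[^.]+$")

by_char_class = [n for n in names if char_class.match(n)]
by_lookahead = [n for n in names if lookahead.match(n)]
by_endswith = [n for n in names if not n.endswith(".txt")]

print(by_char_class)  # []
print(by_lookahead)   # ['other.oth', 'abc.xyz']
print(by_endswith)    # ['other.oth', 'abc.xyz']
```

The character-class version finds nothing because every test name has a three-character extension, while the pattern allows only a single character after the dot.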

Change the second regex to:
regobj = re.compile(r"^.+\.(?!txt$)[^.]+$")
[^txt] matches any single character that is not t or x. (?!txt$) asserts that the dot is not followed by txt and the end of the string. The [^.]+ after \. requires at least one character after the dot. So this matches filenames that have any extension other than .txt.

As Martijn Pieters mentioned, regex is overkill here, considering there are other, simpler ways:
fileName, fileExt = os.path.splitext(string)
Using splitext, it's simple to isolate the extension.
import os
fileList = ["text.txt", "other.oth", "abc.xyz"]
matchExt = ".txt"
for eachFile in fileList:
    fileName, fileExt = os.path.splitext(eachFile)
    # compare the whole extension; a substring test would also accept ".txtx"
    if fileExt != matchExt:
        print("(not %s) %s %s" % (matchExt, fileExt, fileName))
You can easily add an else branch to handle the matching extension, which I'll leave up to you.
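As a small sketch of how the splitext approach behaves on less tidy names, the helper below wraps it in a function. The name lacks_extension and the extra test filenames are made up for illustration; the case-insensitive comparison is an assumption about what is wanted, not something the question asked for.

```python
import os

def lacks_extension(filename, ext=".txt"):
    """Return True if filename's extension differs from ext (case-insensitive)."""
    return os.path.splitext(filename)[1].lower() != ext.lower()

files = ["text.txt", "NOTES.TXT", "other.oth", "archive.tar.gz", "txt"]

# splitext only looks at the last dot, so "archive.tar.gz" has extension ".gz"
# and a name without any dot ("txt") has an empty extension.
not_txt = [f for f in files if lacks_extension(f)]
print(not_txt)  # ['other.oth', 'archive.tar.gz', 'txt']
```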

Related

I want to split file’s extension by using regular expression

item is a string like "./test/test1.csv".
I want to change item into "test1".
I wrote this code:
item=re.search('./*.csv',item)
But then item matches "1.csv". I really cannot understand why this happens. What should I do to get the result I want?
As a regex, './*.csv' does not mean what you think it does. A . means "any char" and a * means "zero or more of what came before". Thus, it's not "dot, slash, any string, dot, csv", but "any char, zero or more slashes, any char, csv".
If you really want to use a regex, you could try, e.g., this (among many other variants):
>>> re.search(r"([^/]+)\.[^\.]+$", p).group(1)
'test1'
Or just use str.split and rsplit:
>>> p.rsplit("/", 1)[-1].split(".")[0]
'test1'
Or, since you are handling file paths, how about os.path?
>>> os.path.splitext(os.path.split(p)[1])[0]
'test1'
Usually you want the os.path module from the standard library for this kind of filename handling.
import os.path
print(os.path.splitext(os.path.basename('./test/test1.csv'))[0])
In your regular expression version of this, remember that . matches any character (not just periods), x* matches any number of x's (even zero), and that re.search returns a match if the pattern matches anywhere in the string: your regular expression matches whenever the string contains the letters "csv" preceded by at least two characters. A correct regular expression implementation might be
import re
print(re.search(r'/([^/.]+)\.[^/]+$', './test/test1.csv')[1])
(matching a slash, at least one character that is neither a period nor a slash, a period, at least one character that is not a slash, and end of string). (IMHO os.path is more readable and maintainable.)
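For comparison, here is a short sketch that reproduces the surprising "1.csv" result from the question and sets it next to a pathlib-based alternative. pathlib is not mentioned in the answers above; it is swapped in here as another standard-library option, and the variable names are illustrative.

```python
import re
from pathlib import PurePosixPath

item = "./test/test1.csv"

# The question's pattern: any char, zero or more slashes, any char, "csv".
# The only place that fits is "1" + "." + "csv".
broken = re.search('./*.csv', item).group(0)
print(broken)  # 1.csv

# PurePosixPath.stem is the final path component without its last suffix.
stem = PurePosixPath(item).stem
print(stem)  # test1
```

PurePosixPath is used so that the forward-slash path is parsed the same way on every platform.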
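This will also work; note that basename alone keeps the extension, so splitext is needed to drop it:
import ntpath
path = ".../test/test1.csv"
file_name = ntpath.basename(path)  # 'test1.csv'
file_name = ntpath.splitext(file_name)[0]  # 'test1'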

How to test filename for occurrence of substring in Python?

I want to test whether my filename ends in 'pos.txt' or 'neg.txt' in Python.
So for example, if my filename is 'samplepaths5000neg.txt'--how can I test for the ending? I've tried various Python regex expressions but I can't seem to get it correct. It's for a script that runs different actions depending on the filename.
The regex for end of string is the $ symbol.
So if you did want to use a regex for this example, it would be:
(?:pos|neg)\.txt$
The (?: allows the creation of a group that isn't reported. Omitting the ?: would cause the group to be captured, which you could use to find out whether the file name ended in pos or neg.
import re
file_name = "samplepaths5000neg.txt"
# re.search scans the whole string; re.match would only try at position 0
correct_ending = re.search(r'(?:pos|neg)\.txt$', file_name) is not None
or, if you want to capture the ending:
ending_result = re.search(r'(pos|neg)\.txt$', file_name)
ending = ending_result.group(1) if ending_result is not None else ''
Shown matched against different filenames:
https://regex101.com/r/9TiFMr/1
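Since the goal is only to test how the name ends, the same check can be sketched without a regex by using str.endswith, which accepts a tuple of suffixes. The function name ending_kind is hypothetical and chosen for illustration.

```python
def ending_kind(file_name):
    """Return 'pos' or 'neg' if file_name ends in pos.txt / neg.txt, else None."""
    for kind in ("pos", "neg"):
        if file_name.endswith(kind + ".txt"):
            return kind
    return None

file_name = "samplepaths5000neg.txt"

# endswith with a tuple is True if any of the suffixes matches.
correct_ending = file_name.endswith(("pos.txt", "neg.txt"))
print(correct_ending)          # True
print(ending_kind(file_name))  # neg
```

A script that runs different actions per file type could branch on the value returned by ending_kind.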

Match Naming Convention of Text File Regex

I am trying to find text files with a certain naming convention using regex, but have so far been unsuccessful.
The naming convention is file_[year]-[month]-[day].txt (eg. file_2010-09-15.txt).
Here is what I have so far: ^(file_)[0-9]{4}[-][0-9]{2}[-][0-9]{2}(\.txt)$
I'm trying to use it in my code like this:
for text_file in os.listdir(path):
    if fnmatch.fnmatch(text_file, '^(file_)[0-9]{4}[-][0-9]{2}[-][0-9]{2}(\.txt)$'):
        # print number of files found
I think the issue is the pattern type that fnmatch expects. The documentation states the following:
This module provides support for Unix shell-style wildcards, which are not the same as regular expressions (which are documented in the re module). The special characters used in shell-style wildcards are:
Pattern   Meaning
*         matches everything
?         matches any single character
[seq]     matches any character in seq
[!seq]    matches any character not in seq
You could keep using fnmatch and just rewrite the pattern in that style, i.e.:
for text_file in os.listdir(path):
    if fnmatch.fnmatch(text_file, 'file_[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9].txt'):
        print(text_file)  # or count the files found
Or, what I would suggest, use re.match like so:
regex = re.compile(r'^(file_)[0-9]{4}[-][0-9]{2}[-][0-9]{2}(\.txt)$')
for text_file in os.listdir(path):
    if regex.match(text_file):
        print(text_file)
fnmatch translates the shell-style pattern into a regular expression and hands it to the re module. Take a look at the source code of fnmatch. Basically, the wildcards supported are:
Patterns are Unix shell style:
* matches everything
? matches any single character
[seq] matches any character in seq
[!seq] matches any char not in seq
Your pattern should be: 'file_[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9].txt'
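The translation step can be inspected with fnmatch.translate, which returns the regular expression that fnmatch builds from a shell-style pattern. This is a minimal sketch; the exact text of the translated regex differs between Python versions, so it is printed but not relied upon.

```python
import fnmatch
import re

pattern = "file_[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9].txt"

# translate() converts the shell-style pattern into regex source text.
translated = fnmatch.translate(pattern)
print(translated)  # exact output varies between Python versions

regex = re.compile(translated)
print(bool(regex.match("file_2010-09-15.txt")))  # True
print(bool(regex.match("file_2010-9-15.txt")))   # False

# fnmatchcase applies the same pattern without case normalisation.
print(fnmatch.fnmatchcase("file_2010-09-15.txt", pattern))  # True
```

Note that in a shell-style pattern the dot before txt is a literal dot, so it needs no escaping.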
Alternatively, you can use re directly, without fnmatch (take the code below as a starting point, but there is room for improvement: check whether the year is a valid year, the month is between 1 and 12, and the day is between 1 and 28, 29, 30, or 31):
import re
example_file = 'file_2010-09-15.txt'
# raw string so that \d and \. reach the regex engine unchanged
myregex = r'file_\d\d\d\d-\d\d-\d\d\.txt'
result = re.match(myregex, example_file)
print(result.group(0))
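One possible way to add the date validation mentioned above is to let the regex check only the shape of the name and hand the captured date to datetime.strptime, which rejects impossible dates. This is a sketch under that assumption; NAME_RE and parse_file_date are hypothetical names.

```python
import re
from datetime import datetime

# The regex checks the shape only; the capture group holds the date part.
NAME_RE = re.compile(r"file_(\d{4}-\d{2}-\d{2})\.txt")

def parse_file_date(name):
    """Return the date encoded in name, or None if the name or date is invalid."""
    m = NAME_RE.fullmatch(name)
    if m is None:
        return None
    try:
        # strptime raises ValueError for dates such as 2010-02-30.
        return datetime.strptime(m.group(1), "%Y-%m-%d").date()
    except ValueError:
        return None

print(parse_file_date("file_2010-09-15.txt"))  # 2010-09-15
print(parse_file_date("file_2010-02-30.txt"))  # None
```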

How can I write a regular expression to replace hash-like strings

There are some Windows files and folders with names like:
c:\windows\serviceprofiles\localservice\appdata\locallow\microsoft\cryptneturlcache\metadata\57c8edb95df3f0ad4ee2dc2b8cfd4157
c:\windows\system32\config\systemprofile\appdata\locallow\microsoft\cryptneturlcache\metadata\be7ffd2fd84d3b32fd43dc8f575a9f28
c:\windows\softwaredistribution\download\ab1b092b40dee3ba964e8305ecc7d0d9
Notice how they end with a string that looks like a hash:
57c8edb95df3f0ad4ee2dc2b8cfd4157, be7ffd2fd84d3b32fd43dc8f575a9f28,
ab1b092b40dee3ba964e8305ecc7d0d9
I am not good with regex and I would like to know if there is a way to write a regex that would replace these hash-like names within a path with something like
"##HASH##"
The paths do not necessarily end with these, as these are usually folders/subfolders containing other folders of their own.
So my goal is to essentially get a path looking like:
c:\windows\serviceprofiles\localservice\appdata\locallow\microsoft\cryptneturlcache\metadata\57c8edb95df3f0ad4ee2dc2b8cfd4157\some_subfolder\some_file.inf
to become:
c:\windows\serviceprofiles\localservice\appdata\locallow\microsoft\cryptneturlcache\metadata\##HASH##\some_subfolder\some_file.inf
Is there a way to do that in Python?
Thanks in advance.
If you noticed, the "hashes" are 32 characters long. If this is true for all of them, then the regex is pretty straightforward.
For example, with the last string you posted:
import re
text = r'c:\windows\serviceprofiles\localservice\appdata\locallow\microsoft\cryptneturlcache\metadata\57c8edb95df3f0ad4ee2dc2b8cfd4157\some_subfolder\some_file.inf'
res = re.sub(r'\w{32}', '##HASH##', text)
print(res)
prints:
c:\windows\serviceprofiles\localservice\appdata\locallow\microsoft\cryptneturlcache\metadata\##HASH##\some_subfolder\some_file.inf
Notice the r prefix on the path: in a raw string the backslashes are kept literally. Without it, Python would read \a as a bell character and \57 as an octal escape, which silently corrupts the path.
The \w{32} regex means "match any word character exactly 32 times".
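Because \w also matches underscores and any letter, a stricter variant can be sketched that accepts only runs of exactly 32 hexadecimal digits. The names HASH_RE and mask_hashes are hypothetical, and the file name appended to the example path is made up for illustration.

```python
import re

# Exactly 32 hex digits, not preceded or followed by another hex digit.
HASH_RE = re.compile(r"(?<![0-9a-f])[0-9a-f]{32}(?![0-9a-f])", re.IGNORECASE)

def mask_hashes(path):
    """Replace every 32-digit hex component in path with ##HASH##."""
    return HASH_RE.sub("##HASH##", path)

# Raw string, so the backslashes are kept as they are.
path = r"c:\windows\softwaredistribution\download\ab1b092b40dee3ba964e8305ecc7d0d9\some_file.inf"
print(mask_hashes(path))
# c:\windows\softwaredistribution\download\##HASH##\some_file.inf
```

The look-behind and look-ahead keep longer hex runs, such as 40-character names, from being partially replaced.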
This might help:
import ntpath
import re
uuid = re.compile(r'[0-9a-f]{32}\Z', re.I)
# raw string, so \a and \57 are not treated as escape sequences
A = r"c:\windows\serviceprofiles\localservice\appdata\locallow\microsoft\cryptneturlcache\metadata\57c8edb95df3f0ad4ee2dc2b8cfd4157\sub_folder"
path = ntpath.normpath(A)
path = path.split("\\")
path = "\\".join(["##"+i+"##" if uuid.match(i) else i for i in path])
print(path)
Result:
c:\windows\serviceprofiles\localservice\appdata\locallow\microsoft\cryptneturlcache\metadata\##57c8edb95df3f0ad4ee2dc2b8cfd4157##\sub_folder
Note: I am compiling for 32 characters in length. You can modify that value in re.compile. To get the placeholder from the question, replace "##"+i+"##" with "##HASH##".

Regular expression to match start of filename and filename extension

What is the regular expression to match strings (in this case, file names) that start with 'Run' and have a filename extension of '.py'?
The regular expression should match any of the following:
RunFoo.py
RunBar.py
Run42.py
It should not match:
myRunFoo.py
RunBar.py1
Run42.txt
The SQL equivalent of what I am looking for is ... LIKE 'Run%.py' ....
For a regular expression, you would use:
re.match(r'Run.*\.py$', filename)
A quick explanation:
. means match any character.
* means match zero or more repetitions of the previous item (hence .* means any sequence of characters)
\ escapes the dot so that it matches a literal dot
$ indicates "end of the string", so we don't match "Run_foo.py.txt"
However, for this task, you're probably better off using simple string methods, i.e.:
filename.startswith("Run") and filename.endswith(".py")
Note: if you want case insensitivity (i.e. matching "run.PY" as well as "Run.py"), use the re.I option to the regular expression, or convert to a specific case (e.g. filename.lower()) before using string methods.
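Both approaches can be checked against the example names from the question with a short sketch like the following. The helper names by_regex and by_str are made up for illustration.

```python
import re

RUN_PY = re.compile(r"Run.*\.py$")

should_match = ["RunFoo.py", "RunBar.py", "Run42.py"]
should_not_match = ["myRunFoo.py", "RunBar.py1", "Run42.txt"]

def by_regex(name):
    # re.match anchors at the start of the string, so no leading ^ is needed.
    return RUN_PY.match(name) is not None

def by_str(name):
    return name.startswith("Run") and name.endswith(".py")

for check in (by_regex, by_str):
    print(check.__name__,
          all(check(n) for n in should_match),
          not any(check(n) for n in should_not_match))
```

Each line of output shows the helper name followed by whether it accepted every wanted name and rejected every unwanted one.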
I don't really understand why you're after a regular expression to solve this 'problem'. You're just after a way to find all .py files that start with 'Run'. So this is a simple solution that will work, without resorting to compiling and running a regular expression:
import os
for filename in os.listdir(dirname):
    root, ext = os.path.splitext(filename)
    if root.startswith('Run') and ext == '.py':
        print(filename)
Warning:
jobscry's answer ("^Run.?.py$") is incorrect (will not match "Run123.py", for example).
orlandu63's answer ("/^Run[\w]*?.py$/") will not match "RunFoo.Bar.py".
(I don't have enough reputation to comment, sorry.)
/^Run.*\.py$/
Or, in python specifically:
import re
re.match(r"^Run.*\.py$", stringtocheck)
This will match "Runfoobar.py", but not "runfoobar.PY". To make it case insensitive, instead use:
re.match(r"^Run.*\.py$", stringtocheck, re.I)
You don't need a regular expression; you can use glob, which takes wildcards, e.g. Run*.py
For example, to get those files in your current directory...
import os, glob
# os.path.join picks the right separator for the current platform
files = glob.glob(os.path.join(os.getcwd(), "Run*.py"))
If you write a slightly more complex regular expression, you can get an extra feature: extract the bit between "Run" and ".py":
>>> import re
>>> regex = r'^Run(?P<name>.*)\.py$'
>>> m = re.match(regex, 'RunFoo.py')
>>> m.group('name')
'Foo'
(the extra bit is the parentheses and everything between them, except for '.*' which is as in Rob Howard's answer)
This probably doesn't fully comply with file-naming standards, but here it goes:
/^Run[\w]*?\.py$/
maybe:
^Run.*\.py$
just a quick try
