Regular expression to match start of filename and filename extension

What is the regular expression to match strings (in this case, file names) that start with 'Run' and have a filename extension of '.py'?
The regular expression should match any of the following:
RunFoo.py
RunBar.py
Run42.py
It should not match:
myRunFoo.py
RunBar.py1
Run42.txt
The SQL equivalent of what I am looking for is ... LIKE 'Run%.py' ....

For a regular expression, you would use:
re.match(r'Run.*\.py$', filename)
A quick explanation:
. means match any character.
* means match any repetition of the previous character (hence .* means any sequence of chars)
\ is an escape to escape the explicit dot
$ indicates "end of the string", so we don't match "Run_foo.py.txt"
However, for this task, you're probably better off using simple string methods, i.e.
filename.startswith("Run") and filename.endswith(".py")
Note: if you want case insensitivity (i.e. matching "run.PY" as well as "Run.py"), use the re.I option to the regular expression, or convert to a specific case (e.g. filename.lower()) before using string methods.
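To check that the two approaches agree on the names from the question, here is a minimal sketch. The variable names (names, by_regex, by_methods) are made up for illustration.

```python
import re

# File names taken from the question: the first three should match.
names = ["RunFoo.py", "RunBar.py", "Run42.py",
         "myRunFoo.py", "RunBar.py1", "Run42.txt"]

# re.match anchors at the start of the string, so no leading ^ is needed.
pattern = re.compile(r"Run.*\.py$")
by_regex = [n for n in names if pattern.match(n)]

# The same test with plain string methods.
by_methods = [n for n in names if n.startswith("Run") and n.endswith(".py")]

print(by_regex)
print(by_methods)
```

Both lists should contain only RunFoo.py, RunBar.py and Run42.py.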

I don't really understand why you're after a regular expression to solve this 'problem'. You're just after a way to find all .py files that start with 'Run'. So this is a simple solution that will work, without resorting to compiling and running a regular expression:
import os
dirname = '.'  # directory to scan; adjust as needed
for filename in os.listdir(dirname):
    root, ext = os.path.splitext(filename)
    if root.startswith('Run') and ext == '.py':
        print(filename)

Warning:
jobscry's answer ("^Run.?.py$") is incorrect (will not match "Run123.py", for example).
orlandu63's answer ("/^Run[\w]*?.py$/") will not match "RunFoo.Bar.py".
(I don't have enough reputation to comment, sorry.)

/^Run.*\.py$/
Or, in python specifically:
import re
re.match(r"^Run.*\.py$", stringtocheck)
This will match "Runfoobar.py", but not "runfoobar.PY". To make it case insensitive, instead use:
re.match(r"^Run.*\.py$", stringtocheck, re.I)

You don't need a regular expression, you can use glob, which takes wildcards e.g. Run*.py
For example, to get those files in your current directory...
import glob
import os
# os.path.join picks the right separator instead of hard-coding "\\"
files = glob.glob(os.path.join(os.getcwd(), "Run*.py"))

If you write a slightly more complex regular expression, you can get an extra feature: extract the bit between "Run" and ".py":
>>> import re
>>> regex = r'^Run(?P<name>.*)\.py$'
>>> m = re.match(regex, 'RunFoo.py')
>>> m.group('name')
'Foo'
(the extra bit is the parentheses and everything between them, except for '.*' which is as in Rob Howard's answer)

This probably doesn't fully comply with file-naming standards, but here it goes:
/^Run[\w]*?\.py$/

Maybe:
^Run.*\.py$
just a quick try

Related

I want to split file’s extension by using regular expression

item is a string like "./test/test1.csv".
I want to change item into "test1".
I wrote this code:
item=re.search('./*.csv',item)
But the match is "1.csv". I really cannot understand why this happens. What should I do to get the result I want?

As a regex, './*.csv' does not mean what you think it does. A . means "any char" and a * means "zero or more of what came before". Thus, it's not "dot, slash, any string, dot, csv", but "any char, zero or more slashes, any char, csv".
If you really want to use a regex, you could try, e.g., this (among many other variants):
>>> re.search(r"([^/]+)\.[^\.]+$", p).group(1)
'test1'
Or just use str.split and rsplit:
>>> p.rsplit("/", 1)[-1].split(".")[0]
'test1'
Or, since you are handling file paths, how about os.path?
>>> os.path.splitext(os.path.split(p)[1])[0]
'test1'

Usually you want the os.path module from the standard library for this kind of filename handling.
import os.path
print(os.path.splitext(os.path.basename('./test/test1.csv'))[0])
In your regular expression version of this, remember that . matches any character (not just periods), x* matches any number of x's (even zero), and that re.search succeeds if the pattern matches anywhere in the string: your regular expression matches whenever the string contains "csv" with at least two characters before it. A correct regular expression implementation might be
import re
print(re.search(r'/([^/.]+)\.[^/]+$', './test/test1.csv')[1])
(matching a slash, at least one character that is neither a period nor a slash, a period, at least one character that is not a slash, and end of string). (IMHO os.path is more readable and maintainable.)

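This will work (basename alone returns "test1.csv", so splitext is needed to drop the extension):
import ntpath
path = ".../test/test1.csv"
file_name = ntpath.splitext(ntpath.basename(path))[0]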

Look for files not having a given extension

I've tried the following code.
import re
regobj = re.compile(r"^.+\.(oth|xyz)$")
for test in ["text.txt", "other.oth", "abc.xyz"]:
    if regobj.match(test):
        print("Method 1:", test)
regobj = re.compile(r"^.+\.[^txt]$")
for test in ["text.txt", "other.oth", "abc.xyz"]:
    if regobj.match(test):
        print("Method 2:", test)
I would like the second method to find any file that does not have the extension txt, but my attempt does not work. What am I doing wrong?

Regular expressions are overkill here. Use the str.endswith() method:
if not test.endswith('.txt'):
Your regular expression uses a negative character class, which is a set of characters that should not be matched. Anything that is not a t or x will satisfy that test. You could have explicitly matched .txt and used not to exclude rather than include:
regobj = re.compile(r"^.+\.txt$")
if not regobj.match(test):
If all you can use is regular expressions, use a negative look-ahead assertion:
regobj = re.compile(r"^[^.]+\.(?!txt$)[^.]+$")
Here (?!...) only matches locations where there is no literal txt following, all the way to the end of the string. The [^.]+ then matches any number of characters that is not a . character until the end of the string.

Change the second regex to,
regobj = re.compile(r"^.+\.(?!txt$)[^.]+$")
[^txt] matches any single character that is not t or x. (?!txt$) asserts that the dot is not followed by txt and then the end of the string. The [^.]+ after \. requires at least one character after the dot. So this matches filenames that have any extension other than .txt.
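The two look-ahead patterns suggested in the answers differ on names with more than one dot. The following sketch compares them; the labels strict and loose and the extra test names are made up for illustration.

```python
import re

# Pattern from the first answer: exactly one dot allowed in the name.
strict = re.compile(r"^[^.]+\.(?!txt$)[^.]+$")
# Pattern from the second answer: any prefix, only the last extension is checked.
loose = re.compile(r"^.+\.(?!txt$)[^.]+$")

names = ["text.txt", "other.oth", "abc.xyz",
         "archive.tar.gz", "notes.txt.bak", "README"]

strict_hits = [n for n in names if strict.match(n)]
loose_hits = [n for n in names if loose.match(n)]

print(strict_hits)
print(loose_hits)
```

Neither pattern matches README: both require a dot, so files without any extension are not reported as "not .txt". If those should count as well, not name.endswith('.txt') is the simpler test.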

As Martijn Pieters mentioned, regex is overkill, considering there are other more efficient ways:
fileName, fileExt = os.path.splitext(string)
Using splitext it's simple to isolate the extension.
import os
fileList = ["text.txt", "other.oth", "abc.xyz"]
matchExt = ".txt"
for eachFile in fileList:
    fileName, fileExt = os.path.splitext(eachFile)
    # compare the whole extension; a substring test would also skip ".tx" or ""
    if fileExt != matchExt:
        print("(not %s) %s %s" % (matchExt, fileExt, fileName))
You can easily add an else statement to match other extensions, which I'll leave up to you.

python glob2/formic style recursive wildcard pattern search for lists

I need a glob2- or Formic-like solution to search a large list of directories stored in a text file (the files aren't on my machine; the file list is produced by an external process I cannot directly access or query).
pseudo example:
# read the large directory list in memory
data = []
with open('C:\\log_file.txt', 'r') as log:
    data = log.readlines()
# query away!
query1 = listglob(data,'/**/fnord/*/log.*')
query2 = listglob(data,'/usr/*/model_*/fnord/**')
Unless someone has a suggestion, my next step is to open up glob2 and formic and see if one of them can be changed to accept a list instead of a root folder to be "os.walked".

I would recommend using regular expressions. Ultimately, both Formic and glob use an OS call to perform the actual glob matching. So, if you want to modify either, you're going to have to write a RE matcher (or similar) in any case. So, cut out the middle-man and go straight to REs. (It pains me to say that because I'm the author of Formic).
The basic plan is to write a function that takes in your glob and returns a regular expression. Here are some pointers:
Escape ., - and other RE reserved characters in your globs. E.g. . becomes \.
A ? in a glob file/directory becomes [^/] (matches a single character that's not a /)
A * in a glob file/directory name as a regular expression is [^/]*
A /*/ glob as a regular expression is: /[^/]+/
A /**/ glob as a regular expression is: /([^/]+/)*
To match a whole line, start the RE with a ^ and end it with $. This forces the RE to expand over the whole string.
While I listed the substitutions in order of increasing complexity, it's probably a good idea to do the substitutions in the following order:
Special RE characters that are not globs (., -, '$', etc)
?
/**/
/*/
*
This way you won't corrupt the /**/ when substituting for a single *.
In your question you have: /**/fnord/*/log.*. This would map to:
^/([^/]+/)*fnord/[^/]+/log\.[^/]*$
Once you've built your RE, then finding matches is a simple exercise.
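The substitution rules above can be sketched as a small translator. This is only an illustration under stated assumptions, not Formic's or glob2's implementation: glob_to_regex is a hypothetical helper, listglob is the name from the question, and bracket expressions such as [abc] are not handled.

```python
import re

def glob_to_regex(pattern):
    """Hypothetical helper: translate a glob using *, ? and ** into an
    anchored regular expression. Bracket expressions are not supported."""
    out = []
    i = 0
    n = len(pattern)
    while i < n:
        # "/**" followed by "/" or the end: zero or more directory levels.
        if pattern.startswith('/**', i) and (i + 3 == n or pattern[i + 3] == '/'):
            out.append('(?:/[^/]+)*')
            i += 3
        # "/*/": exactly one non-empty directory level (trailing "/" is kept).
        elif pattern.startswith('/*/', i):
            out.append('/[^/]+')
            i += 2
        elif pattern[i] == '*':
            out.append('[^/]*')
            i += 1
        elif pattern[i] == '?':
            out.append('[^/]')
            i += 1
        else:
            # Everything else is literal; escape RE special characters.
            out.append(re.escape(pattern[i]))
            i += 1
    return re.compile('^' + ''.join(out) + '$')

def listglob(paths, pattern):
    """Return the paths (trailing newlines stripped) that match the glob."""
    regex = glob_to_regex(pattern)
    stripped = (p.rstrip('\n') for p in paths)
    return [p for p in stripped if regex.match(p)]

# Sample lines as readlines() might return them (made-up paths).
data = ['/a/b/fnord/c/log.txt\n', '/a/fnord/log.txt\n', '/fnord/c/log.1\n',
        '/usr/x/model_1/fnord/a/b.txt\n', '/usr/x/y/model_1/fnord/a\n']
query1 = listglob(data, '/**/fnord/*/log.*')
query2 = listglob(data, '/usr/*/model_*/fnord/**')
print(query1)
print(query2)
```

Scanning the glob left to right and testing the longest token first has the same effect as the substitution order recommended above: a ** is consumed before a single * can corrupt it.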

In the end I used one of glob2's functions, like so:
import glob2
def listglob(data, pattern):
    # filter the list passed in (the original referenced an undefined name, items)
    return [x for x in data if glob2.fnmatch.fnmatch(x, pattern)]

I don't think glob2.fnmatch.fnmatch is equivalent to the glob2 ** syntax.
From reading the source code, it appears to be equivalent to the plain fnmatch syntax.
Also, Andrew's answer doesn't cover square brackets, e.g. the [!abc] form.

De-greedifying a regular expression in python

I'm trying to write a regular expression that will convert a full path filename to a short filename for a given filetype, minus the file extension.
For example, I'm trying to get just the name of the .bar file from a string using
re.search('/(.*?)\.bar$', '/def_params/param_1M56/param/foo.bar')
According to the Python re docs, *? is the ungreedy version of *, so I was expecting to get
'foo'
returned for match.group(1) but instead I got
'def_params/param_1M56/param/foo'
What am I missing here about greediness?

What you're missing isn't so much about greediness as about regular expression engines: they work from left to right, so the / matches as early as possible and the .*? is then forced to work from there. In this case, the best regex doesn't involve greediness at all (you need backtracking for that to work; it will, but could take a really long time to run if there are a lot of slashes), but a more explicit pattern:
'/([^/]*)\.bar$'
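A short sketch comparing the non-greedy pattern from the question with the explicit one, using the path from the question. The variable names are made up for illustration.

```python
import re

path = '/def_params/param_1M56/param/foo.bar'

# Non-greedy, but the leading "/" still matches at the first slash,
# so the group has to stretch all the way to ".bar".
lazy = re.search(r'/(.*?)\.bar$', path).group(1)

# Excluding "/" from the group forces the match to start at the last slash.
explicit = re.search(r'/([^/]*)\.bar$', path).group(1)

print(lazy)      # def_params/param_1M56/param/foo
print(explicit)  # foo
```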

I would suggest changing your regex so that it doesn't rely on greediness.
You want only the filename before the extension .bar and everything after the final /. This should do:
re.search(r'/[^/]*\.bar$', '/def_params/param_1M56/param/foo.bar')
What this does is match /, then zero or more characters (as many as possible) that are not /, and then .bar.

I don't claim to understand the non-greedy operators all that well, but a solution for that particular problem would be to use ([^/]*?)

The regular expression engine scans from the left, so the / matches the first slash. Put a greedy .* at the start and it should work.

I like regex, but there is no need for one here.
path = '/def_params/param_1M56/param/foo.bar'
print(path.rsplit('/',1)[1].rsplit('.')[0])
path = '/def_params/param_1M56/param/fululu'
print(path.rsplit('/',1)[1].rsplit('.')[0])
path = '/def_params/param_1M56/param/one.before.two.dat'
print(path.rsplit('/',1)[1].rsplit('.',1)[0])
result
foo
fululu
one.before.two

Other people have answered the regex question, but in this case there's a more efficient way than regex:
file_name = path[path.rindex('/')+1 : path.rindex('.')]

try this one on for size:
match = re.search(r'.*/(.*?)\.bar$', '/def_params/param_1M56/param/foo.bar')

Matching "~" at the end of a filename with a python regular expression

I'm working on a Python script to find some files. I compare file names against a regular expression pattern. Now I have to find files ending with a "~" (tilde), so I built this regex:
if re.match("~$", string_test):
    print("ok!")
Python doesn't seem to recognize the regex, and I don't know why. I tried the same regex in other languages and it works perfectly. Any idea?
PS: I read on a website that I have to insert
# -*- coding: utf-8 -*-
but that doesn't help :( .
Thanks a lot. Meanwhile I'm going to keep reading to see if I find something.

re.match() is only successful if the regular expression matches at the beginning of the input string. To search for any substring, use re.search() instead:
if re.search("~$", string_test):
    print("ok!")

Your regex will only match strings "~" and (believe it or not) "~\n".
You need re.match(r".*~$", whatever) ... that means zero or more of (anything except a newline) followed by a tilde followed by (end-of-string or a newline preceding the end of string).
In the unlikely event that a filename can include a newline, use the re.DOTALL flag and use \Z instead of $.
"worked" in other languages: you must have used a search function.
r at the beginning of a string constant means raw escapes, e.g. '\n' is a newline but r'\n' is two characters, a backslash followed by n -- which can also be represented by '\\n'. Raw escapes save a lot of \\ in regexes; one should use r"regex" automatically.
BTW: in this case avoid the regex confusion ... use whatever.endswith('~')
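The difference between the calls discussed above can be checked directly. A minimal sketch; the sample name and the result variables are made up for illustration.

```python
import re

name = "notes.txt~"

# re.match only tries at position 0, where "n" is not "~".
match_start = re.match("~$", name)
# re.search scans the whole string and finds the trailing tilde.
search_hit = re.search("~$", name)
# Prefixing .* lets re.match consume everything before the tilde.
match_prefixed = re.match(r".*~$", name)
# $ also matches just before a trailing newline.
match_newline = re.match("~$", "~\n")

print(match_start is None)        # True
print(search_hit is not None)     # True
print(match_prefixed is not None) # True
print(match_newline is not None)  # True
print(name.endswith("~"))         # True
```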

For finding files, use glob instead:
import os
import glob
path = '/path/to/files'
os.chdir(path)
files = glob.glob('./*~')
print(files)

The correct regex and the glob solution have already been posted. Another option is to use the fnmatch module:
import fnmatch
if fnmatch.fnmatch(string_test, "*~"):
    print("ok!")
This is a tiny bit easier than using a regex. Note that all methods posted here are essentially equivalent: fnmatch is implemented using regular expressions, and glob in turn uses fnmatch.
Note that only in 2009 a patch was added to fnmatch (after six years!) that added support for file names with newlines.
