I'm working on a Python script to find some files. I compare file names against a regular expression pattern. Now I have to find files ending with a "~" (tilde), so I built this regex:
if re.match("~$", string_test):
    print("ok!")
Well, Python doesn't seem to recognize the regex, and I don't know why. I tried the same regex in other languages and it works perfectly. Any idea?
P.S.: I read on a website that I have to insert
# -*- coding: utf-8 -*-
but that doesn't help :(
Thanks a lot. Meanwhile, I'm going to keep reading to see if I find something.
re.match() is only successful if the regular expression matches at the beginning of the input string. To search for any substring, use re.search() instead:
if re.search("~$", string_test):
    print("ok!")
Your regex will only match strings "~" and (believe it or not) "~\n".
You need re.match(r".*~$", whatever) ... that means zero or more of (anything except a newline) followed by a tilde followed by (end-of-string or a newline preceding the end of string).
In the unlikely event that a filename can include a newline, use the re.DOTALL flag and use \Z instead of $.
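The behaviour described in these answers can be checked with a short sketch. The file name and the helper ends_with_tilde are made up for illustration:

```python
import re

def ends_with_tilde(name):
    # \Z anchors at the very end of the string, so a trailing newline is not ignored
    return re.search(r"~\Z", name) is not None

name = "notes.txt~"  # hypothetical backup file name

print(re.match("~$", name))      # None: match() only tries position 0
print(re.search("~$", name))     # a match object: search() scans the whole string
print(re.match(r".*~$", name))   # a match object: .* consumes the leading part
print(re.match("~$", "~\n"))     # a match object: $ also matches before a final newline
print(ends_with_tilde("notes.txt~\n"))  # False
```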
"worked" in other languages: you must have used a search function.
An r at the beginning of a string constant means raw escapes: '\n' is a newline, but r'\n' is two characters, a backslash followed by n, which can also be written as '\\n'. Raw strings save a lot of \\ in regexes, so one should write r"regex" as a matter of habit.
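A tiny sketch of the raw-string point, using only built-in string behaviour:

```python
newline = '\n'   # one character: a line feed
raw = r'\n'      # two characters: a backslash and the letter n

print(len(newline), len(raw))   # 1 2
print(raw == '\\n')             # True
```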
BTW: in this case avoid the regex confusion ... use whatever.endswith('~')
For finding files, use glob instead,
import os
import glob
path = '/path/to/files'
os.chdir(path)
files = glob.glob('./*~')
print(files)
The correct regex and the glob solution have already been posted. Another option is to use the fnmatch module:
import fnmatch
if fnmatch.fnmatch(string_test, "*~"):
    print("ok!")
This is a tiny bit easier than using a regex. Note that all methods posted here are essentially equivalent: fnmatch is implemented using regular expressions, and glob in turn uses fnmatch.
Note that a patch adding support for file names containing newlines was only applied to fnmatch in 2009 (after six years!).
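To see the regular expression behind an fnmatch pattern, fnmatch.translate can be used. This is only a sketch: the exact text it returns differs between Python versions, so it is printed rather than assumed, and the file names are made up.

```python
import fnmatch
import re

# translate() returns the regular expression that fnmatch builds for a pattern
pattern = fnmatch.translate("*~")
print(pattern)

backup_re = re.compile(pattern)
print(bool(backup_re.match("notes.txt~")))
print(bool(backup_re.match("notes.txt")))
```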
I'm trying to use re.findall to create a substring of a filepath that only gives me the part before the first backslash.
It's part of a for-loop with os.walk and I'm trying to get the first part of my root as the result of re.findall.
folder="Q:\\test\\test.gdb"
for root, dirs, files in os.walk(folder):
    Quelle=re.findall('(.+?)\\',root)
This, however, produces error: bogus escape (end of line). From what I understand, this error is generated because I use the escape character at the end of the line. But, in my code example it is not at the end of the line? From my understanding I have to use it, to escape the backslash so that my string includes everything of the path up until the first backslash. Is there any way around it?
When I use
folder="Q:\\test\\test.gdb"
for root, dirs, files in os.walk(folder):
    Quelle=re.findall('(.+?):',root)
I correctly get the list of strings ['Q']. But I want to include the : in my string.
The error is coming from re, not the interpreter.
Backslashes also have meaning to regular expressions as escape characters. For example, r'\.' is a literal period, not the wildcard that matches any character, as '.' would be. A literal backslash therefore needs to be escaped a second time, at the regex level.
You have two paths going forward:
Double the backslash: '(.+?)\\\\'
Do the same thing, but use a raw string to make it look nicer: r'(.+?)\\'
Don't feel bad. There is even a special section in the official Regular Expression HOWTO called "The Backslash Plague" dealing with just this situation.
You can also search the list of special characters in the official docs for more information.
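A minimal sketch of both spellings applied to the path from the question. The helper first_component is hypothetical; it uses re.match so that only the part before the first backslash is returned:

```python
import re

root = r"Q:\test\test.gdb"   # same path as in the question, written as a raw string

# both spellings hand the same pattern to the regex engine
assert '(.+?)\\\\' == r'(.+?)\\'

def first_component(path):
    # hypothetical helper: text before the first backslash, or None
    m = re.match(r'(.+?)\\', path)
    return m.group(1) if m else None

print(first_component(root))          # Q:
print(re.findall(r'(.+?)\\', root))   # ['Q:', 'test']
```

Note that re.findall returns every non-overlapping match, which is why the second call also yields 'test'.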
item is a string like "./test/test1.csv" .
I want to change item into "test1".
I wrote this code:
item=re.search('./*.csv',item)
But the match is "1.csv". I really cannot understand why this happens. What should I do to get the result I want?
As a regex, './*.csv' does not mean what you think it does. A . means "any char" and a * means "zero or more of what came before". Thus, it's not "dot, slash, any string, dot, csv", but "any char, zero or more slashes, any char, csv".
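This reading of the pattern can be confirmed directly. The second and third input strings are made up to show the edge cases:

```python
import re

pattern = './*.csv'   # the pattern from the question, interpreted as a regex

print(re.search(pattern, './test/test1.csv').group(0))   # 1.csv
print(re.search(pattern, 'abcsv').group(0))              # abcsv
print(re.search(pattern, 'csv'))                         # None
```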
If you really want to use a regex, you could try, e.g., this (among many other variants):
>>> re.search(r"([^/]+)\.[^\.]+$", p).group(1)
'test1'
Or just use str.split and rsplit:
>>> p.rsplit("/", 1)[-1].split(".")[0]
'test1'
Or, since you are handling file paths, how about os.path?
>>> os.path.splitext(os.path.split(p)[1])[0]
'test1'
Usually you want the os.path module from the standard library for this kind of filename handling.
import os.path
print(os.path.splitext(os.path.basename('./test/test1.csv'))[0])
In your regular expression version of this, remember that . matches any character (not just periods), x* matches any number of x's (even zero), and that re.search succeeds if the pattern matches anywhere in the string: your regular expression matches whenever a filename contains the letters "csv" preceded by at least two other characters. A correct regular expression implementation might be
import re
print(re.search(r'/([^/.]+)\.[^/]+$', './test/test1.csv')[1])
(matching a slash, at least one character that is neither a period nor a slash, a period, at least one character that is not a slash, and end of string). (IMHO os.path is more readable and maintainable.)
This will work
import ntpath
path = ".../test/test1.csv"
file_name = ntpath.basename(path)
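Note that basename alone still keeps the extension. A short sketch of the remaining step, using the example path from the question rather than the abbreviated one above:

```python
import ntpath

path = "./test/test1.csv"              # example path from the question
file_name = ntpath.basename(path)      # 'test1.csv'
stem = ntpath.splitext(file_name)[0]   # 'test1'
print(file_name, stem)
```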
I am trying to find text files with a certain naming convention using regex, but have so far been unsuccessful.
The naming convention is file_[year]-[month]-[day].txt (eg. file_2010-09-15.txt).
Here is what I have so far: ^(file_)[0-9]{4}[-][0-9]{2}[-][0-9]{2}(\.txt)$
I'm trying to use it in my code like this:
for text_file in os.listdir(path):
    if fnmatch.fnmatch(text_file, '^(file_)[0-9]{4}[-][0-9]{2}[-][0-9]{2}(\.txt)$'):
        # print number of files found
I think the issue is because of the pattern type that fnmatch is expecting. In the documents it states the following:
This module provides support for Unix shell-style wildcards, which are not the same as regular expressions (which are documented in the re module). The special characters used in shell-style wildcards are:
Pattern   Meaning
*         matches everything
?         matches any single character
[seq]     matches any character in seq
[!seq]    matches any character not in seq
You could keep using fnmatch and just rewrite the pattern in that shell style, i.e.:
for text_file in os.listdir(path):
    if fnmatch.fnmatch(text_file, 'file_[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9].txt'):
        # print number of files found
Or what I would suggest is using re.match like so:
regex = re.compile(r'^(file_)[0-9]{4}[-][0-9]{2}[-][0-9]{2}(\.txt)$')
for text_file in os.listdir(path):
    if regex.match(text_file):
        print(text_file)  # print the text file
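The difference between the two pattlanguages can be seen on a single file name. This is only a sketch comparing the patterns quoted above:

```python
import fnmatch
import re

name = 'file_2010-09-15.txt'
regex_text = r'^(file_)[0-9]{4}[-][0-9]{2}[-][0-9]{2}(\.txt)$'
shell_text = 'file_[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9].txt'

print(fnmatch.fnmatch(name, regex_text))   # False: ^, ( and { are literal characters to fnmatch
print(fnmatch.fnmatch(name, shell_text))   # True
print(bool(re.match(regex_text, name)))    # True
```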
fnmatch translates the shell-style pattern into a regular expression for Python's re module. Take a look at the source code here. Basically, the shortcuts supported are:
Patterns are Unix shell style:
* matches everything
? matches any single character
[seq] matches any character in seq
[!seq] matches any char not in seq
Your pattern should be: 'file_[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9].txt'
Alternatively, you can use re directly, without fnmatch (take the code below as a starting point; there is room for improvement, e.g. checking that the year is valid, the month is between 1 and 12, and the day is between 1 and 28, 29, 30, or 31):
import re
example_file = 'file_2010-09-15.txt'
myregex = r'file_\d\d\d\d-\d\d-\d\d\.txt'
result = re.match(myregex, example_file)
print(result.group(0))
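One possible way to add the date check is to let datetime.strptime validate the captured date. This is a sketch, not the only approach; the names FILE_RE and is_dated_file are made up for illustration:

```python
import re
from datetime import datetime

# capture the date part; fullmatch() below requires the whole name to fit
FILE_RE = re.compile(r'file_([0-9]{4}-[0-9]{2}-[0-9]{2})\.txt')

def is_dated_file(name):
    m = FILE_RE.fullmatch(name)
    if m is None:
        return False
    try:
        # strptime raises ValueError for impossible dates such as month 13
        datetime.strptime(m.group(1), '%Y-%m-%d')
    except ValueError:
        return False
    return True

print(is_dated_file('file_2010-09-15.txt'))   # True
print(is_dated_file('file_2010-13-15.txt'))   # False
print(is_dated_file('file_2010-02-30.txt'))   # False
```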
I need a glob2- or Formic-like solution to search a large list of directories in a text file (the files aren't on my machine; the file list is produced by an external process I cannot directly access or query).
pseudo example:
# read the large directory list in memory
data = []
with open('C:\\log_file.txt','r') as log:
    data = log.readlines()
# query away!
query1 = listglob(data,'/**/fnord/*/log.*')
query2 = listglob(data,'/usr/*/model_*/fnord/**')
Unless someone has a suggestion, my next step is to open up glob2 and formic and see if one of them can be changed to accept a list instead of a root folder to be "os.walked"
I would recommend using regular expressions. Ultimately, both Formic and glob use an OS call to perform the actual glob matching. So, if you want to modify either, you're going to have to write a RE matcher (or similar) in any case. So, cut out the middle-man and go straight to REs. (It pains me to say that because I'm the author of Formic).
The basic plan is to write a function that takes in your glob and returns a regular expression. Here are some pointers:
Escape ., - and other RE reserved characters in your globs. E.g. . becomes \.
A ? in a glob file/directory becomes [^/] (matches a single character that's not a /)
A * in a glob file/directory name as a regular expression is [^/]*
A /*/ glob as a regular expression is: /[^/]+/
A /**/ glob as a regular expression is: /([^/]+/)*
To match a whole line, start the RE with a ^ and end it with $. This forces the RE to expand over the whole string.
While I listed the substitutions in order of increasing complexity, it's probably a good idea to do the substitutions in the following order:
Special RE characters that are not globs (., -, '$', etc)
?
/**/
/*/
*
This way you won't corrupt the /**/ when substituting for a single *.
In your question you have: /**/fnord/*/log.*. This would map to:
^/([^/]+/)*fnord/[^/]+/log\.[^/]*$
Once you've built your RE, then finding matches is a simple exercise.
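The plan above can be sketched as follows. This is one possible reading, not the Formic or glob2 implementation. It scans the glob once from left to right instead of doing successive string replacements, which sidesteps the ordering problem. The names glob_to_regex and the sample data are made up, and two choices are my own assumptions: a bare ** (as at the end of the second query) matches anything including slashes, and character classes such as [abc] are not handled and are treated as literal text.

```python
import re

def glob_to_regex(glob):
    """Translate a glob using ?, *, /*/ and /**/ into an anchored regex string."""
    out = []
    i = 0
    while i < len(glob):
        if glob.startswith('/**/', i):
            out.append('/([^/]+/)*')        # zero or more whole directories
            i += 4
        elif glob.startswith('/*/', i):
            out.append('/[^/]+/')           # exactly one non-empty directory name
            i += 3
        elif glob.startswith('**', i):
            out.append('.*')                # assumption: bare ** matches anything, slashes included
            i += 2
        elif glob[i] == '*':
            out.append('[^/]*')             # any run of characters within one name
            i += 1
        elif glob[i] == '?':
            out.append('[^/]')              # one character within a name
            i += 1
        else:
            out.append(re.escape(glob[i]))  # everything else is literal
            i += 1
    return '^' + ''.join(out) + '$'

def listglob(data, glob):
    """Return the entries of data (lines, possibly newline-terminated) that match glob."""
    regex = re.compile(glob_to_regex(glob))
    stripped = (line.rstrip('\n') for line in data)
    return [line for line in stripped if regex.match(line)]

data = [   # made-up sample lines standing in for the log file
    '/a/b/fnord/x/log.txt\n',
    '/fnord/x/log.1\n',
    '/a/fnord/log.txt\n',
    '/usr/local/model_7/fnord/deep/file\n',
]
print(listglob(data, '/**/fnord/*/log.*'))
print(listglob(data, '/usr/*/model_*/fnord/**'))
```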
In the end I used one of glob2's functions, like so:
import glob2
def listglob(data, pattern):
    # keep only the entries that match the fnmatch-style pattern
    return [x for x in data if glob2.fnmatch.fnmatch(x, pattern)]
I don't think glob2.fnmatch.fnmatch is equivalent to the glob2 ** syntax.
From what I can tell from reading the source code, it is equivalent to the fnmatch syntax.
Also, Andrew's answer doesn't cover square brackets or the [!abc] form.
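A small sketch of how fnmatch-style matching treats slashes and character classes. It uses the standard library's fnmatch; that glob2's bundled copy behaves the same way is an assumption, and the sample paths are made up:

```python
import fnmatch

path = '/usr/a/b/c/log.txt'   # made-up entry

# fnmatch's * is not stopped by '/', so a single * can span several directories
print(fnmatch.fnmatchcase(path, '/usr/*/log.txt'))   # True
# character classes, including the negated [!...] form, are supported
print(fnmatch.fnmatchcase('log.1', 'log.[0-9]'))     # True
print(fnmatch.fnmatchcase('log.1', 'log.[!0-9]'))    # False
```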
What is the regular expression to match strings (in this case, file names) that start with 'Run' and have a filename extension of '.py'?
The regular expression should match any of the following:
RunFoo.py
RunBar.py
Run42.py
It should not match:
myRunFoo.py
RunBar.py1
Run42.txt
The SQL equivalent of what I am looking for is ... LIKE 'Run%.py' ....
For a regular expression, you would use:
re.match(r'Run.*\.py$', filename)
A quick explanation:
. means match any character.
* means match any repetition of the previous character (hence .* means any sequence of chars)
\ is an escape to escape the explicit dot
$ indicates "end of the string", so we don't match "Run_foo.py.txt"
However, for this task, you're probably better off using simple string methods. ie.
filename.startswith("Run") and filename.endswith(".py")
Note: if you want case insensitivity (i.e. matching "run.PY" as well as "Run.py"), use the re.I option to the regular expression, or convert to a specific case (e.g. filename.lower()) before using string methods.
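Both approaches can be checked against the file names listed in the question. The helper name is_run_script is made up for illustration:

```python
import re

RUN_PY = re.compile(r'Run.*\.py$')

def is_run_script(filename):
    # string-method counterpart of the regex above
    return filename.startswith('Run') and filename.endswith('.py')

should_match = ['RunFoo.py', 'RunBar.py', 'Run42.py']
should_not_match = ['myRunFoo.py', 'RunBar.py1', 'Run42.txt']

for name in should_match + should_not_match:
    print(name, bool(RUN_PY.match(name)), is_run_script(name))

# case-insensitive variant
print(bool(re.match(r'Run.*\.py$', 'run.PY', re.I)))   # True
```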
I don't really understand why you're after a regular expression to solve this 'problem'. You're just after a way to find all .py files that start with 'Run'. So this is a simple solution that will work, without resorting to compiling and running a regular expression:
import os
for filename in os.listdir(dirname):
    root, ext = os.path.splitext(filename)
    if root.startswith('Run') and ext == '.py':
        print(filename)
Warning:
jobscry's answer ("^Run.?.py$") is incorrect (will not match "Run123.py", for example).
orlandu63's answer ("/^Run[\w]*?.py$/") will not match "RunFoo.Bar.py".
(I don't have enough reputation to comment, sorry.)
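The two failures mentioned in the warning are easy to reproduce. The variable names are made up, and the third pattern is the ^Run.*\.py$ form suggested elsewhere in this thread:

```python
import re

too_short = r"^Run.?.py$"       # first pattern quoted above
word_only = r"^Run[\w]*?.py$"   # second pattern quoted above, without the / delimiters
suggested = r"^Run.*\.py$"

print(re.match(too_short, "Run123.py"))        # None
print(re.match(word_only, "RunFoo.Bar.py"))    # None
print(bool(re.match(suggested, "Run123.py")))      # True
print(bool(re.match(suggested, "RunFoo.Bar.py")))  # True
# the unescaped dot also lets a name without any extension through
print(bool(re.match(too_short, "Runxpy")))     # True
```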
/^Run.*\.py$/
Or, in python specifically:
import re
re.match(r"^Run.*\.py$", stringtocheck)
This will match "Runfoobar.py", but not "runfoobar.PY". To make it case insensitive, instead use:
re.match(r"^Run.*\.py$", stringtocheck, re.I)
You don't need a regular expression, you can use glob, which takes wildcards e.g. Run*.py
For example, to get those files in your current directory...
import os, glob
files = glob.glob(os.path.join(os.getcwd(), "Run*.py"))
If you write a slightly more complex regular expression, you can get an extra feature: extract the bit between "Run" and ".py":
>>> import re
>>> regex = '^Run(?P<name>.*)\.py$'
>>> m = re.match(regex, 'RunFoo.py')
>>> m.group('name')
'Foo'
(the extra bit is the parentheses and everything between them, except for '.*' which is as in Rob Howard's answer)
This probably doesn't fully comply with file-naming standards, but here it goes:
/^Run[\w]*?\.py$/
maybe:
^Run.*\.py$
just a quick try