I want to split file’s extension by using regular expression - python

item is a string like "./test/test1.csv".
I want to change item into "test1".
I wrote this code:
item=re.search('./*.csv',item)
But the match is "1.csv". I really cannot understand why this happens. What should I do to get the result I want?

As a regex, './*.csv' does not mean what you think it does. A . means "any char" and a * means "zero or more of what came before". Thus, it's not "dot, slash, any string, dot, csv", but "any char, zero or more slashes, any char, csv".
If you really want to use a regex, you could try, e.g., this (among many other variants):
>>> re.search(r"([^/]+)\.[^\.]+$", p).group(1)
'test1'
Or just use str.split and rsplit:
>>> p.rsplit("/", 1)[-1].split(".")[0]
'test1'
Or, since you are handling file paths, how about os.path?
>>> os.path.splitext(os.path.split(p)[1])[0]
'test1'
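To see why the original pattern returned "1.csv", here is a small sketch that runs it next to the three alternatives above. The variable name p and the sample path are taken from the question; the other variable names are made up for illustration.

```python
import os.path
import re

p = "./test/test1.csv"

# './*.csv' reads as: any char, zero or more slashes, any char, then "csv".
# The leftmost place that fits is "1" + "." + "csv".
original = re.search('./*.csv', p).group(0)

# Three ways to get the bare file name.
with_regex = re.search(r"([^/]+)\.[^\.]+$", p).group(1)
with_split = p.rsplit("/", 1)[-1].split(".")[0]
with_os_path = os.path.splitext(os.path.split(p)[1])[0]

print(original)                              # 1.csv
print(with_regex, with_split, with_os_path)  # test1 test1 test1
```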

Usually you want the os.path module from the standard library for this kind of filename handling.
import os.path
print(os.path.splitext(os.path.basename('./test/test1.csv'))[0])
In your regular expression version of this, remember that . matches any character (not just periods), x* matches any number of x's (even zero), and that re.search returns a match object if the pattern matches anywhere in the string: your regular expression matches whenever "csv" appears anywhere in the string with at least two characters before it. A correct regular expression implementation might be
import re
print(re.search(r'/([^/.]+)\.[^/]+$', './test/test1.csv')[1])
(matching a slash, at least one character that is neither a period nor a slash, a period, at least one character that is not a slash, and end of string). (IMHO os.path is more readable and maintainable.)
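One difference between these approaches only shows up with names that contain more than one dot. The sketch below uses a hypothetical path, ./test/archive.tar.gz, which is not from the question, to compare them.

```python
import os.path
import re

path = './test/archive.tar.gz'  # hypothetical example with two dots

# os.path.splitext strips only the last extension.
stem_os = os.path.splitext(os.path.basename(path))[0]

# [^/.]+ cannot cross a dot, so this regex stops at the first dot of the name.
stem_first_dot = re.search(r'/([^/.]+)\.[^/]+$', path)[1]

# [^/]+ may contain dots, so this one strips only the last extension.
stem_last_dot = re.search(r'([^/]+)\.[^\.]+$', path).group(1)

print(stem_os)         # archive.tar
print(stem_first_dot)  # archive
print(stem_last_dot)   # archive.tar
```

Which result is right depends on what you treat as the extension.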

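This will work. Note that ntpath.basename keeps the extension, so strip it with splitext:
import ntpath
path = ".../test/test1.csv"
file_name = ntpath.splitext(ntpath.basename(path))[0]  # 'test1'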

Related

Look for files not having a given extension

I've tried the following code.
import re
regobj = re.compile(r"^.+\.(oth|xyz)$")
for test in ["text.txt", "other.oth", "abc.xyz"]:
    if regobj.match(test):
        print("Method 1:", test)
regobj = re.compile(r"^.+\.[^txt]$")
for test in ["text.txt", "other.oth", "abc.xyz"]:
    if regobj.match(test):
        print("Method 2:", test)
I would like the second method to find any file that does not have the extension txt, but my attempt does not work. What am I doing wrong?
Regular expressions are overkill here. Use the str.endswith() method:
if not test.endswith('.txt'):
    print("Method 2:", test)
Your regular expression uses a negative character class, which is a set of characters that should not be matched. [^txt] matches exactly one character that is not a t or an x, so only one-character extensions can satisfy that test. You could have explicitly matched .txt and used not to exclude rather than include:
regobj = re.compile(r"^.+\.txt$")
if not regobj.match(test):
    print("Method 2:", test)
If all you can use is regular expressions, use a negative look-ahead assertion:
regobj = re.compile(r"^[^.]+\.(?!txt$)[^.]+$")
Here (?!txt$) only matches at locations that are not followed by the literal txt and then the end of the string. The [^.]+ then matches one or more characters that are not a . character, up to the end of the string.
Change the second regex to:
regobj = re.compile(r"^.+\.(?!txt$)[^.]+$")
[^txt] matches any single character that is not t or x. (?!txt$) asserts that the dot is not followed by txt and the end of the string. The [^.]+ after \. requires at least one character right after the dot. So this matches file names that have any extension other than .txt.
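A quick way to check these patterns is to run them over a few names. The sketch below adds two hypothetical file names (archive.tar.gz and main.c) to the ones from the question; the helper name matching is made up for illustration.

```python
import re

names = ["text.txt", "other.oth", "abc.xyz", "archive.tar.gz", "main.c"]

def matching(pattern, candidates):
    """Return the candidates that the pattern matches from the start."""
    regobj = re.compile(pattern)
    return [name for name in candidates if regobj.match(name)]

# Original attempt: [^txt] is a single character that is not t or x,
# so only one-character extensions can match.
print(matching(r"^.+\.[^txt]$", names))
# ['main.c']

# Look-ahead with [^.]+ before the dot: the name may contain only one dot.
print(matching(r"^[^.]+\.(?!txt$)[^.]+$", names))
# ['other.oth', 'abc.xyz', 'main.c']

# Look-ahead with .+ before the dot: earlier dots are allowed.
print(matching(r"^.+\.(?!txt$)[^.]+$", names))
# ['other.oth', 'abc.xyz', 'archive.tar.gz', 'main.c']
```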
As Martijn Pieters mentioned, regex is overkill, considering there are other more efficient ways:
fileName, fileExt = os.path.splitext(string)
Using splitext it's simple to isolate the extension.
import os
fileDict = ["text.txt", "other.oth", "abc.xyz"]
matchExt = ".txt"
for eachFile in fileDict:
    fileName, fileExt = os.path.splitext(eachFile)
    # Compare the whole extension; a substring test would also skip ".txt2".
    if fileExt != matchExt:
        print("(not %s) %s %s" % (matchExt, fileExt, fileName))
You can easily add an else statement to match other extensions, which I'll leave up to you.

Python regex example

If I want to replace a pattern in the following statement structure:
cat&345;
bat &#hut;
I want to replace everything starting from & and ending just before the ; (not including the ;). What is the best way to do so?
Including or not including the & in the replacement?
>>> re.sub(r'&.*?(?=;)','REPL','cat&345;') # including
'catREPL;'
>>> re.sub(r'(?<=&).*?(?=;)','REPL','bat &#hut;') # not including
'bat &REPL;'
Explanation:
Although not required here, use a r'raw string' to prevent having to escape backslashes which often occur in regular expressions.
.*? is a "non-greedy" match of anything, which makes the match stop at the first semicolon.
(?=;) the match must be followed by a semicolon, but it is not included in the match.
(?<=&) the match must be preceded by an ampersand, but it is not included in the match.
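The effect of the non-greedy .*? is easiest to see on a line with more than one semicolon. This sketch uses a made-up input string, 'a&1;b&2;', which is not from the question.

```python
import re

text = 'a&1;b&2;'  # hypothetical input with two &...; runs

# Non-greedy: each match stops before the first semicolon it can reach.
lazy = re.sub(r'&.*?(?=;)', 'REPL', text)

# Greedy: the match runs up to the last semicolon on the line.
greedy = re.sub(r'&.*(?=;)', 'REPL', text)

print(lazy)    # aREPL;bREPL;
print(greedy)  # aREPL;
```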
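Here is a good regex:
import re
result = re.sub("(?<=\\&).*(?=;)", replacementstr, searchText)
Basically this will put the replacement in between the & and the ;. Note that .* is greedy, so if a line contains several semicolons the match runs up to the last one; use .*? to stop at the first.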
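Maybe go a different direction altogether and use HTMLParser.unescape(). The unescape() method is undocumented, but it doesn't appear to be "internal" because it doesn't have a leading underscore. (In Python 3.4 and later, the documented html.unescape() function does the same job; HTMLParser.unescape() was removed in Python 3.9.)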
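If the input really is HTML-escaped text, a minimal sketch with the documented html.unescape() function (Python 3.4+) looks like this. The sample string is an assumption; the strings in the question are not valid HTML character references, so this route may not apply to them.

```python
import html

escaped = 'fish &amp; chips &lt;b&gt; &#65;'  # hypothetical escaped text

# html.unescape converts named and numeric character references.
plain = html.unescape(escaped)

print(plain)  # fish & chips <b> A
```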
You can use negated character classes to do this:
import re
st='''\
cat&345;
bat &#hut;'''
for line in st.splitlines():
    print(line)
    print(re.sub(r'([^&]*)&[^;]*;',r'\1;',line))

Matching "~" at the end of a filename with a python regular expression

I'm working on a script (Python) to find some files. I compare names of files against a regular expression pattern. Now, I have to find files ending with a "~" (tilde), so I built this regex:
if re.match("~$", string_test):
    print("ok!")
Well, Python doesn't seem to recognize the regex, and I don't know why. I tried the same regex in other languages and it works perfectly. Any idea?
PS: I read on a website that I have to insert
# -*- coding: utf-8 -*-
but it doesn't help :( .
Thanks a lot. Meanwhile I'm going to keep reading to see if I find something.
re.match() is only successful if the regular expression matches at the beginning of the input string. To search for any substring, use re.search() instead:
if re.search("~$", string_test):
    print("ok!")
Your regex will only match strings "~" and (believe it or not) "~\n".
You need re.match(r".*~$", whatever) ... that means zero or more of (anything except a newline) followed by a tilde followed by (end-of-string or a newline preceding the end of string).
In the unlikely event that a filename can include a newline, use the re.DOTALL flag and use \Z instead of $.
"worked" in other languages: you must have used a search function.
r at the beginning of a string constant means raw escapes, e.g. '\n' is a newline but r'\n' is two characters, a backslash followed by n -- which can also be represented by '\\n'. Raw escapes save a lot of \\ in regexes; one should use r"regex" automatically.
BTW: in this case avoid the regex confusion ... use whatever.endswith('~')
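The difference between these calls can be checked directly. In this sketch the file name is a made-up example.

```python
import re

name = "notes.txt~"  # hypothetical backup file name

# re.match anchors at the start, so "~$" alone cannot match here.
print(re.match("~$", name) is not None)      # False

# re.search looks for the pattern anywhere in the string.
print(re.search("~$", name) is not None)     # True

# Prefixing ".*" lets re.match consume the rest of the name first.
print(re.match(r".*~$", name) is not None)   # True

# "$" also matches just before a trailing newline; "\Z" does not.
print(re.search("~$", "notes.txt~\n") is not None)    # True
print(re.search(r"~\Z", "notes.txt~\n") is not None)  # False

# The plain string method needs no regex at all.
print(name.endswith("~"))                    # True
```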
For finding files, use glob instead:
import os
import glob
path = '/path/to/files'
os.chdir(path)
files = glob.glob('./*~')
print(files)
The correct regex and the glob solution have already been posted. Another option is to use the fnmatch module:
import fnmatch
if fnmatch.fnmatch(string_test, "*~"):
    print("ok!")
This is a tiny bit easier than using a regex. Note that all methods posted here are essentially equivalent: fnmatch is implemented using regular expressions, and glob in turn uses fnmatch.
Note that only in 2009 a patch was added to fnmatch (after six years!) that added support for file names with newlines.
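As a rough illustration of how these pieces relate, the sketch below filters a made-up list of names with fnmatch.filter and then applies the regular expression that fnmatch.translate generates for the same pattern. The exact text of the generated regex differs between Python versions, so it is not shown.

```python
import fnmatch
import re

names = ["notes.txt~", "notes.txt", "draft~", "~config"]  # hypothetical names

# fnmatch.filter keeps the names that match the shell-style pattern.
backups = fnmatch.filter(names, "*~")
print(backups)  # ['notes.txt~', 'draft~']

# fnmatch.translate turns the same pattern into a regular expression.
regobj = re.compile(fnmatch.translate("*~"))
via_regex = [name for name in names if regobj.match(name)]
print(via_regex == backups)  # True
```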

Forward slash in a Python regex

I'm trying to use a Python regex to find a mathematical expression in a string. The problem is that the forward slash seems to do something unexpected. I'd have thought that [\w\d\s+-/*]* would work for finding math expressions, but it finds commas too for some reason. A bit of experimenting reveals that forward slashes are the culprit. For example:
>>> import re
>>> re.sub(r'[/]*', 'a', 'bcd')
'abacada'
Apparently forward slashes match between characters (even when it is in a character class, though only when the asterisk is present). Back slashes do not escape them. I've hunted for a while and not found any documentation on it. Any pointers?
Look here for documentation on Python's re module.
I think it is not the /, but rather the - in your first character class: [+-/] matches +, / and any ASCII value between, which happens to include the comma.
Maybe this hint from the docs helps:
If you want to include a ']' or a '-' inside a set, precede it with a backslash, or place it as the first character.
You are telling it to replace zero or more slashes with 'a'. So it does replace each "no character" with 'a'. :)
You probably meant [/]+, i.e. one or more slashes.
EDIT: Read Ber's answer for a solution to the original problem. I didn't read the whole question carefully enough.
r'[/]*' means "Match 0 or more forward-slashes". There are exactly 0 forward-slashes between 'b' & 'c' and between 'c' & 'd'. Hence, those matches are replaced with 'a'.
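To make the contrast concrete, here is a small sketch comparing * with +; the input 'b//c/d' is a made-up example.

```python
import re

# With *, the empty string between characters counts as a match.
print(re.sub(r'[/]*', 'a', 'bcd'))     # abacada

# With +, at least one slash is required, so 'bcd' is left alone.
print(re.sub(r'[/]+', 'a', 'bcd'))     # bcd

# Runs of slashes are replaced as a unit.
print(re.sub(r'[/]+', 'a', 'b//c/d'))  # bacad
```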
The * matches its argument zero or more times, and thus matches the empty string. The empty string is (logically) between any two consecutive characters. Hence
>>> import re
>>> re.sub(r'x*', 'a', 'bcd')
'abacada'
As for the forward slash, it receives no special treatment:
>>> re.sub(r'/', 'a', 'b/c/d')
'bacad'
The documentation describes the syntax of regular expressions in Python. As you can see, the forward slash has no special function.
The reason that [\w\d\s+-/*]* also finds commas is that inside square brackets the dash - denotes a range. In this case you don't want all characters between + and /, but the literal characters +, - and /. So write the dash as the last character: [\w\d\s+/*-]*. That should fix it.
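The range behaviour can be confirmed with a short sketch. The test strings and the names broken and fixed are made up for illustration.

```python
import re

# Inside a class, +-/ is the range from '+' (43) to '/' (47),
# which also covers ',', '-' and '.'.
print(re.findall(r'[+-/]', '+,-./'))  # ['+', ',', '-', '.', '/']

broken = re.compile(r'[\w\d\s+-/*]*')
fixed = re.compile(r'[\w\d\s+/*-]*')   # dash moved to the end: literal '-'

# The broken class accepts a comma; the fixed one does not.
print(broken.fullmatch('1, 2') is not None)  # True
print(fixed.fullmatch('1, 2') is not None)   # False

# The fixed class still accepts the arithmetic operators.
print(fixed.fullmatch('1 + 2 - 3/4*5') is not None)  # True
```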

Regular expression to match start of filename and filename extension

What is the regular expression to match strings (in this case, file names) that start with 'Run' and have a filename extension of '.py'?
The regular expression should match any of the following:
RunFoo.py
RunBar.py
Run42.py
It should not match:
myRunFoo.py
RunBar.py1
Run42.txt
The SQL equivalent of what I am looking for is ... LIKE 'Run%.py' ....
For a regular expression, you would use:
re.match(r'Run.*\.py$', filename)
A quick explanation:
. means match any character.
* means match zero or more repetitions of the previous element (hence .* means any sequence of chars)
\ escapes the dot so that it matches a literal dot
$ indicates "end of the string", so we don't match "Run_foo.py.txt"
However, for this task, you're probably better off using simple string methods, i.e.
filename.startswith("Run") and filename.endswith(".py")
Note: if you want case insensitivity (i.e. matching "run.PY" as well as "Run.py"), use the re.I option to the regular expression, or convert to a specific case (e.g. filename.lower()) before using string methods.
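Both approaches can be checked against the names from the question. This is only a sketch; the helper names by_regex and by_string_methods are made up.

```python
import re

should_match = ["RunFoo.py", "RunBar.py", "Run42.py"]
should_not_match = ["myRunFoo.py", "RunBar.py1", "Run42.txt"]

def by_regex(filename):
    # re.match anchors at the start; $ anchors at the end.
    return re.match(r'Run.*\.py$', filename) is not None

def by_string_methods(filename):
    return filename.startswith("Run") and filename.endswith(".py")

for check in (by_regex, by_string_methods):
    assert all(check(name) for name in should_match)
    assert not any(check(name) for name in should_not_match)

# Case-insensitive variants of each approach.
print(re.match(r'Run.*\.py$', "run.PY", re.I) is not None)    # True
lowered = "run.PY".lower()
print(lowered.startswith("run") and lowered.endswith(".py"))  # True
```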
I don't really understand why you're after a regular expression to solve this 'problem'. You're just after a way to find all .py files that start with 'Run'. So this is a simple solution that will work, without resorting to compiling and running a regular expression:
import os
for filename in os.listdir(dirname):
    root, ext = os.path.splitext(filename)
    if root.startswith('Run') and ext == '.py':
        print(filename)
Warning:
jobscry's answer ("^Run.?.py$") is incorrect (will not match "Run123.py", for example).
orlandu63's answer ("/^Run[\w]*?.py$/") will not match "RunFoo.Bar.py".
(I don't have enough reputation to comment, sorry.)
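These claims are easy to verify. A short sketch, using the patterns quoted above without the surrounding slashes; the name RunFooXpy is a made-up test case:

```python
import re

# "^Run.?.py$" allows at most two characters between "Run" and "py".
print(re.match(r"^Run.?.py$", "Run123.py") is not None)          # False

# [\w] cannot match a dot, so a dotted name fails.
print(re.match(r"^Run[\w]*?.py$", "RunFoo.Bar.py") is not None)  # False

# The unescaped dot also accepts any character before "py".
print(re.match(r"^Run[\w]*?.py$", "RunFooXpy") is not None)      # True

# Escaping the dot and using .* handles both cases.
print(re.match(r"^Run.*\.py$", "Run123.py") is not None)         # True
print(re.match(r"^Run.*\.py$", "RunFoo.Bar.py") is not None)     # True
```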
/^Run.*\.py$/
Or, in python specifically:
import re
re.match(r"^Run.*\.py$", stringtocheck)
This will match "Runfoobar.py", but not "runfoobar.PY". To make it case insensitive, instead use:
re.match(r"^Run.*\.py$", stringtocheck, re.I)
You don't need a regular expression; you can use glob, which takes wildcards, e.g. Run*.py
For example, to get those files in your current directory...
import os, glob
files = glob.glob(os.path.join(os.getcwd(), "Run*.py"))
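A portable variant is sketched below. It builds the pattern with os.path.join rather than a hard-coded backslash and runs against a temporary directory so it can be tried anywhere; the function name find_run_scripts and the sample file names are made up for illustration.

```python
import glob
import os
import tempfile

def find_run_scripts(directory):
    """Return the base names of files in directory matching Run*.py."""
    pattern = os.path.join(directory, "Run*.py")
    return sorted(os.path.basename(path) for path in glob.glob(pattern))

with tempfile.TemporaryDirectory() as tmp:
    for name in ["RunFoo.py", "Run42.py", "Run42.txt", "myRunFoo.py"]:
        # Create empty placeholder files to search through.
        open(os.path.join(tmp, name), "w").close()
    found = find_run_scripts(tmp)

print(found)  # ['Run42.py', 'RunFoo.py']
```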
If you write a slightly more complex regular expression, you can get an extra feature: extract the bit between "Run" and ".py":
>>> import re
>>> regex = r'^Run(?P<name>.*)\.py$'
>>> m = re.match(regex, 'RunFoo.py')
>>> m.group('name')
'Foo'
(the extra bit is the parentheses and everything between them, except for '.*' which is as in Rob Howard's answer)
This probably doesn't fully comply with file-naming standards, but here it goes:
/^Run[\w]*?\.py$/
maybe:
^Run.*\.py$
just a quick try
