regex for finding file paths

regex for finding file paths - python

I used the regex (\/.*\.[\w:]+) to find all file paths and directories. But in a line like "file path /log/file.txt some lines /log/var/file2.txt", which contains two paths, it does not select the paths individually; it selects everything from the first path to the end of the line. How can I solve this?

Use the regex (\/.*?\.[\w:]+) to make the match non-greedy. To find multiple matches in the same line, use re.findall().
Update:
Using this code and the example provided, I get:
import re
re.findall(r'(\/.*?\.[\w:]+)', "file path /log/file.txt some lines /log/var/file2.txt")
['/log/file.txt', '/log/var/file2.txt']
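To see the difference side by side, here is a minimal sketch that runs the greedy and the non-greedy pattern on the example line from the question (the variable names are only illustrative):

```python
import re

line = "file path /log/file.txt some lines /log/var/file2.txt"

# Greedy: .* runs to the end of the line and backtracks to the LAST dot
greedy = re.findall(r"\/.*\.[\w:]+", line)

# Non-greedy: .*? stops at the FIRST dot that is followed by [\w:]+
lazy = re.findall(r"\/.*?\.[\w:]+", line)

print(greedy)
print(lazy)
```

The greedy version returns a single match spanning both paths, while the non-greedy version returns each path separately.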

Your regex (\/.*\.[\w:]+) uses .*, which is greedy: it consumes as much as possible, so [\w:]+ only matches after the last dot in the line (the one in file2.txt). You could use .*? instead.
But that would also match /log////var////.txt
As an alternative, you might use a repeating non-greedy pattern that matches the directory structure, (?:/[^/]+)+?, followed by a part that matches the filename, /\w+\.\w+
(?:/[^/]+)+?/\w+\.\w+
import re
s = "file path /log/file.txt some lines /log/var/file2.txt or /log////var////.txt"
print(re.findall(r'(?:/[^/]+)+?/\w+\.\w+', s))
That would result in:
['/log/file.txt', '/log/var/file2.txt']

You can use Python's re module, something like this:
import re
msg = "file path /log/file.txt some lines /log/var/file2.txt"
# Raw string; digits are included so that file2.txt is matched in full
matches = re.findall(r"/[a-zA-Z0-9./]+", msg)
print(matches)
Ref: https://docs.python.org/2/library/re.html#finding-all-adverbs

Related

Regex expression to match last numerical component, but exclude file extension

I'm stumped trying to figure out a regex expression. Given a file path, I need to match the last numerical component of the path ("frame" number in an image sequence), but also ignore any numerical component in the file extension.
For example, given path:
/path/to/file/abc123/GCAM5423.xmp
The following expression will correctly match 5423.
((?P<index>(?P<padding>0*)\d+)(?!.*(0*)\d+))
However, this expression fails if for example the file extension contains a number as follows:
/path/to/file/abc123/GCAM5423.cr2
In this case the expression will match the 2 in the file extension, when I still need it to match 5423. How can I modify the above expression to ignore file extensions that have a numerical component?
Using python flavor of regex. Thanks in advance!
Edit: Thanks all for your help! To clarify, I specifically need to modify the above expression to only capture the last group. I am passing this pattern to an external library so it needs to include the named groups and to only match the last number prior to the extension.

You can try this one:
\/[a-zA-Z]*(\d*)\.[a-zA-Z0-9]{3,4}$
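As a quick check, this pattern can be tried on the two paths from the question. This is only a sketch; the extra path without digits is a made-up example to show one edge case.

```python
import re

pattern = re.compile(r"\/[a-zA-Z]*(\d*)\.[a-zA-Z0-9]{3,4}$")

paths = [
    "/path/to/file/abc123/GCAM5423.xmp",
    "/path/to/file/abc123/GCAM5423.cr2",
]
frames = [m.group(1) for p in paths if (m := pattern.search(p))]
print(frames)

# Because the group is (\d*), a name without digits still matches
# and yields an empty string rather than no match
no_digits = pattern.search("/path/to/file/GCAM.xmp").group(1)
print(repr(no_digits))
```

Note that the pattern assumes the file name is letters followed by digits and that the extension is 3 or 4 characters long.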

Try this pattern:
\/[^/\d\s]+(\d+)\.[^/]+$
Code:
import re
pattern = r"\/[^/\d\s]+(\d+)\.[^/]+$"
texts = ['/path/to/file/abc123/GCAM5423.xmp', '/path/to/file/abc123/GCAM5423.cr2']
print([match.group(1) for x in texts if (match := re.search(pattern, x))])
Output:
['5423', '5423']

Step1: Find substring before last dot.
(.*)\.
Input: /path/to/file/abc123/GCAM5423.cr2
Output: /path/to/file/abc123/GCAM5423
Step2: Find the last numbers using your regex.
Input: /path/to/file/abc123/GCAM5423
Output: 5423
I don't know how to combine these two regexes into one, but I hope this is still useful to you.
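The two steps above could be chained in Python roughly like this. It is a sketch only: the function name last_frame_number is hypothetical, and it does not meet the question's requirement of a single pattern that can be handed to an external library.

```python
import re

# Pattern from the question, unchanged
LAST_NUMBER = re.compile(r"((?P<index>(?P<padding>0*)\d+)(?!.*(0*)\d+))")

def last_frame_number(path):
    # Step 1: keep everything before the last dot (greedy .* backtracks to it)
    stem_match = re.match(r"(.*)\.", path)
    stem = stem_match.group(1) if stem_match else path
    # Step 2: apply the original pattern to the stem only
    number_match = LAST_NUMBER.search(stem)
    return number_match.group("index") if number_match else None

print(last_frame_number("/path/to/file/abc123/GCAM5423.cr2"))
print(last_frame_number("/path/to/file/abc123/GCAM5423.xmp"))
```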

Extract file name with a regular expression

I want to create a regular expression to extract the filename from a URL:
https://example.net/img/src/img.jpg
I want to extract img.jpg
I use urlparse from Python, but it extracts the path like this:
img/src/img.jpg
How can I extract the filename with a regular expression?

Using str.split and negative indexing
url = "https://example.net/img/src/img.jpg"
print(url.split("/")[-1])
Output:
img.jpg
or using os.path.basename (Python 3; in Python 2 the module is named urlparse):
import os
from urllib.parse import urlparse
url = "https://example.net/img/src/img.jpg"
a = urlparse(url)
print(os.path.basename(a.path))  # ---> img.jpg

You can either split on / and select the last element of the returned list (the best solution in my opinion),
or, if you really want to use a regex, you can use the following one:
(?<=\/)(?:(?:\w+\.)*\w+)$
Note that only filenames made of word characters (letters, digits and underscores) separated by dots are accepted.
You can adapt and change the \w to accept other characters if necessary.
Explanations:
(?<=\/) is a positive lookbehind on /, and $ adds the constraint that the filename is the last element of the path.
(?:(?:\w+\.)*\w+) matches groups of letters, digits and underscores, each followed by a dot; this group can be repeated as many times as necessary (an xxx.tar.gz file, for example) and is then followed by the final extension.
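A small sketch of how this regex might be used from Python. The helper name filename_from_url and the second and third URLs are made up for illustration.

```python
import re

FILENAME = re.compile(r"(?<=/)(?:\w+\.)*\w+$")

def filename_from_url(url):
    match = FILENAME.search(url)
    return match.group(0) if match else None

print(filename_from_url("https://example.net/img/src/img.jpg"))
print(filename_from_url("https://example.net/files/backup.tar.gz"))
# A hyphen is not matched by \w, so this name is rejected
print(filename_from_url("https://example.net/img/my-file.jpg"))
```

The last call returns None, which shows why \w may need to be widened for real-world file names.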

If your URL pattern is static, you can use a positive lookahead:
import re
pattern =r'\w+(?=\.jpg)'
text="""https://example.net/img/src/img.jpg
"""
print(re.findall(pattern,text)[0])
output:
img

Match Naming Convention of Text File Regex

I am trying to find text files with a certain naming convention using regex, but have so far been unsuccessful.
The naming convention is file_[year]-[month]-[day].txt (eg. file_2010-09-15.txt).
Here is what I have so far: ^(file_)[0-9]{4}[-][0-9]{2}[-][0-9]{2}(\.txt)$
I'm trying to use it in my code like this:
for text_file in os.listdir(path):
    if fnmatch.fnmatch(text_file, '^(file_)[0-9]{4}[-][0-9]{2}[-][0-9]{2}(\.txt)$'):
        # print number of files found

I think the issue is the type of pattern that fnmatch expects. The documentation states the following:
This module provides support for Unix shell-style wildcards, which are not the same as regular expressions (which are documented in the re module). The special characters used in shell-style wildcards are:
Pattern   Meaning
*         matches everything
?         matches any single character
[seq]     matches any character in seq
[!seq]    matches any character not in seq
You could keep fnmatch and just rewrite the pattern in that style, i.e.:
for text_file in os.listdir(path):
    if fnmatch.fnmatch(text_file, 'file_[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9].txt'):
        print(text_file)
Or, what I would suggest, use re.match like so:
regex = re.compile(r'^(file_)[0-9]{4}[-][0-9]{2}[-][0-9]{2}(\.txt)$')
for text_file in os.listdir(path):
    if regex.match(text_file):
        print(text_file)
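To make the difference concrete, here is a small sketch that applies both styles to a few made-up file names instead of a real directory listing. It uses fnmatch.fnmatchcase so the result does not depend on the platform.

```python
import fnmatch
import re

names = ["file_2010-09-15.txt", "file_2010-9-15.txt", "notes.txt"]

glob_pattern = "file_[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9].txt"
regex = re.compile(r"^file_[0-9]{4}-[0-9]{2}-[0-9]{2}\.txt$")

# fnmatchcase avoids the platform-dependent case folding of fnmatch.fnmatch
by_fnmatch = [n for n in names if fnmatch.fnmatchcase(n, glob_pattern)]
by_regex = [n for n in names if regex.match(n)]

# Passing the regex text to fnmatch treats it as a shell pattern, so nothing matches
by_mistake = [n for n in names if fnmatch.fnmatchcase(n, regex.pattern)]

print(by_fnmatch, by_regex, by_mistake)
```

The last list is empty because characters such as ^, { and $ are taken literally by fnmatch, which is what goes wrong in the question's code.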

fnmatch translates the shell-style pattern into a regular expression for Python's re module (see fnmatch.translate in the module's source code). The supported shortcuts are:
Patterns are Unix shell style:
* matches everything
? matches any single character
[seq] matches any character in seq
[!seq] matches any char not in seq
Your pattern should be: 'file_[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9].txt'
Alternatively, you can use re directly, without fnmatch. Take the code below as a starting point; there is room for improvement, such as checking that the year is valid, the month is between 1 and 12, and the day is between 1 and 28, 29, 30 or 31:
import re
example_file = 'file_2010-09-15.txt'
myregex = r'file_\d\d\d\d-\d\d-\d\d\.txt'
result = re.match(myregex, example_file)
print(result.group(0))
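One possible way to add the date check mentioned above is to let datetime.strptime validate the captured date. This is a sketch under that assumption; the function name has_valid_date is hypothetical.

```python
import re
from datetime import datetime

FILE_RE = re.compile(r"file_(\d{4}-\d{2}-\d{2})\.txt")

def has_valid_date(name):
    # The regex checks the shape of the name, strptime checks the calendar
    match = FILE_RE.fullmatch(name)
    if match is None:
        return False
    try:
        datetime.strptime(match.group(1), "%Y-%m-%d")
    except ValueError:
        return False
    return True

print(has_valid_date("file_2010-09-15.txt"))
print(has_valid_date("file_2010-02-30.txt"))
```

This rejects names such as file_2010-02-30.txt, which have the right shape but an impossible date.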

Look for files not having a given extension

I've tried the following code.
import re
regobj = re.compile(r"^.+\.(oth|xyz)$")
for test in ["text.txt", "other.oth", "abc.xyz"]:
    if regobj.match(test):
        print("Method 1:", test)
regobj = re.compile(r"^.+\.[^txt]$")
for test in ["text.txt", "other.oth", "abc.xyz"]:
    if regobj.match(test):
        print("Method 2:", test)
I would like the second method to find every file that does not have the extension txt, but my attempt does not work. What am I doing wrong?

Regular expressions are overkill here. Use the str.endswith() method:
if not test.endswith('.txt'):
Your regular expression uses a negated character class, which is a set of characters that should not be matched. [^txt] matches exactly one character that is not a t or an x, so the pattern only accepts one-character extensions and says nothing about the extension as a whole. You could have explicitly matched .txt and used not to exclude rather than include:
regobj = re.compile(r"^.+\.txt$")
if not regobj.match(test):
If all you can use is regular expressions, use a negative look-ahead assertion:
regobj = re.compile(r"^[^.]+\.(?!txt$)[^.]+$")
Here (?!txt$) only matches at locations that are not followed by a literal txt running to the end of the string. The [^.]+ then matches one or more characters that are not a . until the end of the string.
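The approaches above can be compared on the three names from the question. This is a minimal sketch, and the list names are only illustrative.

```python
import re

names = ["text.txt", "other.oth", "abc.xyz"]

# 1. Plain string method
by_endswith = [n for n in names if not n.endswith(".txt")]

# 2. Match .txt explicitly and negate the result
txt_re = re.compile(r"^.+\.txt$")
by_negation = [n for n in names if not txt_re.match(n)]

# 3. Negative look-ahead inside the pattern
not_txt_re = re.compile(r"^[^.]+\.(?!txt$)[^.]+$")
by_lookahead = [n for n in names if not_txt_re.match(n)]

# The pattern from the question: [^txt] is a single character, so nothing matches
question_re = re.compile(r"^.+\.[^txt]$")
by_question = [n for n in names if question_re.match(n)]

print(by_endswith, by_negation, by_lookahead, by_question)
```

All three working variants select other.oth and abc.xyz, while the question's pattern selects nothing.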

Change the second regex to,
regobj = re.compile(r"^.+\.(?!txt$)[^.]+$")
[^txt] matches any single character that is not a t or an x. (?!txt$) asserts that the dot is not followed by txt at the end of the string, and the [^.]+ after \. requires at least one character right after the dot. So this matches file names that have any extension other than .txt

As Martijn Pieters mentioned regex is overkill, considering there are other more efficient ways:
fileName, fileExt = os.path.splitext(string)
Using splitext it's simple to isolate the extension.
import os
fileDict = ["text.txt", "other.oth", "abc.xyz"]
matchExt = ".txt"
for eachFile in fileDict:
    fileName, fileExt = os.path.splitext(eachFile)
    if matchExt not in fileExt:
        print("(not %s) %s %s" % (matchExt, fileExt, fileName))
You can easily add an else statement to match other extensions, which I'll leave up to you.

Matching "~" at the end of a filename with a python regular expression

I'm working on a Python script to find some files. I compare the file names against a regular expression pattern. Now I have to find files ending with a "~" (tilde), so I built this regex:
if re.match("~$", string_test):
    print("ok!")
Python doesn't seem to recognize the regex, and I don't know why. I tried the same regex in other languages and it works perfectly. Any idea?
P.S.: I read on a website that I have to insert
# -*- coding: utf-8 -*-
but it doesn't help :( .
Thanks a lot. Meanwhile, I'm going to keep reading to see if I find something.

re.match() is only successful if the regular expression matches at the beginning of the input string. To search for any substring, use re.search() instead:
if re.search("~$", string_test):
    print("ok!")

Your regex will only match strings "~" and (believe it or not) "~\n".
You need re.match(r".*~$", whatever) ... that means zero or more of (anything except a newline) followed by a tilde followed by (end-of-string or a newline preceding the end of string).
In the unlikely event that a filename can include a newline, use the re.DOTALL flag and use \Z instead of $.
It "worked" in other languages because you must have used a search function there.
The r at the beginning of a string constant marks a raw string: '\n' is a newline, but r'\n' is two characters, a backslash followed by n, which can also be written as '\\n'. Raw strings save a lot of \\ in regexes; one should use r"regex" automatically.
BTW: in this case avoid the regex confusion ... use whatever.endswith('~')
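The points above can be checked with a short sketch. The file name notes.txt~ is a made-up example.

```python
import re

name = "notes.txt~"

print(re.match("~$", name))          # None: match() is anchored at the start
print(bool(re.search("~$", name)))   # search() scans the whole string
print(bool(re.match(r".*~$", name)))
print(name.endswith("~"))

# $ also matches just before a trailing newline; \Z does not
print(bool(re.match("~$", "~\n")))
print(bool(re.search(r"~\Z", "notes.txt~\n")))
```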

For finding files, use glob instead:
import os
import glob
path = '/path/to/files'
os.chdir(path)
files = glob.glob('./*~')
print(files)

The correct regex and the glob solution have already been posted. Another option is to use the fnmatch module:
import fnmatch
if fnmatch.fnmatch(string_test, "*~"):
    print("ok!")
This is a tiny bit easier than using a regex. Note that all methods posted here are essentially equivalent: fnmatch is implemented using regular expressions, and glob in turn uses fnmatch.
Note that only in 2009 a patch was added to fnmatch (after six years!) that added support for file names with newlines.
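The relationship between fnmatch and regular expressions can be inspected with fnmatch.translate, which returns the regex text for a shell pattern. This is a small sketch; the exact regex text differs between Python versions, so no output is shown for it.

```python
import fnmatch
import re

# fnmatch.translate returns the regular expression text for a shell pattern;
# its exact form differs between Python versions, so it is only printed here
regex_text = fnmatch.translate("*~")
print(regex_text)

compiled = re.compile(regex_text)
print(bool(compiled.match("notes.txt~")))
print(bool(compiled.match("notes.txt")))
```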
