Regex to extract file paths except urls - python

I have a large text containing some file paths, and I need a regex which can help me extract all the paths. Currently I'm using this one:
\/.+?\/[\w]+\.\w+
It works almost perfectly, but links containing filename or a dot at the end are interpreted as paths too, like this one:
http://example.com/index.html
Help in providing a valid regular expression is highly appreciated. Also if you can add support of spaces in paths in this regex, it would be awesome. Thanks in advance!

You could try a negative look-behind to exclude the "http:" and "https:" prefix.
(?<!https:)(?<!http:)(?<!/)(?<!\w)((/[^\s]+)?/\w+\.\w+)
If you try it with these test strings in pythex:
/abc/def/def.ps
/abc/def/ttt/def.ps
/test.txt
/abc/test.txt http://example.com/index.html
http://www.google.com/bla/test/index.html https://www.google.com/bla/test/index.html
It will only match the first 4.
The advantage of this regular expression is that it does not rely on the beginning of the line to work.
You can add as many look behinds as you wish to support other protocols, like ftp, etc.
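As a rough sketch of how this could be used from Python: the sample text below is just the test strings above joined into one string, and the names PATH_RE and sample are only illustrative. re.finditer with group(0) returns each whole match, which avoids the tuples re.findall would produce for a pattern with two groups.

```python
import re

# The look-behind pattern from this answer, compiled once.
PATH_RE = re.compile(r"(?<!https:)(?<!http:)(?<!/)(?<!\w)((/[^\s]+)?/\w+\.\w+)")

# The test strings from this answer, joined into one block of text.
sample = (
    "/abc/def/def.ps\n"
    "/abc/def/ttt/def.ps\n"
    "/test.txt\n"
    "/abc/test.txt http://example.com/index.html\n"
    "http://www.google.com/bla/test/index.html "
    "https://www.google.com/bla/test/index.html"
)

# group(0) is the full match, regardless of the capturing groups inside it.
paths = [m.group(0) for m in PATH_RE.finditer(sample)]
print(paths)
# ['/abc/def/def.ps', '/abc/def/ttt/def.ps', '/test.txt', '/abc/test.txt']
```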

Try this : ^\/.+?\/[\w]+\.\w+$ with multi-line mode enabled.
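A minimal sketch of that suggestion in Python, assuming every path or URL sits on its own line (the sample text and variable names are made up for illustration). In Python, multi-line mode is the re.MULTILINE flag.

```python
import re

# The anchored pattern from this answer; re.MULTILINE makes ^ and $
# match at the start and end of every line, not just of the whole text.
LINE_PATH_RE = re.compile(r"^\/.+?\/[\w]+\.\w+$", re.MULTILINE)

# Hypothetical input with one item per line.
sample = (
    "/abc/def/def.ps\n"
    "http://example.com/index.html\n"
    "/abc/def/ttt/def.ps"
)

line_paths = LINE_PATH_RE.findall(sample)
print(line_paths)
# ['/abc/def/def.ps', '/abc/def/ttt/def.ps']
```

Note that this pattern needs at least two slashes, so a top-level path such as /test.txt is not matched, and a line that holds a path followed by a URL is matched as a whole.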

Related

Regex expression to match last numerical component, but exclude file extension

I'm stumped trying to figure out a regex expression. Given a file path, I need to match the last numerical component of the path ("frame" number in an image sequence), but also ignore any numerical component in the file extension.
For example, given path:
/path/to/file/abc123/GCAM5423.xmp
The following expression will correctly match 5423.
((?P<index>(?P<padding>0*)\d+)(?!.*(0*)\d+))
However, this expression fails if for example the file extension contains a number as follows:
/path/to/file/abc123/GCAM5423.cr2
In this case the expression will match the 2 in the file extension, when I still need it to match 5423. How can I modify the above expression to ignore file extensions that have a numerical component?
Using python flavor of regex. Thanks in advance!
Edit: Thanks all for your help! To clarify, I specifically need to modify the above expression to only capture the last group. I am passing this pattern to an external library so it needs to include the named groups and to only match the last number prior to the extension.
You can try this one:
\/[a-zA-Z]*(\d*)\.[a-zA-Z0-9]{3,4}$
Try this pattern:
\/[^/\d\s]+(\d+)\.[^/]+$
See Regex Demo
Code:
import re
pattern = r"\/[^/\d\s]+(\d+)\.[^/]+$"
texts = ['/path/to/file/abc123/GCAM5423.xmp', '/path/to/file/abc123/GCAM5423.cr2']
print([match.group(1) for x in texts if (match := re.search(pattern, x))])
Output:
['5423', '5423']
Step1: Find substring before last dot.
(.*)\.
Input: /path/to/file/abc123/GCAM5423.cr2
Output: /path/to/file/abc123/GCAM5423
Step2: Find the last numbers using your regex.
Input: /path/to/file/abc123/GCAM5423
Output: 5423
I don't know how to join these two regexes, but I hope this is still useful to you. ^_^
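The two steps above could be chained in Python roughly like this. The function name last_number_before_extension is hypothetical, and this sketch runs two separate searches, so it does not satisfy the asker's requirement of a single pattern for an external library.

```python
import re

# Step 2 pattern: the asker's original expression with its named groups.
LAST_NUMBER_RE = re.compile(r"(?P<index>(?P<padding>0*)\d+)(?!.*(0*)\d+)")

def last_number_before_extension(path):
    """Return the last run of digits that appears before the final dot."""
    # Step 1: the greedy (.*) runs up to the last dot, dropping the extension.
    stem_match = re.match(r"(.*)\.", path)
    stem = stem_match.group(1) if stem_match else path
    # Step 2: find the last number in what is left.
    number_match = LAST_NUMBER_RE.search(stem)
    return number_match.group("index") if number_match else None

print(last_number_before_extension("/path/to/file/abc123/GCAM5423.cr2"))  # 5423
print(last_number_before_extension("/path/to/file/abc123/GCAM5423.xmp"))  # 5423
```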

Regular expression in python to get the last occurence of a file extension in a URL or path

Given a long url or path how do I get the last file extension in it. For example consider these two strings.
url = 'https://image.freepik.com/free-vector/vector-chickens-full-emotions_75487-787.jpg?x=2'
path = './image.freepik.com/free-vector/vector-chickens-full-emotions_75487-787.abc.jpg'
The last extension is jpg and comes after the last . and before the following non-alphanumerics or end-of-string.
There are similar questions to mine but I can't find an exact match.
re.search(r'\.(\w+)(?!.*\.)', url).group(1)
Use a negative lookahead to search for a match that isn't followed by any more dots.
Parsing rules are different for filenames and URLs, so don't write a single regex to handle both; it's not simple and not worth your time.
Instead, make a test of some sort to determine what type of object you are looking at, i.e. whether it is or is not a URL. This could be as simple as: does it start with http://? Then it is a URL; if not, it is not a URL.
Then apply the specific rule to the specific type.
Always make use of standard tools, they have often already figured out the corner cases or things you will forget.
The URL parser: https://docs.python.org/3/library/urllib.parse.html
Then, for files use: os.path.splitext(path)
in the standard python library: https://docs.python.org/3/library/os.path.html
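A small sketch of that approach, assuming a simple prefix check is enough to tell URLs from paths. The function name last_extension is made up for this example; urlparse and os.path.splitext are the standard-library calls mentioned above.

```python
import os.path
from urllib.parse import urlparse

def last_extension(target):
    """Return the last file extension of a URL or path, without the dot."""
    # Simple test: anything starting with http:// or https:// is a URL.
    if target.startswith(("http://", "https://")):
        # Keep only the path component, dropping the query string and host.
        target = urlparse(target).path
    # splitext splits at the last dot of the final path component.
    return os.path.splitext(target)[1].lstrip(".")

url = 'https://image.freepik.com/free-vector/vector-chickens-full-emotions_75487-787.jpg?x=2'
path = './image.freepik.com/free-vector/vector-chickens-full-emotions_75487-787.abc.jpg'
print(last_extension(url))   # jpg
print(last_extension(path))  # jpg
```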

Python regex to exclude several words

I am trying to search for URLs and want to exclude some. In the variable download_artist I stored the base URL, and I want to find additional links, but not uploads, favorites, followers or listens.
So I tried different versions with the mentioned words and a |. Like:
urls = re.findall(rf'^{download_artist}uploads/|{download_artist}^favorites/|^{download_artist}followers/|^{download_artist}listens/|{download_artist}\S+"', response.text, re.IGNORECASE)
or:
urls = re.findall(rf'{download_artist}^uploads/|^favorites/|^followers/|^listens/|\S+"', response.text, re.IGNORECASE)
But it ignores my ^ for excluding the words. Where is my mistake?
You need to use a "lookaround" in this case; you can see more details at https://www.regular-expressions.info/lookaround.html.
So I think this regex solves your problem:
{download_artist}(?!uploads/|favorites/|followers/|listens/)\S+\"
You can test whether the regex works at https://regex101.com/. This site is very useful when you work with regex.
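Here is a hedged sketch of how that pattern might be used. The base URL and the page text are invented for illustration, and re.escape is an addition not in the answer above: it keeps the dots in the base URL from being read as regex wildcards.

```python
import re

# Hypothetical base URL and page text, standing in for download_artist
# and response.text from the question.
download_artist = "https://example.com/artist/"
page = (
    '<a href="https://example.com/artist/uploads/">uploads</a>\n'
    '<a href="https://example.com/artist/track-one/">one</a>\n'
    '<a href="https://example.com/artist/favorites/">favorites</a>\n'
    '<a href="https://example.com/artist/track-two/">two</a>\n'
)

# The negative lookahead rejects a match when one of the listed
# words follows the base URL directly.
pattern = rf'{re.escape(download_artist)}(?!uploads/|favorites/|followers/|listens/)\S+"'
matches = re.findall(pattern, page, re.IGNORECASE)

# Each match ends with the closing quote of the href, so strip it.
urls = [m.rstrip('"') for m in matches]
print(urls)
# ['https://example.com/artist/track-one/', 'https://example.com/artist/track-two/']
```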
^ only works as a negation in character classes inside [], outside it represents the beginning of the input.
I suggest you do two matches: One to match all urls and another one to match the ones to exclude. Then remove the second set of urls from the first one.
That will keep the regexes simple and readable.
If you have to do it in one regex for whatever reason you can try to solve it with (negative) lookaround pattern (see https://www.rexegg.com/regex-lookarounds.html).
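The two-match idea from this answer could look roughly like the following. The base URL and page text are hypothetical, and the two patterns are only one possible way to write the "all URLs" and "excluded URLs" matches.

```python
import re

# Hypothetical base URL and page text.
base = "https://example.com/artist/"
page = (
    '<a href="https://example.com/artist/uploads/">uploads</a>\n'
    '<a href="https://example.com/artist/track-one/">one</a>\n'
    '<a href="https://example.com/artist/favorites/">favorites</a>\n'
    '<a href="https://example.com/artist/track-two/">two</a>\n'
)
escaped = re.escape(base)

# First match: every URL under the base, up to whitespace or a quote.
all_urls = set(re.findall(rf'{escaped}[^\s"]+', page))

# Second match: only the URLs that should be excluded.
excluded = set(
    re.findall(rf'{escaped}(?:uploads|favorites|followers|listens)/[^\s"]*', page)
)

# Remove the second set from the first.
wanted = sorted(all_urls - excluded)
print(wanted)
# ['https://example.com/artist/track-one/', 'https://example.com/artist/track-two/']
```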

Regex pattern to match two datetime formats

I am doing a directory listing and need to get all directory names that follow the pattern: Feb14-2014 and 14022014-sometext. The directory names must not contain dots, so I don't want to match 14022014-sometext.more. As you can see, I want to match just the directories that follow the pattern %b%d-%Y and %d%m%Y-textofanylengthWithoutDots.
For the first case it should be something like [a-zA-Z]{3}\d{2}. I don't know how to parse the rest because my regex skills are poor, sorry. So I hope someone can tell me what the correct patterns look like. Thanks.
I am assuming each directory listing is separated by a new line
([A-Z]\w{2}\d{1,2}\-\d{4}|\d{7,8}\-\w+)$
Will match both cases and will match the text only if it is uninterrupted (by dots or anything else for that matter) until it hits the end of the line.
Some notes:
If you want to match everything except dot you may replace the final \w+ with [^.]+.
You need the multiline modifier (/m, or re.MULTILINE in Python) for this to work; otherwise the $ will match only at the end of the string.
I've not added a ^ to the start of the regex, but you may do so if each line contains a single directory
Of course you may expand this regex to include (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) instead of [A-Z]\w{2}. I've not done this to keep it readable. I would also suggest you store the month names in a Python list and insert them dynamically into your regex for maintainability's sake.
See it in action: http://regex101.com/r/pS6iY9
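Putting the notes above together, a possible Python version might look like this. It assumes one directory name per line, adds the optional ^ anchor, and builds the month alternation from a list; MONTHS, DIR_RE and the sample listing are illustrative names and data.

```python
import re

# Month abbreviations kept in a list and joined into the pattern.
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
month_alt = "|".join(MONTHS)

# Doubled braces are literal braces inside an f-string.
# re.MULTILINE lets ^ and $ match at every line boundary.
DIR_RE = re.compile(
    rf"^(?:(?:{month_alt})\d{{1,2}}-\d{{4}}|\d{{7,8}}-\w+)$",
    re.MULTILINE,
)

# Hypothetical directory listing, one name per line.
listing = "Feb14-2014\n14022014-sometext\n14022014-sometext.more\nnotes\n"

dirs = DIR_RE.findall(listing)
print(dirs)
# ['Feb14-2014', '14022014-sometext']
```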
That's quite easy.
The best one I can make is:
((Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\d\d-\d\d\d\d)|(\d\d\d\d\d\d\d\d-\w+)
The first part ((Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\d\d-\d\d\d\d) matches the first kind of dates and the second part (\d\d\d\d\d\d\d\d-\w+) - the second kind.

De-greedifying a regular expression in python

I'm trying to write a regular expression that will convert a full path filename to a short filename for a given filetype, minus the file extension.
For example, I'm trying to get just the name of the .bar file from a string using
re.search('/(.*?)\.bar$', '/def_params/param_1M56/param/foo.bar')
According to the Python re docs, *? is the ungreedy version of *, so I was expecting to get
'foo'
returned for match.group(1) but instead I got
'def_params/param_1M56/param/foo'
What am I missing here about greediness?
What you're missing isn't so much about greediness as about regular expression engines: they work from left to right, so the / matches as early as possible and the .*? is then forced to work from there. In this case, the best regex doesn't involve greediness at all (you need backtracking for that to work; it will, but could take a really long time to run if there are a lot of slashes), but a more explicit pattern:
'/([^/]*)\.bar$'
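To make the difference concrete, here is a short comparison of the two patterns on the path from the question (the variable names are only for illustration):

```python
import re

path = '/def_params/param_1M56/param/foo.bar'

# Lazy version: the match starts at the first '/', and .*? then has to
# grow until \.bar$ can match, so it swallows the directories too.
lazy = re.search(r'/(.*?)\.bar$', path)

# Explicit version: [^/]* cannot cross a '/', so only the last path
# component can satisfy the pattern.
explicit = re.search(r'/([^/]*)\.bar$', path)

print(lazy.group(1))      # def_params/param_1M56/param/foo
print(explicit.group(1))  # foo
```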
I would suggest changing your regex so that it doesn't rely on greediness.
You want only the filename before the extension .bar and everything after the final /. This should do:
re.search(r'/([^/]*)\.bar$', '/def_params/param_1M56/param/foo.bar')
What this does is it matches /, then zero or more characters (as much as possible) that are not / and then .bar.
I don't claim to understand the non-greedy operators all that well, but a solution for that particular problem would be to use ([^/]*?)
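The regular expression engine starts matching from the left, so the first / it finds anchors the match. Put a greedy .* at the start, so that everything up to the last / is consumed first, and it should work.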
I like regex but there is no need of one here.
path = '/def_params/param_1M56/param/foo.bar'
print(path.rsplit('/',1)[1].rsplit('.')[0])
path = '/def_params/param_1M56/param/fululu'
print(path.rsplit('/',1)[1].rsplit('.')[0])
path = '/def_params/param_1M56/param/one.before.two.dat'
print(path.rsplit('/',1)[1].rsplit('.',1)[0])
result
foo
fululu
one.before.two
Other people have answered the regex question, but in this case there's a more efficient way than regex:
file_name = path[path.rindex('/')+1 : path.rindex('.')]
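As another regex-free option, not taken from the answers above, the standard library's pathlib can do the same job. The sketch below uses PurePosixPath so that '/' is always treated as the separator; the helper name short_name is hypothetical. Unlike the rindex version, it does not raise ValueError when the path contains no dot.

```python
from pathlib import PurePosixPath

def short_name(path):
    """Return the final path component without its last extension."""
    # .stem drops the directories and only the last suffix.
    return PurePosixPath(path).stem

print(short_name('/def_params/param_1M56/param/foo.bar'))             # foo
print(short_name('/def_params/param_1M56/param/fululu'))              # fululu
print(short_name('/def_params/param_1M56/param/one.before.two.dat'))  # one.before.two
```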
Try this one on for size:
match = re.search(r'.*/(.*?)\.bar$', '/def_params/param_1M56/param/foo.bar')
