Extract file name with a regular expression - python

I want to create a regular expression to extract the filename from a URL such as
https://example.net/img/src/img.jpg
I want to extract img.jpg
I used urlparse from Python, but it extracts the whole path, like this:
/img/src/img.jpg
How can I extract the file name with a regular expression?

Using str.split and negative indexing
url = "https://example.net/img/src/img.jpg"
print(url.split("/")[-1])
Output:
img.jpg
or using os.path.basename
import os
from urllib.parse import urlparse  # Python 3; in Python 2 this lived in the urlparse module
url = "https://example.net/img/src/img.jpg"
a = urlparse(url)
print(os.path.basename(a.path))  # ---> img.jpg
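One caveat with the split approach above: if the URL carries a query string or fragment, url.split("/")[-1] returns it along with the filename. Below is a minimal sketch that parses the URL first, assuming Python 3. The helper name url_basename and the sample URL ending in ?size=large#top are made up for illustration.

```python
import posixpath
from urllib.parse import urlparse

def url_basename(url):
    """Return the last path segment of a URL, ignoring query string and fragment."""
    # urlparse separates the path from the query and fragment;
    # posixpath is used because URL paths always use "/" as the separator.
    return posixpath.basename(urlparse(url).path)

url = "https://example.net/img/src/img.jpg?size=large#top"  # hypothetical URL
print(url.split("/")[-1])  # img.jpg?size=large#top
print(url_basename(url))   # img.jpg
```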

You can either split on / and select the last element of the returned list (the best solution in my opinion),
or, if you really want to use a regex, you can use the following one:
(?<=\/)(?:(?:\w+\.)*\w+)$
Note that it only accepts filenames made of word characters (letters, digits, underscores) separated by dots.
You can replace \w with a wider character class to accept other characters if necessary.
Explanation:
(?<=\/) is a positive lookbehind on /, and $ adds the constraint that the filename is the last element of the path.
(?:(?:\w+\.)*\w+) matches words composed of letters, digits, and possibly underscores, each followed by a dot; this group can be repeated as many times as necessary (an xxx.tar.gz file, for example) and is then followed by the final extension.
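Here is a small sketch of that regex applied with re.search. The helper name filename_from_url and the second and third sample URLs are illustrative assumptions, not part of the original answer.

```python
import re

# Pattern from the answer: "/" lookbehind, dot-separated word parts, end of string.
FILENAME_RE = re.compile(r"(?<=/)(?:(?:\w+\.)*\w+)$")

def filename_from_url(url):
    """Return the trailing filename, or None if the pattern does not match."""
    match = FILENAME_RE.search(url)
    return match.group(0) if match else None

print(filename_from_url("https://example.net/img/src/img.jpg"))     # img.jpg
print(filename_from_url("https://example.net/pkg/archive.tar.gz"))  # archive.tar.gz
# A hyphen is not matched by \w, so this filename is rejected entirely.
print(filename_from_url("https://example.net/img/my-file.jpg"))     # None
```

The last call shows why the character class may need widening, for example to [\w-], when filenames contain hyphens.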

If your URL pattern is static, you can use a positive lookahead:
import re
pattern = r'\w+(?=\.jpg)'
text = "https://example.net/img/src/img.jpg"
print(re.findall(pattern, text)[0])
Output:
img

Related

Regex expression to match last numerical component, but exclude file extension

I'm stumped trying to figure out a regex expression. Given a file path, I need to match the last numerical component of the path ("frame" number in an image sequence), but also ignore any numerical component in the file extension.
For example, given path:
/path/to/file/abc123/GCAM5423.xmp
The following expression will correctly match 5423.
((?P<index>(?P<padding>0*)\d+)(?!.*(0*)\d+))
However, this expression fails if for example the file extension contains a number as follows:
/path/to/file/abc123/GCAM5423.cr2
In this case the expression will match the 2 in the file extension, when I still need it to match 5423. How can I modify the above expression to ignore file extensions that have a numerical component?
I'm using the Python flavor of regex. Thanks in advance!
Edit: Thanks all for your help! To clarify, I specifically need to modify the above expression to only capture the last group. I am passing this pattern to an external library so it needs to include the named groups and to only match the last number prior to the extension.
You can try this one:
\/[a-zA-Z]*(\d*)\.[a-zA-Z0-9]{3,4}$
Try this pattern:
\/[^/\d\s]+(\d+)\.[^/]+$
Code:
import re
pattern = r"\/[^/\d\s]+(\d+)\.[^/]+$"
texts = ['/path/to/file/abc123/GCAM5423.xmp', '/path/to/file/abc123/GCAM5423.cr2']
print([match.group(1) for x in texts if (match := re.search(pattern, x))])
Output:
['5423', '5423']
Step 1: Find the substring before the last dot.
(.*)\.
Input: /path/to/file/abc123/GCAM5423.cr2
Output: /path/to/file/abc123/GCAM5423
Step 2: Find the last number in that substring using your regex.
Input: /path/to/file/abc123/GCAM5423
Output: 5423
I don't know how to combine these two regexes into one, but I hope this is still useful to you.
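The two steps can be sketched in plain Python as follows. This is only an illustration of the idea: it uses os.path.splitext for step 1 and a simple \d+ search for step 2 instead of the named-group pattern from the question, and the function name last_number_before_extension is made up.

```python
import os
import re

def last_number_before_extension(path):
    """Return the last run of digits that appears before the file extension, or None."""
    stem, _ext = os.path.splitext(path)  # Step 1: drop the extension (".xmp", ".cr2", ...)
    numbers = re.findall(r"\d+", stem)   # Step 2: collect every digit run in what remains
    return numbers[-1] if numbers else None

print(last_number_before_extension("/path/to/file/abc123/GCAM5423.xmp"))  # 5423
print(last_number_before_extension("/path/to/file/abc123/GCAM5423.cr2"))  # 5423
```

Because the extension is removed before searching, the digit in .cr2 is never considered.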

How to extract certain pattern from a url using regex in Python?

I have a bunch of URLs like the ones below:
https://data.hova.com/strap/nik/sql_output1574414532.89.zip
https://data.hova.com/strap/asr/sql_output1574414532.89.zip
https://data.hova.com/strap/olr/sql_output1574414532.89.zip
Now I want to extract just the zip file name ie sql_output1574414532.89.zip, sql_output1574414532.89.zip, sql_output1574414532.89.zip respectively.
Now I could have used a simple split to get the filenames but if you observe, the directory name before the zip file changes like nik, asr, olr etc.
So I want to use regex so that I only look at anything that starts with sql and ends with zip.
So this is what I did
import re
string = "https://data.hova.com/strap/nik/sql_output1574414532.89.zip"
pattern = r'^sql\.zip$'
match = re.search(pattern, string)
print(match)
But the match comes as None. What am I doing wrong?
The pattern r'^sql\.zip$' matches only one string: "sql.zip".
For your purpose you need something like sql.+zip$, or, if you expect that sql string can be encountered in URL before file name, change it to sql[^/]+zip$.
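A quick check of both patterns against the URLs from the question (a sketch only; the variable names broken, fixed, and names are illustrative):

```python
import re

urls = [
    "https://data.hova.com/strap/nik/sql_output1574414532.89.zip",
    "https://data.hova.com/strap/asr/sql_output1574414532.89.zip",
    "https://data.hova.com/strap/olr/sql_output1574414532.89.zip",
]

broken = re.compile(r"^sql\.zip$")   # pattern from the question: matches only "sql.zip"
fixed = re.compile(r"sql[^/]+zip$")  # pattern suggested in the answer

names = []
for url in urls:
    match = fixed.search(url)
    names.append(match.group(0) if match else None)
    print(broken.search(url), names[-1])
# Each line prints: None sql_output1574414532.89.zip
```

Since [^/] cannot cross a slash, the match is confined to the last path segment, regardless of whether the directory is nik, asr, or olr.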

regex for finding file paths

I used the regex (\/.*\.[\w:]+) to find all file paths and directories. But in a line like "file path /log/file.txt some lines /log/var/file2.txt", which contains two paths, it does not select the paths individually; instead, it selects the whole line. How can I solve this?
Use the regex (\/.*?\.[\w:]+) to make the match non-greedy. If you want to find multiple matches in the same line, you can use re.findall().
Update:
Using this code and the example provided, I get:
import re
re.findall(r'(\/.*?\.[\w:]+)', "file path /log/file.txt some lines /log/var/file2.txt")
['/log/file.txt', '/log/var/file2.txt']
Your regex (\/.*\.[\w:]+) uses .* which is greedy and would match [\w:]+ after the last dot in file2.txt. You could use .*? instead.
But it would also match /log////var////.txt
As an alternative you might use a repeating non greedy pattern that would match the directory structure (?:/[^/]+)+? followed by a part that matches the filename /\w+\.\w+
(?:/[^/]+)+?/\w+\.\w+
import re
s = "file path /log/file.txt some lines /log/var/file2.txt or /log////var////.txt"
print(re.findall(r'(?:/[^/]+)+?/\w+\.\w+', s))
That would result in:
['/log/file.txt', '/log/var/file2.txt']
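You can use Python's re module, something like this:
import re
msg = "file path /log/file.txt some lines /log/var/file2.txt"
# Raw string avoids invalid-escape warnings; digits are included so file2.txt matches in full
matches = re.findall(r"(/[a-zA-Z0-9./]*)", msg)
print(matches)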
Ref: https://docs.python.org/2/library/re.html#finding-all-adverbs

How to regex split, but keep the split string?

I have the following URL pattern:
http://www.hulu.jp/watch/589851/supernatural-dub-hollywood-babylon/en
I would like to get everything up until and inclusive of /watch/\d+/.
So far I have:
>>> re.split(r'watch/\d+/', 'http://www.hulu.jp/watch/589851/supernatural-dub-hollywood-babylon/en')
['http://www.hulu.jp/', 'supernatural-dub-hollywood-babylon/en']
But this does not include the split string (the string which appears between the domain and the path). The end answer I want to achieve is:
http://www.hulu.jp/watch/589851
You need to use a capture group:
>>> re.split(r'(watch/\d+/)', 'http://www.hulu.jp/watch/589851/supernatural-dub-hollywood-babylon/en')
['http://www.hulu.jp/', 'watch/589851/', 'supernatural-dub-hollywood-babylon/en']
As mentioned in the other answer, you need to use groups to capture the "glue" between the split strings.
I wonder, though: is what you want here a split() or a search()? From the sample, it looks like you're trying to extract from a URL everything from the first occurrence of /watch/XXX/, where XXX is one or more digits, to the end of the string. If that's the case, a match/search might be more suitable, because with a split, if the regex can match multiple times, you'll get multiple pieces. For example:
re.split(r'(watch/\d+/)', 'http://www.hulu.jp/watch/589851/supernatural-dub-hollywood-babylon/watch/2342/fdsaafsdf')
['http://www.hulu.jp/', 'watch/589851/', 'supernatural-dub-hollywood-babylon/', 'watch/2342/', 'fdsaafsdf']
Which doesn't look like what you want. Instead perhaps:
result = re.search(r'(watch/\d+/)(.*)', 'http://www.hulu.jp/watch/589851/supernatural-dub-hollywood-babylon/watch/2342/fdsaafsdf')
result.groups() if result else []
which gives:
('watch/589851/', 'supernatural-dub-hollywood-babylon/watch/2342/fdsaafsdf')
You could also use this approach combined with named groups to get extra fancy:
result = re.search(r'(?P<watchId>watch/\d+/)(?P<path>.*)', 'http://www.hulu.jp/watch/589851/supernatural-dub-hollywood-babylon/watch/2342/fdsaafsdf')
result.groupdict() if result else {}
giving:
{'path': 'supernatural-dub-hollywood-babylon/watch/2342/fdsaafsdf', 'watchId': 'watch/589851/'}
If you're set on the split() approach, you can also set the maxsplit parameter to ensure it's only split once:
re.split(r'(watch/\d+/)', 'http://www.hulu.jp/watch/589851/supernatural-dub-hollywood-babylon/watch/2342/fdsaafsdf', maxsplit=1)
giving:
['http://www.hulu.jp/', 'watch/589851/', 'supernatural-dub-hollywood-babylon/watch/2342/fdsaafsdf']
Personally, though, I find that when parsing URLs into their constituent parts, search() with named groups works extremely well: it lets you name the various parts in the regex itself, and groupdict() gives you a nice dictionary for working with those parts.
You've surely seen the Stack Overflow don't-parse-HTML-with-regex post, yes?
You can't parse [X]HTML with regex. Because HTML can't be parsed by regex. Regex is not a tool that can be used to correctly parse HTML. As I have answered in HTML-and-regex questions here so many times before, the use of regex will not allow you to consume HTML.
Well, regex can parse URLs, but trying to do so when there's a plethora of better tools is foolish.
This is what a regex for URLs looks like:
^(?:(?:https?|ftp):\/\/)(?:\S+(?::\S*)?#)?(?:(?!10(?:\.\d{1,3}){3})(?!127(?:\.\d{1,3}){3})(?!169\.254(?:\.\d{1,3}){2})(?!192\.168(?:\.\d{1,3}){2})(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\x{00a1}-\x{ffff}0-9]+-?)*[a-z\x{00a1}-\x{ffff}0-9]+)(?:\.(?:[a-z\x{00a1}-\x{ffff}0-9]+-?)*[a-z\x{00a1}-\x{ffff}0-9]+)*(?:\.(?:[a-z\x{00a1}-\x{ffff}]{2,})))(?::\d{2,5})?(?:\/[^\s]*)?$ (+ caseless flag)
It's just a mess of characters, right? Exactly!
Don't parse URLs with regex... almost.
There is one simple thing:
A path-relative URL must be zero or more path segments separated from each other by a "/".
Splitting the URL should be as simple as url.split("/").
from urllib.parse import urlparse, urlunparse
myurl = "http://www.hulu.jp/watch/589851/supernatural-dub-hollywood-babylon/en"
# Run a parser over it
parts = urlparse(myurl)
# Keep the leading empty segment plus the first two path segments ("watch", "589851")
new_path = "/".join(parts.path.split("/")[:3])
# Unparse
urlunparse(parts._replace(path=new_path))
#>>> 'http://www.hulu.jp/watch/589851'
You can try the following regex:
.*\/watch\/\d+
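A short sketch of how that pattern behaves with re.match. Note that the greedy .* anchors on the last /watch/<digits> in the string, so a lazy .*? is shown alongside it for URLs that may contain the segment twice. The variable names are illustrative.

```python
import re

url = "http://www.hulu.jp/watch/589851/supernatural-dub-hollywood-babylon/en"
twice = "http://www.hulu.jp/watch/589851/supernatural-dub-hollywood-babylon/watch/2342/fdsaafsdf"

greedy = re.compile(r".*/watch/\d+")  # the pattern above, without the redundant escapes
lazy = re.compile(r".*?/watch/\d+")   # stops at the first /watch/<digits>

print(greedy.match(url).group(0))    # http://www.hulu.jp/watch/589851
print(greedy.match(twice).group(0))  # ends at /watch/2342
print(lazy.match(twice).group(0))    # http://www.hulu.jp/watch/589851
```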

Python regular expressions matching within set

While testing on http://gskinner.com/RegExr/ (online regex tester), the regex [jpg|bmp] returns results when either jpg or bmp exist, however, when I run this regex in python, it only return j or b. How do I make the regex take the whole word "jpg" or "bmp" inside the set ? This may have been asked before however I was not sure how to structure question to find the answer. Thanks !!!
Here is the whole regex if it helps
"http://www\S*(?i)\\.(jpg|bmp|png|gif|img|jng|jpeg|jpe|gif|giff)"
Its just basically to look for pictures in a url
Use (jpg|bmp) instead of square brackets.
Square brackets mean - match a character from the set in the square brackets.
Edit - you might want something like that: [^ ].*?(jpg|bmp) or [^ ].*?\.(jpg|bmp)
When you use [], you are creating a character class that contains all the characters between the brackets.
So you are not matching jpg or bmp; you are matching a single character that is either a j or a p or a g or a | ...
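The difference is easy to see with re.findall. This is a minimal sketch; the sample text is made up.

```python
import re

text = "photo.jpg and scan.bmp"  # hypothetical sample text

# Character class: each match is ONE character out of j, p, g, |, b, m.
single_chars = re.findall(r"[jpg|bmp]", text)
print(single_chars)  # ['p', 'j', 'p', 'g', 'b', 'm', 'p']

# Alternation inside a (non-capturing) group: each match is a whole word.
whole_words = re.findall(r"(?:jpg|bmp)", text)
print(whole_words)   # ['jpg', 'bmp']
```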
You should add an anchor for the end of the string ($) to your regex, to ensure that it checks for the file ending at the very end of the string:
http://www\S*(?i)\\.(jpg|bmp|png|gif|img|jng|jpeg|jpe|gif|giff)$
If you need double escaping, then apply it everywhere in your pattern (\\S as well as \\.):
http://www\\S*(?i)\\.(jpg|bmp|png|gif|img|jng|jpeg|jpe|gif|giff)$
If you are searching a list of URLs
urls = [
    'http://some.link.com/path/to/file.jpg',
    'http://some.link.com/path/to/another.png',
    'http://and.another.place.com/path/to/not-image.txt',
]
to find ones that match a given pattern you can use:
import re
for url in urls:
    if re.match(r'http://.*(jpg|png|gif)$', url):
        print(url)
which will output
http://some.link.com/path/to/file.jpg
http://some.link.com/path/to/another.png
re.match() will test for a match at the start of the string and return a match object for the first two links, and None for the third.
If you want just the extension, you can use the following:
for url in urls:
    m = re.match(r'http://.*(jpg|png|gif)$', url)
    if m:  # skip URLs that do not match
        print(m.groups())
which will print
('jpg',)
('png',)
You will get just the extensions because that's what was defined as a group.
If you need to find the url in a long string of text (such as returned from wget), you need to use re.search() and enclose the part you are interested in with ( )'s. For example,
response = """dlkjkd dkjfadlfjkd fkdfl kadfjlkadfald ljkdskdfkl adfdf
kjakldjflkhttp://some.url.com/path/to/file.jpgkaksdj fkdjakjflakdjfad;kadj af
kdlfjd dkkf aldfkaklfakldfkja df"""
reg = re.search(r'(http:.*/(.*\.(jpg|png|gif)))', response)
print(reg.groups())
will print
('http://some.url.com/path/to/file.jpg', 'file.jpg', 'jpg')
Alternatively, you can use re.findall or re.finditer in place of re.search to get all of the URLs in the long response; re.search only returns the first one.
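A sketch of the re.finditer variant, under a few assumptions: the sample text and the a.example / b.example hosts are made up, and the pattern is a tightened version of the one above (\S instead of . so that a match cannot run past whitespace, plus a trailing \b after the extension).

```python
import re

text = "see http://a.example/x/one.jpg and http://b.example/y/two.png for details"

# Group 1: the full URL; group 2: the filename.
IMAGE_URL_RE = re.compile(r"(http://\S*?/([^/\s]+\.(?:jpg|png|gif)))\b")

found = [(m.group(1), m.group(2)) for m in IMAGE_URL_RE.finditer(text)]
for full_url, filename in found:
    print(full_url, filename)
# http://a.example/x/one.jpg one.jpg
# http://b.example/y/two.png two.png
```

Unlike re.search, finditer yields one match object per URL, so the groups stay available for each hit.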
