Regex expression to match last numerical component, but exclude file extension - python

I'm stumped trying to figure out a regex expression. Given a file path, I need to match the last numerical component of the path ("frame" number in an image sequence), but also ignore any numerical component in the file extension.
For example, given path:
/path/to/file/abc123/GCAM5423.xmp
The following expression will correctly match 5423.
((?P<index>(?P<padding>0*)\d+)(?!.*(0*)\d+))
However, this expression fails if for example the file extension contains a number as follows:
/path/to/file/abc123/GCAM5423.cr2
In this case the expression will match the 2 in the file extension, when I still need it to match 5423. How can I modify the above expression to ignore file extensions that have a numerical component?
Using python flavor of regex. Thanks in advance!
Edit: Thanks all for your help! To clarify, I specifically need to modify the above expression to only capture the last group. I am passing this pattern to an external library so it needs to include the named groups and to only match the last number prior to the extension.

You can try this one:
\/[a-zA-Z]*(\d*)\.[a-zA-Z0-9]{3,4}$

Try this pattern:
\/[^/\d\s]+(\d+)\.[^/]+$
Code:
import re
pattern = r"\/[^/\d\s]+(\d+)\.[^/]+$"
texts = ['/path/to/file/abc123/GCAM5423.xmp', '/path/to/file/abc123/GCAM5423.cr2']
print([match.group(1) for x in texts if (match := re.search(pattern, x))])
Output:
['5423', '5423']

Step 1: Find the substring before the last dot (capture group 1).
(.*)\.
Input: /path/to/file/abc123/GCAM5423.cr2
Output: /path/to/file/abc123/GCAM5423
Step 2: Find the last number using your regex.
Input: /path/to/file/abc123/GCAM5423
Output: 5423
I don't know how to combine these two regexes into one, but I hope this is still useful to you.
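The two steps above can be joined into a single pattern with a lookahead, which also keeps the named groups from the question. This is only a sketch, not a verified drop-in for the external library: the lookahead is an assumption about the path shape (only non-digit, non-slash characters may sit between the frame number and the final extension), and extract_frame is a hypothetical helper name.

```python
import re

# Named groups kept from the question. The lookahead requires that only
# non-digit, non-slash characters separate the number from the final ".ext".
FRAME_PATTERN = re.compile(
    r"(?P<index>(?P<padding>0*)\d+)(?=[^\d/]*\.[^./]*$)"
)

def extract_frame(path):
    """Return (index, padding) for the last number before the extension, or None."""
    match = FRAME_PATTERN.search(path)
    if match is None:
        return None
    return match.group("index"), match.group("padding")

print(extract_frame("/path/to/file/abc123/GCAM5423.xmp"))  # ('5423', '')
print(extract_frame("/path/to/file/abc123/GCAM5423.cr2"))  # ('5423', '')
print(extract_frame("/seq/img_0042.cr2"))                  # ('0042', '00')
```

Because the extension is only inspected inside the lookahead, digits in it (such as the 2 in cr2) are never part of the match.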

Related

python regex: match everything inside brackets including other brackets [duplicate]

In python, I can easily search for the first occurrence of a regex within a string like this:
import re
re.search("pattern", "target_text")
Now I need to find the last occurrence of the regex in a string, which doesn't seem to be supported by the re module.
I can reverse the string to "search for the first occurrence", but I also need to reverse the regex, which is a much harder problem.
I can also iterate to find all occurrences from left to right, and just keep the last one, but that looks awkward.
Is there a smart way to find the rightmost occurrence?
One approach is to prefix the regex with (?s:.*) and force the engine to try matching at the furthest position and gradually backing off:
re.search("(?s:.*)pattern", "target_text")
Do note that the result of this method may differ from re.findall("pattern", "target_text")[-1], since the findall method searches for non-overlapping matches, and not all substrings which can be matched are included in the result.
For example, executing the regex a.a on abaca, findall would return aba as the only match and select it as the last match, while the code above will return aca as the match.
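A short sketch of the difference described above. Note that the greedy prefix is part of the overall match, so a capturing group (added here for illustration, it is not in the original snippet) is needed to isolate the text matched by the pattern itself. Scoped inline flags such as (?s:...) need Python 3.6 or later.

```python
import re

text = "abaca"

# findall scans left to right for non-overlapping matches, so after "aba"
# only "ca" remains and "aca" is never reported.
print(re.findall(r"a.a", text))  # ['aba']

# The greedy prefix pushes the match as far right as possible. group(1)
# holds the part matched by the pattern; group(0) includes the prefix.
rightmost = re.search(r"(?s:.*)(a.a)", text)
print(rightmost.group(1))  # aca
print(rightmost.group(0))  # abaca
```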
Yet another alternative is to use regex package, which supports REVERSE matching mode.
The result would be more or less the same as the method with (?s:.*) in re package as described above. However, since I haven't tried the package myself, it's not clear how backreference works in REVERSE mode - the pattern might require modification in such cases.
import re
re.search("pattern(?!.*pattern)", "target_text")
or
import re
re.findall("pattern", "target_text")[-1]
You can use these 2 approaches.
If you want positions use
import re
x = "abc abc abc"
print([(i.start(), i.end(), i.group()) for i in re.finditer(r"abc", x)][-1])
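If building the full list of matches is a concern for long inputs, the loop can keep only the most recent match instead. A minimal sketch; last_match is a hypothetical helper name, not something the re module provides.

```python
import re

def last_match(pattern, text):
    """Return the last non-overlapping match of pattern in text, or None."""
    result = None
    for result in re.finditer(pattern, text):
        pass  # keep overwriting; the final value is the rightmost match
    return result

m = last_match(r"abc", "abc abc abc")
print((m.start(), m.end(), m.group()))  # (8, 11, 'abc')
```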
One approach is to use split. For example if you wanted to get the last group after ':' in this sample string:
mystr = 'dafdsaf:ewrewre:cvdsfad:ewrerae'
':'.join(mystr.split(':')[-1:])

Extract file name with a regular expression

I want to create a regular expression to extract the filename from a URL:
https://example.net/img/src/img.jpg
I want to extract img.jpg
I use urlparse from Python, but it extracts the path like this:
img/src/img.jpg
How can I extract the file name with a regular expression?
Using str.split and negative indexing
url = "https://example.net/img/src/img.jpg"
print(url.split("/")[-1])
Output:
img.jpg
or using os.path.basename
import os
from urllib.parse import urlparse
url = "https://example.net/img/src/img.jpg"
a = urlparse(url)
print(os.path.basename(a.path)) #--->img.jpg
You can either split on / and select the last element of the returned list (the best solution in my opinion),
or, if you really want to use a regex, you can use the following one:
(?<=\/)(?:(?:\w+\.)*\w+)$
Note that this only accepts filenames made up of \w characters (letters, digits, underscores) separated by dots.
You can adapt and change the \w to accept other characters if necessary.
Explanations:
(?<=\/) is a positive lookbehind on /, and $ adds the constraint that the filename is the last element of the path.
(?:(?:\w+\.)*\w+) extracts words composed of letters, digits, and possibly underscores, each followed by a dot; this group can be repeated as many times as necessary (an xxx.tar.gz file, for example) and is then followed by the final extension.
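A quick check of this pattern against a few URLs, as a sketch. The second and third URLs are made-up examples, and filename_of is a hypothetical helper name. The third call shows the \w restriction in practice: a hyphen is not matched by \w, so no filename is found.

```python
import re

FILENAME = re.compile(r"(?<=\/)(?:(?:\w+\.)*\w+)$")

def filename_of(url):
    """Return the trailing filename matched by the pattern, or None."""
    match = FILENAME.search(url)
    return match.group(0) if match else None

print(filename_of("https://example.net/img/src/img.jpg"))    # img.jpg
print(filename_of("https://example.net/files/data.tar.gz"))  # data.tar.gz
print(filename_of("https://example.net/img/my-file.jpg"))    # None
```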
If your URL pattern is static, you can use a positive lookahead:
import re
pattern =r'\w+(?=\.jpg)'
text="""https://example.net/img/src/img.jpg
"""
print(re.findall(pattern,text)[0])
output:
img

Python - Parsing JSON formatted text file with regex

I have a text file formatted like a JSON file however everything is on a single line (could be a MongoDB File). Could someone please point me in the direction of how I could extract values using a Python regex method please?
The text shows up like this:
{"d":{"__type":"WikiFileNodeContent:http:\/\/samplesite.com.au\/ns\/business\/wiki","author":null,"description":null,"fileAssetId":"034b9317-60d9-45c2-b6d6-0f24b59e1991","filename":"Reports.pdf"},"createdBy":1531,"createdByUsername":"John Cash","icon":"\/Assets10.37.5.0\/pix\/16x16\/page_white_acrobat.png","id":3041,"inheritedPermissions":false,"name":"map","permissions":[23,87,35,49,65],"type":3,"viewLevel":2},{"__type":"WikiNode:http:\/\/samplesite.com.au\/ns\/business\/wiki","children":[],"content":
I want to get the "fileAssetId" and "filename" values. I've tried to load the text with Python's json module, but I get an error.
For the fileAssetId I tried this regex:
regex = re.compile(r"([0-9a-f]{8})\S*-\S*([0-9a-f]{4})\S*-\S*([0-9a-f]{4})\S*-\S*([0-9a-f]{4})\S*-\S*([0-9a-f]{12})")
But I get the following: 034b9317, 60d9, 45c2, b6d6, 0f24b59e1991
I'm not sure how to get the data as it's displayed.
How about using positive lookahead and lookbehind:
(?<=\"fileAssetId\":\")[a-fA-F0-9-]+?(?=\")
captures the fileAssetId and
(?<=\"filename\":\").+?(?=\")
matches the filename.
For a detailed explanation of the regex have a look at the Regex101-Example. (Note: I combined both in the example with an OR-Operator | to show both matches at once)
To get a list of all matches use re.findall or re.finditer instead of re.match.
re.findall(pattern, string) returns a list of matching strings.
re.finditer(pattern, string) returns an iterator of match objects.
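For example, applied to a shortened version of the sample text (the string below is trimmed from the question for readability, and the variable names are only illustrative):

```python
import re

sample = (
    '{"d":{"fileAssetId":"034b9317-60d9-45c2-b6d6-0f24b59e1991",'
    '"filename":"Reports.pdf"},"createdBy":1531}'
)

asset_id_pattern = r'(?<="fileAssetId":")[a-fA-F0-9-]+?(?=")'
filename_pattern = r'(?<="filename":").+?(?=")'

# findall returns the matched strings directly.
asset_ids = re.findall(asset_id_pattern, sample)
print(asset_ids)  # ['034b9317-60d9-45c2-b6d6-0f24b59e1991']

# finditer yields match objects, which also carry positions.
filenames = [m.group(0) for m in re.finditer(filename_pattern, sample)]
print(filenames)  # ['Reports.pdf']
```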
You can use python's walk method and check each entry with re.match.
In case the string you have is not convertible to a Python dict, you can use a regex alone:
print(re.match(r'.*fileAssetId\":\"([^\"]+)\".*', your_string).group(1))
Solution for your example:
import re
example_string = '{"d":{"__type":"WikiFileNodeContent:http:\/\/samplesite.com.u\/ns\/business\/wiki","author":null,"description":null,"fileAssetId":"034b9317-60d9-45c2-b6d6-0f24b59e1991","filename":"Reports.pdf"},"createdBy":1531,"createdByUsername":"John Cash","icon":"\/Assets10.37.5.0\/pix\/16x16\/page_white_acrobat.png","id":3041,"inheritedPermissions":false,"name":"map","permissions":[23,87,35,49,65],"type":3,"viewLevel":2},{"__type":"WikiNode:http:\/\/samplesite.com.au\/ns\/business\/wiki","children":[],"content"'
regex_pattern = r'.*fileAssetId\":\"([^\"]+)\".*'
match = re.match(regex_pattern, example_string)
fileAssetId = match.group(1)
print('fileAssetId: {}'.format(fileAssetId))
Executing this yields:
fileAssetId: 034b9317-60d9-45c2-b6d6-0f24b59e1991
Try adding \n to the string that you are entering in to the file (\n means new line)
Based on the idea given here https://stackoverflow.com/a/3845829 and by following the JSON standard https://www.json.org/json-en.html, we can use Python + regex https://pypi.org/project/regex/ and do the following:
import regex  # third-party package from PyPI, not the standard re module

json_pattern = (
    r'(?(DEFINE)'
    r'(?P<whitespace>( |\n|\r|\t)*)'
    r'(?P<boolean>true|false)'
    r'(?P<number>-?(0|([1-9]\d*))(\.\d+)?([eE][+-]?\d+)?)'
    r'(?P<string>"([^"\\]|\\("|\\|/|b|f|n|r|t|u[0-9a-fA-F]{4}))*")'
    r'(?P<array>\[((?&whitespace)|(?&value)(,(?&value))*)\])'
    r'(?P<key>(?&whitespace)(?&string)(?&whitespace))'
    r'(?P<value>(?&whitespace)((?&boolean)|(?&number)|(?&string)|(?&array)|(?&object)|null)(?&whitespace))'
    r'(?P<object>\{((?&whitespace)|(?&key):(?&value)(,(?&key):(?&value))*)\})'
    r'(?P<document>(?&object)|(?&array))'
    r')'
    r'(?&document)'
)
json_regex = regex.compile(json_pattern)
# json_document_text is the JSON text you want to match
match = json_regex.match(json_document_text)
You can change the last line of json_pattern to match individual objects instead of whole documents by replacing (?&document) with (?&object). The regex turned out simpler than I expected, though I have not run extensive tests on it. It works fine for me, and I have tested it on hundreds of files. I will try to improve my answer if I find any issue when running it.

Regex to extract file paths except urls

I have a large text containing some file paths, and I need a regex which can help me extract all the paths. Currently I'm using this one:
\/.+?\/[\w]+\.\w+
It works almost perfectly, but links containing filename or a dot at the end are interpreted as paths too, like this one:
http://example.com/index.html
Help in providing a valid regular expression is highly appreciated. Also if you can add support of spaces in paths in this regex, it would be awesome. Thanks in advance!
You could try a negative look-behind to exclude the "http:" and "https:" prefix.
(?<!https:)(?<!http:)(?<!/)(?<!\w)((/[^\s]+)?/\w+\.\w+)
If you try it with these test strings in pythex:
/abc/def/def.ps
/abc/def/ttt/def.ps
/test.txt
/abc/test.txt http://example.com/index.html
http://www.google.com/bla/test/index.html https://www.google.com/bla/test/index.html
It will only match the first four paths.
The advantage of this regular expression is that it does not rely on the beginning of the line to work.
You can add as many look behinds as you wish to support other protocols, like ftp, etc.
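The same check can be run locally, as a sketch. finditer with group(0) is used here because the pattern contains capturing groups, so findall would return tuples of groups rather than the full matches. The name PATH_PATTERN is only illustrative.

```python
import re

PATH_PATTERN = re.compile(
    r"(?<!https:)(?<!http:)(?<!/)(?<!\w)((/[^\s]+)?/\w+\.\w+)"
)

text = """/abc/def/def.ps
/abc/def/ttt/def.ps
/test.txt
/abc/test.txt http://example.com/index.html
http://www.google.com/bla/test/index.html https://www.google.com/bla/test/index.html"""

# group(0) is the whole match, regardless of the inner capturing groups.
paths = [m.group(0) for m in PATH_PATTERN.finditer(text)]
print(paths)
# ['/abc/def/def.ps', '/abc/def/ttt/def.ps', '/test.txt', '/abc/test.txt']
```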
Try this: ^\/.+?\/[\w]+\.\w+$ with multi-line mode enabled.

How can I match any substring except a particular one in python

I want to write a regular expression that will match the following string
a (any substring except 'ABC') ABC
An example for this would be a pqrs h js ABC
The tricky part is to match any substring except 'ABC'. Since the document I am searching can contain multiple lines with this pattern, and I want to find each of those lines separately, I can't use the following expression
a.*ABC
because this would give me a single match extending from where the first a is found up to where the last 'ABC' is found in the document.
There is this answer which says I can use a negative lookahead, but that is not working in Python, or maybe not in my case because there is a substring before it. I have not tested that expression on its own because it would not serve my purpose.
Use the non-greedy quantifier, i.e. ?
^a.*?ABC
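A small sketch of the difference, using a made-up three-line document. re.MULTILINE is an assumption here so that ^ anchors at the start of every line, and re.DOTALL is used only to reproduce the document-spanning greedy match described in the question.

```python
import re

document = "a pqrs h js ABC and more ABC\nb nothing here\na xyz ABC"

# Greedy, with "." allowed to cross newlines: a single match running from
# the first "a" to the last "ABC" in the whole document.
greedy = re.findall(r"a.*ABC", document, re.DOTALL)
print(len(greedy))  # 1

# Non-greedy, anchored per line: each match stops at the first "ABC".
lazy = re.findall(r"^a.*?ABC", document, re.MULTILINE)
print(lazy)  # ['a pqrs h js ABC', 'a xyz ABC']
```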
