How to check if a string matches a certain pattern? - python

I have a string which is basically a file path of an .mp4 file.
I want to test if the file path is matching one of the following patterns:
/*.mp4 (nothing before the slash, anything after)
*/*.mp4 (anything before and after the slash)
[!A]*.mp4 (anything before the extension, **except** for the character 'A')
What would be the best way to achieve this?
Thanks!
EDIT:
I'm not looking to test whether the file ends with .mp4; I'm looking to test whether it ends with .mp4 and matches each of those 3 scenarios separately.
I tried using 'endswith', but it's too general and can't "get specific" in the way my examples require.

Here they are:
string.endswith('.mp4') and string.startswith('/')
string.endswith('.mp4') and "/" in string
string.endswith('.mp4') and "A" not in string
Or, look at using fnmatch.
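For the fnmatch suggestion, a minimal sketch with the standard-library fnmatch.fnmatchcase (the case-sensitive variant) might look like this. The sample paths and the helper name matching_patterns are made up for illustration. Two behaviours are worth keeping in mind: in fnmatch, * also matches slashes, and [!A] only excludes 'A' as the first character, which is narrower than the "A" not in string check above.

```python
from fnmatch import fnmatchcase

# The three shell-style patterns from the question.
PATTERNS = ["/*.mp4", "*/*.mp4", "[!A]*.mp4"]

def matching_patterns(path):
    """Return the patterns from PATTERNS that the given path matches."""
    return [p for p in PATTERNS if fnmatchcase(path, p)]

# Hypothetical sample paths, tested against each pattern separately.
for sample in ["/video.mp4", "dir/video.mp4", "Apple.mp4"]:
    print(sample, "->", matching_patterns(sample))
```

If "no 'A' anywhere before the extension" is what is actually wanted, the plain string check is probably the simpler tool.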

Related

Regular expression in python to get the last occurence of a file extension in a URL or path

Given a long url or path how do I get the last file extension in it. For example consider these two strings.
url = 'https://image.freepik.com/free-vector/vector-chickens-full-emotions_75487-787.jpg?x=2'
path = './image.freepik.com/free-vector/vector-chickens-full-emotions_75487-787.abc.jpg'
The last extension is jpg and comes after the last . and before the following non-alphanumerics or end-of-string.
There are similar questions to mine but I can't find an exact match.
re.search(r'\.(\w+)(?!.*\.)', url).group(1)
Use negative lookahead to search for matches that aren't followed by dots
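As a quick check of the lookahead approach, the sketch below runs the pattern against both strings from the question. The names LAST_EXT, ext_from_url and ext_from_path are only illustrative.

```python
import re

# A dot, then word characters, not followed by any later dot.
LAST_EXT = re.compile(r'\.(\w+)(?!.*\.)')

url = 'https://image.freepik.com/free-vector/vector-chickens-full-emotions_75487-787.jpg?x=2'
path = './image.freepik.com/free-vector/vector-chickens-full-emotions_75487-787.abc.jpg'

ext_from_url = LAST_EXT.search(url).group(1)
ext_from_path = LAST_EXT.search(path).group(1)
print(ext_from_url, ext_from_path)  # jpg jpg
```

Note that a query string containing a dot (for example ?file=a.b) would change the result, which is one reason to consider parsing the URL first.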
Parsing rules are different for filenames and URLs, so don't write a single regex to handle both; it's not simple and not worth your time.
Instead, write a test to determine what type of object you are looking at, i.e. whether it is or is not a URL. This could be as simple as: if it starts with http://, it is a URL; if not, it is not.
Then apply the specific rule to the specific type.
Always make use of standard tools; they have often already figured out the corner cases or things you will forget.
The URL parser: https://docs.python.org/3/library/urllib.parse.html
Then, for files use: os.path.splitext(path)
in the standard python library: https://docs.python.org/3/library/os.path.html
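A minimal sketch of that two-step approach, assuming "has a scheme and a network location" is a good enough test for "is a URL", could look like this. The function name last_extension is hypothetical.

```python
import os.path
from urllib.parse import urlparse

def last_extension(text):
    """Return the last file extension without the dot, or '' if there is none."""
    parsed = urlparse(text)
    # Assumption: a scheme plus a network location means "this is a URL".
    is_url = bool(parsed.scheme and parsed.netloc)
    # For URLs, look only at the path component so the query string is ignored.
    target = parsed.path if is_url else text
    return os.path.splitext(target)[1].lstrip(".")

url = 'https://image.freepik.com/free-vector/vector-chickens-full-emotions_75487-787.jpg?x=2'
path = './image.freepik.com/free-vector/vector-chickens-full-emotions_75487-787.abc.jpg'
print(last_extension(url), last_extension(path))  # jpg jpg
```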

Regex to extract file paths except urls

I have a large text containing some file paths, and I need a regex which can help me extract all the paths. Currently I'm using this one:
\/.+?\/[\w]+\.\w+
It works almost perfectly, but links containing filename or a dot at the end are interpreted as paths too, like this one:
http://example.com/index.html
Help in providing a valid regular expression is highly appreciated. Also if you can add support of spaces in paths in this regex, it would be awesome. Thanks in advance!
You could try a negative look-behind to exclude the "http:" and "https:" prefix.
(?<!https:)(?<!http:)(?<!/)(?<!\w)((/[^\s]+)?/\w+\.\w+)
If you try it with these test strings in pythex:
/abc/def/def.ps
/abc/def/ttt/def.ps
/test.txt
/abc/test.txt http://example.com/index.html
http://www.google.com/bla/test/index.html https://www.google.com/bla/test/index.html
It will match only the four file paths and skip the three URLs.
The advantage of this regular expression is that it does not rely on the beginning of the line to work.
You can add as many look behinds as you wish to support other protocols, like ftp, etc.
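The same test can be reproduced in Python with the re module. This is a sketch using the look-behind pattern and the test strings quoted in this answer; PATH_RE and paths are illustrative names.

```python
import re

# Look-behinds reject a slash that follows "http:", "https:", another
# slash, or a word character (i.e. a slash inside a URL).
PATH_RE = re.compile(r'(?<!https:)(?<!http:)(?<!/)(?<!\w)((/[^\s]+)?/\w+\.\w+)')

text = """/abc/def/def.ps
/abc/def/ttt/def.ps
/test.txt
/abc/test.txt http://example.com/index.html
http://www.google.com/bla/test/index.html https://www.google.com/bla/test/index.html"""

# The pattern has two groups, so take group 1 (the whole path) explicitly.
paths = [m.group(1) for m in PATH_RE.finditer(text)]
print(paths)
```

Using finditer with group(1) avoids the tuples that re.findall would return for a pattern with more than one group.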
Try this: ^\/.+?\/[\w]+\.\w+$ with multi-line mode enabled.

Regex pattern to match two datetime formats

I am doing a directory listing and need to get all directory names that follow one of two patterns: Feb14-2014 or 14022014-sometext. The directory names must not contain dots, so I don't want to match 14022014-sometext.more. In other words, I want to match only the directories that follow the pattern %b%d-%Y or %d%m%Y-textofanylengthWithoutDots.
For the first case it should be something like [a-zA-Z]{3}\d{2}. I don't know how to write the rest because my regex skills are poor, sorry. I hope someone can tell me what the correct patterns look like. Thanks.
I am assuming each directory listing is separated by a new line
([A-Z]\w{2}\d{1,2}\-\d{4}|\d{7,8}\-\w+)$
Will match both cases and will match the text only if it is uninterrupted (by dots or anything else for that matter) until it hits the end of the line.
Some notes:
If you want to match everything except dot you may replace the final \w+ with [^.]+.
You need the multiline modifier (/m, or re.MULTILINE in Python) for this to work; otherwise the $ will match only at the end of the string.
I've not added a ^ to the start of the regex, but you may do so if each line contains a single directory.
Of course you may expand this regex to include (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) instead of [A-Z]\w{2}. I've not done this to keep it readable. I would also suggest you store the month names in a Python list and insert them dynamically into your regex for maintainability's sake.
See it in action: http://regex101.com/r/pS6iY9
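Putting those notes together, a rough Python sketch might build the month alternation from a list and compile the pattern with re.MULTILINE. It adds the optional ^ anchor and assumes one directory name per line; MONTHS, DIR_RE and the sample listing are illustrative.

```python
import re

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# Build the month alternation from the list, then anchor each line.
month_alt = "|".join(MONTHS)
DIR_RE = re.compile(
    r"^(?:(?:" + month_alt + r")\d{1,2}-\d{4}|\d{7,8}-\w+)$",
    re.MULTILINE,
)

# Hypothetical listing, one directory name per line.
listing = "Feb14-2014\n14022014-sometext\n14022014-sometext.more\nnotes"
matches = DIR_RE.findall(listing)
print(matches)  # ['Feb14-2014', '14022014-sometext']
```

The groups are non-capturing so that findall returns the whole matched name rather than a group.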
That's quite easy.
The best one I can make is:
((Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\d\d-\d\d\d\d)|(\d\d\d\d\d\d\d\d-\w+)
The first part ((Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\d\d-\d\d\d\d) matches the first kind of dates and the second part (\d\d\d\d\d\d\d\d-\w+) - the second kind.

python match variable text using regular expression

I am new to Python and trying to find a way to match a sentence that contains variable words,
for example 'The file test.bed in successfully uploaded'.
In the sentence above, the file name would change (it could be sample.png) and the rest of the words would stay the same.
Can anybody let me know the best way to match the sentence using a regular expression?
thanks
If you just want to match anything there:
r'The file (.+?) in successfully uploaded'
The . means any character, and the + means one or more of the preceding.
The ? means to do it non-greedily, so if you have two sentences in a row, like "The file foo.bar is successfully uploaded. The file spam.eggs is successfully uploaded.", it'll match "foo.bar", and then "spam.eggs", rather than just finding one match "foo.bar is successfully uploaded. The file spam.eggs". You may not need it in your application.
Finally, the parentheses are how you mark part of a pattern as a group that you can extract from the match object.
But what if you want to match just valid filenames? Well, you'll need to come up with a rule for valid filenames, which may be different depending on your application. Is it Windows-specific? Is whatever you're parsing quoting filenames with spaces? And so on.
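To see the difference the ? makes, here is a small sketch comparing the non-greedy pattern with a greedy variant. The sample text keeps the question's exact wording ("in successfully uploaded") so that the pattern applies; the variable names are illustrative.

```python
import re

text = ("The file foo.bar in successfully uploaded. "
        "The file spam.eggs in successfully uploaded.")

# Non-greedy: each match stops at the first possible ending.
lazy_names = re.findall(r'The file (.+?) in successfully uploaded', text)

# Greedy: a single match runs to the last possible ending.
greedy_names = re.findall(r'The file (.+) in successfully uploaded', text)

print(lazy_names)    # ['foo.bar', 'spam.eggs']
print(greedy_names)  # ['foo.bar in successfully uploaded. The file spam.eggs']
```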

Python regular expression for matching beginning filenames with wildcard endings

I have the following expression
filingReportURL = re.search(r'Archive[\'"]?([^\'" >]+)', utf8line)
This matches web addresses that begin with Archive, but I'm having trouble because I want filenames with an extension, and I don't know what that extension is. That is, there must be a file extension in every case, e.g. .jpg or .BMP, but it could be something like .xyx123. I've tried adding [\.\w+] to the end, but the search always leaves off the last letter of the extension. Any ideas on a better and cleaner way to do this?
Thanks
Why can't you use a simple match like this?
Archive(.*)/(.*)\.([a-z A-Z 0-9]+)
The replace match will be \2.\3 in grep.
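In Python, the same idea can be tried with re.search, joining groups 2 and 3 to rebuild the file name. This is only a sketch: the sample line is hypothetical, and the greedy (.*) groups may behave differently on real input that contains later slashes or dots.

```python
import re

# Pattern from the answer: directory part, file stem, extension.
ARCHIVE_RE = re.compile(r'Archive(.*)/(.*)\.([a-z A-Z 0-9]+)')

line = 'see Archive/edgar/data/1234/report.htm'  # hypothetical input line

m = ARCHIVE_RE.search(line)
# Equivalent of the \2.\3 replacement: stem + "." + extension.
filename = m.group(2) + "." + m.group(3) if m else None
print(filename)  # report.htm
```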
