I have the following expression:
filingReportURL = re.search(r'Archive[\'"]?([^\'" >]+)', utf8line)
It matches web addresses that begin with Archive, but I'm having trouble because I only want filenames that have an extension, and I don't know in advance what that extension is. In other words, there must always be a file extension, e.g. .jpg or .BMP, but it could just as well be .xyx123. I've tried adding [\.\w+] to the end, but the last letter of the extension is always missing from the result. Any ideas on a better and cleaner way to do this?
Thanks
Why can't you use a simple match like this?
Archive(.*)/(.*)\.([a-zA-Z0-9]+)
The file name with its extension is then \2.\3 when used as a replacement (in sed, for example; grep itself only matches).
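For the Python side of the question, here is a minimal sketch of one way to require an extension. It keeps the asker's character class, then demands a dot plus word characters that must be followed by a quote, space, > or the end of the line. The function name find_archive_file and the sample lines are made up for illustration and are not from the original post.

```python
import re

# Assumption: the extension is the last ".xxx" before a quote, space, '>' or end of line.
ARCHIVE_FILE = re.compile(r'Archive[\'"]?[^\'" >]+\.\w+(?=[\'" >]|$)')

def find_archive_file(line):
    """Return the matched Archive address if it ends in an extension, else None."""
    match = ARCHIVE_FILE.search(line)
    return match.group(0) if match else None

# Hypothetical sample lines
print(find_archive_file('<a href="/Archives/edgar/data/0001/report.htm">'))
print(find_archive_file('<a href="/Archives/edgar/data/0001/">'))
```

The character class [\.\w+] from the question matches exactly one character (a dot, a word character or a plus sign), which is why it cannot stand in for "a dot followed by an extension".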
I have a large text containing some file paths, and I need a regex which can help me extract all the paths. Currently I'm using this one:
\/.+?\/[\w]+\.\w+
It works almost perfectly, but links that end in a filename with a dot are interpreted as paths too, like this one:
http://example.com/index.html
Help with a valid regular expression would be highly appreciated. It would also be great if the regex could support spaces in paths. Thanks in advance!
You could try a negative look-behind to exclude the "http:" and "https:" prefix.
(?<!https:)(?<!http:)(?<!/)(?<!\w)((/[^\s]+)?/\w+\.\w+)
If you try it with these test strings in pythex:
/abc/def/def.ps
/abc/def/ttt/def.ps
/test.txt
/abc/test.txt http://example.com/index.html
http://www.google.com/bla/test/index.html https://www.google.com/bla/test/index.html
It will only match the first 4.
The advantage of this regular expression is that it does not rely on the beginning of the line to work.
You can add as many look behinds as you wish to support other protocols, like ftp, etc.
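A minimal sketch of how the look-behind pattern above might be used from Python, assuming the text is already in a string. The function name extract_paths is made up for illustration; the pattern is the one from the answer, unchanged.

```python
import re

# Pattern from the answer: reject a leading "/" that follows "http:", "https:",
# another "/" or a word character, then match an optional directory part and a file name.
PATH_RE = re.compile(r"(?<!https:)(?<!http:)(?<!/)(?<!\w)((/[^\s]+)?/\w+\.\w+)")

def extract_paths(text):
    """Return every file path found in text, skipping http/https URLs."""
    return [m.group(1) for m in PATH_RE.finditer(text)]

sample = (
    "/abc/def/def.ps\n"
    "/abc/def/ttt/def.ps\n"
    "/test.txt\n"
    "/abc/test.txt http://example.com/index.html\n"
    "http://www.google.com/bla/test/index.html https://www.google.com/bla/test/index.html"
)
print(extract_paths(sample))
```

finditer is used instead of findall because the pattern contains two capturing groups, so findall would return tuples rather than the full paths.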
Try this: ^\/.+?\/[\w]+\.\w+$ with multi-line mode enabled.
I have a string which is basically a file path of an .mp4 file.
I want to test if the file path is matching one of the following patterns:
/*.mp4 (nothing before the slash, anything after)
*/*.mp4 (anything before and after the slash)
[!A]*.mp4 (anything before the extension, **except** for the character 'A')
What would be the best way to achieve this?
Thanks!
EDIT:
I'm not looking to test whether the file ends with .mp4; I'm looking to test whether it ends with .mp4 and matches each of those 3 scenarios separately.
I tried using endswith, but it's too general and can't get as specific as the patterns in my examples.
Here they are:
string.endswith('.mp4') and string.startswith('/')
string.endswith('.mp4') and "/" in string
string.endswith('.mp4') and "A" not in string
Or, look at using fnmatch.
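A minimal sketch of the fnmatch route, assuming the three patterns from the question are used as written. The helper name matching_patterns is made up for illustration. fnmatchcase is used so that the comparison is case-sensitive on every platform. Note that in fnmatch syntax * also matches slashes, and [!A] only excludes an A at that one position (here, the first character).

```python
from fnmatch import fnmatchcase

# The three shell-style patterns from the question
PATTERNS = ["/*.mp4", "*/*.mp4", "[!A]*.mp4"]

def matching_patterns(path):
    """Return the patterns from PATTERNS that the given path matches."""
    return [pattern for pattern in PATTERNS if fnmatchcase(path, pattern)]

for path in ["/videos/clip.mp4", "videos/clip.mp4", "clip.mp4", "Aclip.mp4"]:
    print(path, matching_patterns(path))
```

If "no A anywhere in the name" is what is wanted, the "A" not in string check from the answer above expresses that more directly than [!A].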
I googled it and read the docs but couldn't understand them, so how should I do this?
I want to check whether a string is the name of a text file.
So I tried using re.search('*.txt', string) in Python,
but '*.txt' raises an error. (When I change it to 'txt' it works, but then it also matches 'txt' inside a file name, not just as the file type.)
So how should I do this?
Too much work.
S.endswith('.txt')
You can do:
re.search(r".*\.txt$", "hello.txt")
meaning: match whatever you want (.*), then a literal dot (\.), then "txt" at the end ($).
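A short sketch contrasting the two answers and showing why the original attempt fails. In a regular expression * is a quantifier, so a pattern that starts with it has nothing to repeat and is rejected with re.error. The helper names here are made up for illustration.

```python
import re

def is_txt_by_regex(name):
    """True if name ends in a literal '.txt' (regex version)."""
    return re.search(r"\.txt$", name) is not None

def is_txt_by_endswith(name):
    """True if name ends in '.txt' (plain string version)."""
    return name.endswith(".txt")

def star_pattern_is_rejected():
    """'*.txt' is a shell wildcard, not a valid regex: '*' has nothing to repeat."""
    try:
        re.compile("*.txt")
    except re.error:
        return True
    return False

for name in ["hello.txt", "txt", "hello.txt.bak"]:
    print(name, is_txt_by_regex(name), is_txt_by_endswith(name))
print(star_pattern_is_rejected())
```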
I am doing a directory listing and need to get all directory names that follow the patterns Feb14-2014 and 14022014-sometext. The directory names must not contain dots, so I don't want to match 14022014-sometext.more. As you can see, I want to match just the directories that follow the patterns %b%d-%Y and %d%m%Y-textofanylengthWithoutDots.
For the first case it should be something like [a-zA-Z]{3}\d{2}. I don't know how to write the rest because my regex skills are poor, sorry. So I hope someone can tell me what the correct patterns look like. Thanks.
I am assuming the directory names in the listing are separated by newlines:
([A-Z]\w{2}\d{1,2}\-\d{4}|\d{7,8}\-\w+)$
This will match both cases, and only if the text is uninterrupted (by dots or anything else, for that matter) until it hits the end of the line.
Some notes:
If you want to match everything except a dot, you may replace the final \w+ with [^.]+ (or [^.\n]+, so that a match cannot run across lines).
You need the multiline modifier /m (re.MULTILINE in Python) for this to work; otherwise $ will match only at the end of the string.
I've not added a ^ to the start of the regex, but you may do so if each line contains a single directory.
Of course, you may expand this regex to use (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) instead of [A-Z]\w{2}. I've not done this, to keep it readable. I would also suggest you store the month names in a Python list and insert them dynamically into your regex, for maintainability's sake.
See it in action: http://regex101.com/r/pS6iY9
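A minimal sketch of the notes above in Python, assuming one directory name per line. The month names are kept in a list and joined into the pattern, a ^ anchor is added, and re.MULTILINE makes ^ and $ work per line. The names MONTHS, DIR_PATTERN and matching_dirs are made up for illustration.

```python
import re

# Month abbreviations kept in a list and inserted into the regex dynamically
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

DIR_PATTERN = re.compile(
    rf"^(?:(?:{'|'.join(MONTHS)})\d{{1,2}}-\d{{4}}|\d{{7,8}}-\w+)$",
    re.MULTILINE,  # let ^ and $ match at every line boundary
)

def matching_dirs(listing):
    """Return the directory names in a newline-separated listing that fit either pattern."""
    return DIR_PATTERN.findall(listing)

listing = "Feb14-2014\n14022014-sometext\n14022014-sometext.more\nnotes"
print(matching_dirs(listing))
```

Because the pattern has no capturing groups, findall returns the full matched names.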
That's quite easy.
The best one I can make is:
((Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\d\d-\d\d\d\d)|(\d\d\d\d\d\d\d\d-\w+)
The first part ((Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\d\d-\d\d\d\d) matches the first kind of dates and the second part (\d\d\d\d\d\d\d\d-\w+) - the second kind.
Basically, I want to select the following files in Python using the glob module (glob.glob):
log_fasdsaf
log_bifsd72q
log_asfd8
...
but excluding:
log_fdsaf_7832
log_fsafn_fsda
log_dsaf8_8d
...
I naively played around with Linux wildcards (e.g. log_[!_]), but that apparently doesn't work. The question "How can I use inverse or negative wildcards when pattern matching in a unix/linux shell?" doesn't seem to help. Thanks for any advice!
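The [!_] in your glob only excludes an underscore at that single position; a wildcard after it still matches anything, underscores included.
If you're looking for any file which has log_ at the beginning, followed by a run of characters none of which is _, treat the following as a regular expression and match it against the whole file name:
log_[^_]*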
You are close. The pattern you are looking for is log_[^_]*, used as a regular expression matched against the whole name (Python's glob has no equivalent). It says the name must be 'log_' followed by zero or more non-underscore characters.
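A minimal sketch of one way to combine the two, assuming the candidate names are first collected with a broad glob such as glob.glob("log_*") and then filtered with the regular expression. The function name select_logs is made up for illustration; it works on any list of names, so it can be tried without touching the file system.

```python
import re

# 'log_' followed only by non-underscore characters, matched against the whole name
LOG_NAME = re.compile(r"log_[^_]*")

def select_logs(names):
    """Keep the names that are 'log_' plus a suffix containing no further underscore."""
    return [name for name in names if LOG_NAME.fullmatch(name)]

names = ["log_fasdsaf", "log_bifsd72q", "log_asfd8",
         "log_fdsaf_7832", "log_fsafn_fsda", "log_dsaf8_8d"]
print(select_logs(names))
```

With real files, something like select_logs(glob.glob("log_*")) would apply the same filter to names in the current directory; for paths that include a directory part, the base name would have to be extracted first.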