python glob for "excluding pattern" - python

Basically, I want to match the following files in Python using glob.glob:
log_fasdsaf
log_bifsd72q
log_asfd8
...
but excluding:
log_fdsaf_7832
log_fsafn_fsda
log_dsaf8_8d
...
I naively played around with Linux wildcards (e.g. log_[!_]), but that apparently does not work. The question "How can I use inverse or negative wildcards when pattern matching in a unix/linux shell?" did not seem to help either. Thanks for any advice!

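Your pattern is missing a repetition: log_[!_] matches log_ followed by exactly one character that is not _.
If you're looking to find any file which has log_ at the beginning, and then a load of characters where none of them are _, the regular expression for that is:
log_[^_]*
Note that this is regex syntax. glob negates with [!_] rather than [^_], and its * also matches underscores, so apply the regex to the file names instead of passing it to glob.glob.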

You are close. The pattern you are looking for is log_[^_]*, read as a regular expression matched against the whole file name. It says the name must be 'log_' followed by zero or more non-underscore characters.
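As a minimal sketch of how that pattern could be applied in Python 3 (the helper name keep_simple_logs is made up for illustration, and the names are the ones from the question), the regex can filter a list of file names after they have been collected:

```python
import re

# Regex from the answers above; fullmatch() anchors it to the whole name.
LOG_NAME = re.compile(r"log_[^_]*")

def keep_simple_logs(names):
    """Return the names that are 'log_' followed only by non-underscore characters."""
    return [name for name in names if LOG_NAME.fullmatch(name)]

# File names taken from the question.
candidates = [
    "log_fasdsaf", "log_bifsd72q", "log_asfd8",
    "log_fdsaf_7832", "log_fsafn_fsda", "log_dsaf8_8d",
]
selected = keep_simple_logs(candidates)
print(selected)
```

With real files, the candidates could come from os.listdir() on the directory, or from the base names of glob.glob("log_*") results.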

Related

Regex to extract file paths except urls

I have a large text containing some file paths, and I need a regex which can help me extract all the paths. Currently I'm using this one:
\/.+?\/[\w]+\.\w+
It works almost perfectly, but links containing filename or a dot at the end are interpreted as paths too, like this one:
http://example.com/index.html
Help in providing a valid regular expression is highly appreciated. Also if you can add support of spaces in paths in this regex, it would be awesome. Thanks in advance!
You could try a negative look-behind to exclude the "http:" and "https:" prefix.
(?<!https:)(?<!http:)(?<!/)(?<!\w)((/[^\s]+)?/\w+\.\w+)
If you try it with these test strings in pythex:
/abc/def/def.ps
/abc/def/ttt/def.ps
/test.txt
/abc/test.txt http://example.com/index.html
http://www.google.com/bla/test/index.html https://www.google.com/bla/test/index.html
It will only match the first 4.
The advantage of this regular expression is that it does not rely on the beginning of the line to work.
You can add as many lookbehinds as you wish to support other protocols, like ftp, etc.
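The lookbehind pattern can be tried from Python 3 directly. This is only a sketch: the pattern and test strings are copied from the answer above, while the variable names are illustrative.

```python
import re

# Pattern copied from the answer above: the lookbehinds reject a "/" that
# follows "https:", "http:", another "/", or a word character.
PATH_RE = re.compile(r"(?<!https:)(?<!http:)(?<!/)(?<!\w)((/[^\s]+)?/\w+\.\w+)")

text = """/abc/def/def.ps
/abc/def/ttt/def.ps
/test.txt
/abc/test.txt http://example.com/index.html
http://www.google.com/bla/test/index.html https://www.google.com/bla/test/index.html"""

# group(1) is the whole path; findall() would return tuples here because
# the pattern contains a nested group.
paths = [m.group(1) for m in PATH_RE.finditer(text)]
print(paths)
```

finditer() is used instead of findall() so that only the outer group is collected.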
Try this: ^\/.+?\/[\w]+\.\w+$ with multi-line mode enabled.

How can I use re.search('.*txt', ) like perl, in python 2.7?

I googled it and read the docs, but I couldn't work out how to do it.
I want to check whether a string names a text file.
So I tried re.search('*.txt', string) in Python,
but '*.txt' raises an error. (When I change it to 'txt' it works, but then it also matches 'txt' inside a file name, not only as the file type.)
So how should I do this?
Too much work.
S.endswith('.txt')
You can do:
re.search(r".*\.txt$", "hello.txt")
meaning: match whatever you want (.*), then a literal dot (\.), then "txt" at the end of the string ($).
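The two suggestions above can be compared side by side. This is a small Python 3 sketch; the function names are made up for illustration. It also shows why '*.txt' fails as a regex: the * has nothing before it to repeat.

```python
import re

def is_txt_by_suffix(name):
    """Plain string check, no regex needed."""
    return name.endswith(".txt")

def is_txt_by_regex(name):
    """Regex check using the pattern from the answer above."""
    return re.search(r".*\.txt$", name) is not None

# A shell-style wildcard is not a valid regex: "*" needs something to repeat.
try:
    re.compile("*.txt")
    star_pattern_ok = True
except re.error:
    star_pattern_ok = False

print(is_txt_by_suffix("hello.txt"), is_txt_by_regex("hello.txt"), star_pattern_ok)
```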

Python regular expression for matching beginning filenames with wildcard endings

I have the following expression
filingReportURL = re.search(r'Archive[\'"]?([^\'" >]+)', utf8line)
This matches web addresses that begin with Archive, but I'm having trouble because I want filenames with an extension and I don't know what that extension is. That is, there must be a file extension in every case, e.g. .jpg or .BMP, but it could just as well be .xyx123. I've tried adding [\.\w+] to the end, but I'm always left with the last letter of the extension missing when I do the search. Any ideas on a better and cleaner way to do this?
Thanks
Why can't you use a simple match like this?
Archive(.*)/(.*)\.([a-z A-Z 0-9]+)
The replace match will be \2.\3 in grep.
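In Python 3 the same idea could look roughly like this. The pattern is copied from the answer above; the sample string is hypothetical and made up for illustration, and groups 2 and 3 are joined by hand in place of the \2.\3 replacement.

```python
import re

# Pattern copied from the answer above.
ARCHIVE_RE = re.compile(r"Archive(.*)/(.*)\.([a-z A-Z 0-9]+)")

# Hypothetical input, not taken from the question.
sample = "Archives/edgar/data/0001/filing.xyx123"

match = ARCHIVE_RE.search(sample)
# Group 2 is the base name, group 3 the extension (whatever it happens to be).
file_name = "{}.{}".format(match.group(2), match.group(3)) if match else None
print(file_name)
```

Because both (.*) groups are greedy, real input with several addresses on one line may need a tighter pattern.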

Which is the standard posix way:double \ or single \ in regex expression?

To get the files whose names end in a dot (.), you can do:
1. R
list.files(path='/home/test', all.files=TRUE, pattern="\\.$")
or
list.files(path='/home/test', all.files=TRUE, pattern=".+\\.$")
In R the backslash must be doubled; neither .+\.$ nor \.$ can be used.
2. Python
import os
import re
for root, dirs, files in os.walk("/home/test"):
    for file in files:
        # print names that end in a literal dot
        if re.search(".+\.$", file):
            print(file)
In Python, any of .+\\.$, \.$ or \\.$ can be used.
3. Shell
find /home/test -regex ".+\.$"
".+\.$" can be used as-is in the shell too.
I want to know:
1. Which is the POSIX way, .+\\.$ or .+\.$?
2. Why can't I use find /home/test -regex "\.$" in the shell?
There is a difference between what you type and what a regex engine sees in the end.
There could be two questions:
1. What flavor of regex does my tool (R, Python, find) understand? For example, if you use Python, you should ask what syntax the re module supports.
2. How do I input a regex? For example, ".+\.$" in find -regex ".+\.$" is interpreted by a shell first. So the question becomes: how does your shell interpret ".+\.$"? Once you've answered that, you could ask what regex syntax find supports.
POSIX regexes expect a single backslash in the \. pattern (to match a literal dot) but to input it in a particular environment you might require two backslashes e.g., in Python "\\." and r"\." are the same.
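The distinction between what is typed and what the regex engine receives can be checked in Python 3 with a short sketch (the variable names are illustrative):

```python
import re

escaped = "\\."   # ordinary literal: the doubled backslash becomes a single one
raw = r"\."       # raw literal: the backslash is kept exactly as typed

# Both literals produce the same two-character string: backslash, dot.
same_pattern = escaped == raw
pattern_length = len(raw)

# The regex engine therefore sees a single backslash either way.
ends_with_dot = re.search(raw + "$", "archive.") is not None
print(same_pattern, pattern_length, ends_with_dot)
```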
Did you even read ?regexp? It's written there. From what I see 'POSIX 1003.2' regular expressions are used.

De-greedifying a regular expression in python

I'm trying to write a regular expression that will convert a full path filename to a short filename for a given filetype, minus the file extension.
For example, I'm trying to get just the name of the .bar file from a string using
re.search('/(.*?)\.bar$', '/def_params/param_1M56/param/foo.bar')
According to the Python re docs, *? is the ungreedy version of *, so I was expecting to get
'foo'
returned for match.group(1) but instead I got
'def_params/param_1M56/param/foo'
What am I missing here about greediness?
What you're missing isn't so much about greediness as about regular expression engines: they work from left to right, so the / matches as early as possible and the .*? is then forced to work from there. In this case, the best regex doesn't involve greediness at all (you need backtracking for that to work; it will, but could take a really long time to run if there are a lot of slashes), but a more explicit pattern:
'/([^/]*)\.bar$'
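The difference can be seen by running both patterns on the path from the question. This is a Python 3 sketch; the variable names are illustrative.

```python
import re

path = "/def_params/param_1M56/param/foo.bar"

# Non-greedy version from the question: the match still starts at the
# first "/", so the group has to stretch all the way to ".bar".
lazy = re.search(r"/(.*?)\.bar$", path).group(1)

# Explicit version from the answer: [^/]* cannot cross a "/", so only the
# last path component can satisfy the pattern.
explicit = re.search(r"/([^/]*)\.bar$", path).group(1)

print(lazy)
print(explicit)
```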
I would suggest changing your regex so that it doesn't rely on greediness.
You want only the filename before the extension .bar and everything after the final /. This should do:
re.search(r'/([^/]*)\.bar$', '/def_params/param_1M56/param/foo.bar')
What this does is it matches /, then zero or more characters (as much as possible) that are not / and then .bar.
I don't claim to understand the non-greedy operators all that well, but a solution for that particular problem would be to use ([^/]*?)
The regular expression engine matches from the left, so the first / is where the match begins. Put a greedy .* at the start and it should work.
I like regex but there is no need of one here.
path = '/def_params/param_1M56/param/foo.bar'
print path.rsplit('/',1)[1].rsplit('.')[0]
path = '/def_params/param_1M56/param/fululu'
print path.rsplit('/',1)[1].rsplit('.')[0]
path = '/def_params/param_1M56/param/one.before.two.dat'
print path.rsplit('/',1)[1].rsplit('.',1)[0]
result
foo
fululu
one.before.two
Other people have answered the regex question, but in this case there's a more efficient way than regex:
file_name = path[path.rindex('/')+1 : path.rindex('.')]
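Another regex-free option, offered here as a sketch and not as something from the answers above, is the standard library's os.path module. Unlike the rindex() slice, it does not raise ValueError when the name has no dot. The helper name short_name is made up for illustration.

```python
import os.path

def short_name(path):
    """Return the last path component without its final extension."""
    return os.path.splitext(os.path.basename(path))[0]

# Paths taken from the answer above.
names = [
    short_name("/def_params/param_1M56/param/foo.bar"),
    short_name("/def_params/param_1M56/param/fululu"),
    short_name("/def_params/param_1M56/param/one.before.two.dat"),
]
print(names)
```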
Try this one on for size:
match = re.search(r'.*/(.*?)\.bar$', '/def_params/param_1M56/param/foo.bar')
