I'm trying to get a list of files from a directory whose file names follow this pattern:
PREFIX_YYYY_MM_DD.dat
For example
FOO_2016_03_23.dat
Can't seem to get the right regex. I've tried the following:
pattern = re.compile(r'(\d{4})_(\d{2})_(\d{2}).dat')
>>> []
pattern = re.compile(r'*(\d{4})_(\d{2})_(\d{2}).dat')
>>> sre_constants.error: nothing to repeat
Regex is certainly a weak point for me. Can anyone explain where I'm going wrong?
To get the files, I'm doing:
files = [f for f in os.listdir(directory) if pattern.match(f)]
PS, how would I allow for .dat and .DAT (case insensitive file extension)?
Thanks
You have two issues with your expression:
re.compile(r'(\d{4})_(\d{2})_(\d{2}).dat')
The first one, as a previous comment stated, is that the . right before dat should be escaped by putting a backslash (\) in front of it. Otherwise, Python treats it as a special character, because in a regex . means "any character".
Besides that, your expression doesn't handle uppercase extensions. You should add a group for this, with dat and DAT as the possible choices.
With both changes made, it should look like:
re.compile(r'(\d{4})_(\d{2})_(\d{2})\.(?:dat|DAT)')
As an extra note, I added ?: at the beginning of the group to make it non-capturing, so the regex matcher leaves it out of the captured groups.
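Both points can be checked with a short sketch. The file name without a dot is made up for illustration; it is not from the question.

```python
import re

unescaped = re.compile(r'(\d{4})_(\d{2})_(\d{2}).dat')
escaped = re.compile(r'(\d{4})_(\d{2})_(\d{2})\.(?:dat|DAT)')

# Hypothetical name with no dot before "dat": the unescaped "." accepts the "X".
odd_name = 'FOO_2016_03_23Xdat'
unescaped_hit = unescaped.search(odd_name) is not None
escaped_hit = escaped.search(odd_name) is not None

# The non-capturing group keeps the extension out of the captured groups.
date_parts = escaped.search('FOO_2016_03_23.DAT').groups()

print(unescaped_hit, escaped_hit, date_parts)
# True False ('2016', '03', '23')
```

Note that search() is used here on purpose, since the names start with a prefix that the pattern does not describe.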
Use pattern.search() instead of pattern.match().
pattern.match() always matches from the start of the string (which includes the PREFIX).
pattern.search() searches anywhere within the string.
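A minimal sketch of the difference, using the file name from the question and the pattern with the dot escaped:

```python
import re

pattern = re.compile(r'(\d{4})_(\d{2})_(\d{2})\.dat')
name = 'FOO_2016_03_23.dat'

# match() starts at index 0, where it finds "F" instead of a digit.
match_result = pattern.match(name)
# search() scans forward until the date part fits.
search_result = pattern.search(name)

print(match_result)
print(search_result.group(0))
# None
# 2016_03_23.dat
```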
Does this do what you want?
>>> import re
>>> pattern = r'\A[a-z]+_\d{4}_\d{2}_\d{2}\.dat\Z'
>>> string = 'FOO_2016_03_23.dat'
>>> re.search(pattern, string, re.IGNORECASE)
<_sre.SRE_Match object; span=(0, 18), match='FOO_2016_03_23.dat'>
>>>
It appears to match the format of the string you gave as an example.
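To apply the same pattern to a whole directory listing, something along these lines should work. The helper name matching_files and the sample names are assumptions for illustration; in the question's setup the names would come from os.listdir(directory).

```python
import re

# \A and \Z anchor the whole name; IGNORECASE covers FOO/foo and .dat/.DAT alike.
PATTERN = re.compile(r'\A[a-z]+_\d{4}_\d{2}_\d{2}\.dat\Z', re.IGNORECASE)

def matching_files(names):
    """Return the names that follow PREFIX_YYYY_MM_DD.dat (hypothetical helper)."""
    return [name for name in names if PATTERN.search(name)]

# Made-up listing standing in for os.listdir(directory).
sample = ['FOO_2016_03_23.dat', 'bar_2015_12_01.DAT',
          'FOO_2016_03_23.dat.bak', 'notes.txt']
print(matching_files(sample))
# ['FOO_2016_03_23.dat', 'bar_2015_12_01.DAT']
```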
The following should match what you requested.
[^_]+[_]\d{4}[_]\d{2}[_]\d{2}[\.]\w+
I recommend using https://regex101.com/ (for python regular expressions) or http://regexr.com/ (for javascript regular expressions) in the future if you want to validate your regular expressions.
I have a list of proc names on Linux. Some have slash, some don't. For example,
kworker/23:1
migration/39
qmgr
I need to extract just the proc name without the slash and the rest. I tried a few different ways but still won't get it completely correct. What's wrong with my regex? Any help would be much appreciated.
>>> str='kworker/23:1'
>>> match=re.search(r'^(.+)\/*',str)
>>> match.group(1)
'kworker/23:1'
The problem with the regex is that the greedy .+ runs all the way to the end of the string. Everything after it is optional (\/* can match zero slashes), so the engine never has to give anything back and the trailing part ends up matching nothing. To fix this, replace the . with "anything but a /".
([^\/]+)\/?.*
works. You can test this regex here. In case it is new to you, [^\/] matches anything but a slash, because the ^ at the beginning of the character class inverts which characters are matched.
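Here is a small comparison of the two patterns on the three names from the question, as a sketch:

```python
import re

names = ['kworker/23:1', 'migration/39', 'qmgr']

# Original pattern: .+ swallows the slash, and \/* is happy to match nothing.
greedy = [re.search(r'^(.+)\/*', n).group(1) for n in names]

# Negated class: [^\/]+ cannot cross a slash, so the capture stops there.
fixed = [re.match(r'([^\/]+)\/?.*', n).group(1) for n in names]

print(greedy)
print(fixed)
# ['kworker/23:1', 'migration/39', 'qmgr']
# ['kworker', 'migration', 'qmgr']
```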
Alternatively, you can also use split as suggested by Moses Koledoye. split is often better for simple string manipulation, while regex enables you to perform very complex tasks with rather little code.
An alternative to regex is to split on slash and take the first item:
>>> s ='kworker/23:1'
>>> s.split('/')[0]
'kworker'
This also works when the string does not contain a slash:
>>> s = 'qmgr'
>>> s.split('/')[0]
'qmgr'
But if you're going to stick to re, I think re.sub is what you want, as you won't need to fetch the matching group:
>>> import re
>>> s ='kworker/23:1'
>>> re.sub(r'/.*$', '', s)
'kworker'
On a side note, assigning the name str shadows the built-in string type, which you don't want.
I'm trying to use a regex to find elements like abs=abs, 1=1, etc. in a text.
I wrote it this way:
opis="Some text abs=abs sfsdvc"
wyn=re.search('([\w]*)=\1',opis)
print(wyn.group(0))
This finds nothing, but when I tried the same regex on websites like www.regexr.com it worked correctly.
Am I doing something wrong with Python's re?
You must specify the regex as a raw string, r'..'
>>> opis="Some text abs=abs sfsdvc"
>>> wyn=re.search(r'([\w]*)=\1',opis)
>>> print wyn.group(0)
abs=abs
From the re documentation:
Raw string notation (r"text") keeps regular expressions sane. Without it, every backslash ('\') in a regular expression would have to be prefixed with another one to escape it. For example, the two following lines of code are functionally identical:
Meaning, if you are not planning to use a raw string, then every \ in the string must be escaped, as in:
>>> opis="Some text abs=abs sfsdvc"
>>> wyn=re.search('([\\w]*)=\\1',opis)
>>> print wyn.group(0)
abs=abs
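The reason the non-raw version fails can be seen by looking at what the string literal actually contains. This is only a sketch; the variable names are made up.

```python
import re

opis = "Some text abs=abs sfsdvc"

# In a normal string literal, '\1' is an octal escape: one character, chr(1).
plain_len = len('\1')
plain_is_control_char = '\1' == chr(1)
# In a raw literal the backslash survives, which is what re needs for a backreference.
raw_len = len(r'\1')

# Without the raw prefix the pattern asks for "=" followed by chr(1).
no_backref = re.search('(\\w*)=\1', opis)
with_backref = re.search(r'(\w*)=\1', opis)

print(plain_len, plain_is_control_char, raw_len)
print(no_backref, with_backref.group(0))
# 1 True 2
# None abs=abs
```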
Change your regex to:
re.search(r'(\w+)=\1', opis).group()
↑
Note that you don't really need a character class here; the [ and ] are redundant. It's also better to use \w+ if you don't want to match the string "=" (a lonely equals sign).
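The lonely equals sign case can be reproduced like this (the input 'a = b' is a made-up example):

```python
import re

text = 'a = b'

# With *, the group may be empty, so a bare "=" counts as "nothing = nothing".
star = re.search(r'([\w]*)=\1', text)
# With +, at least one word character is required on each side.
plus = re.search(r'(\w+)=\1', text)

print(repr(star.group(0)), plus)
# '=' None
print(re.search(r'(\w+)=\1', 'x 1=1 y').group(0))
# 1=1
```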
I have the following regex:
(\b)(con)
This matches:
.con
con
But I only want to match the second line 'con' not '.con'.
This then needs expanding to enable me to match alternative words (CON|COM1|LPT1) etc. And in those scenarios, I need to match the dot afterwards and potentially file extensions too. I have regex for these. I am attempting to understand one part of the expression at a time.
How can I tighten what I've got to give me the specific match I require?
Edit:
You can use non-capturing groups and re.match (which is anchored to the start of the string):
>>> from re import match
>>> strs = ["CON.txt", "LPT1.png", "COM1.html", "CON.jpg"]
>>> # This can be customized to what you want
>>> # Right now, it is matching .jpg and .png files with the proper beginning
>>> [x for x in strs if match(r"(?:CON|COM1|LPT1)\.(?:jpg|png)$", x)]
['LPT1.png', 'CON.jpg']
>>>
Below is a breakdown of the Regex pattern:
(?:CON|COM1|LPT1) # CON, COM1, or LPT1
\. # A period
(?:jpg|png) # jpg or png
$ # The end of the string
You may also want to add (?i) to the start of the pattern in order to have case-insensitive matching.
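A sketch of the case-insensitive variant, with made-up file names in mixed case:

```python
import re

# (?i) must sit at the very start of the pattern in recent Python versions.
pattern = re.compile(r"(?i)(?:CON|COM1|LPT1)\.(?:jpg|png)$")

names = ["con.JPG", "Lpt1.png", "COM1.html", "CONSOLE.png"]
hits = [n for n in names if pattern.match(n)]
print(hits)
# ['con.JPG', 'Lpt1.png']
```

Passing re.IGNORECASE as the flags argument of re.compile has the same effect as the inline (?i).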
^ matches start of a string:
^con
would work.
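To see why \b lets .con through while ^ does not, here is a quick sketch with the two lines from the question:

```python
import re

lines = ['.con', 'con']

# \b sits between "." and "c", so (\b)(con) matches inside ".con" too.
boundary = [s for s in lines if re.search(r'(\b)(con)', s)]
# ^ only matches at the start of the string, which rules out ".con".
anchored = [s for s in lines if re.search(r'^con', s)]

print(boundary, anchored)
# ['.con', 'con'] ['con']
```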
Suppose I have strings with the following structure:
cat&345;
bat &#hut;
I want to replace everything starting at & and ending just before the ; (the ; itself is not included). What is the best way to do so?
Including or not including the & in the replacement?
>>> re.sub(r'&.*?(?=;)','REPL','cat&345;') # including
'catREPL;'
>>> re.sub(r'(?<=&).*?(?=;)','REPL','bat &#hut;') # not including
'bat &REPL;'
Explanation:
Although not required here, use an r'raw string' to avoid having to escape the backslashes that often occur in regular expressions.
.*? is a "non-greedy" match of anything, which makes the match stop at the first semicolon.
(?=;) the match must be followed by a semicolon, but it is not included in the match.
(?<=&) the match must be preceded by an ampersand, but it is not included in the match.
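The non-greedy part matters as soon as a line holds more than one of these. A sketch, with both of the question's samples joined into one string:

```python
import re

text = 'cat&345; bat &#hut;'

# Lazy: each match stops just before the nearest semicolon.
lazy = re.sub(r'&.*?(?=;)', 'REPL', text)
# Greedy: one match runs from the first & up to the last semicolon.
greedy = re.sub(r'&.*(?=;)', 'REPL', text)

print(lazy)
print(greedy)
# catREPL; bat REPL;
# catREPL;
```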
Here is a good regex
import re
result = re.sub("(?<=\\&).*(?=;)", replacementstr, searchText)
Basically this will put the replacement in between the & and the ;
Maybe go in a different direction altogether and use HTMLParser.unescape(). The unescape() method is undocumented, but it doesn't appear to be "internal" because it doesn't have a leading underscore.
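In Python 3 the documented way to do this is the function html.unescape(). A small sketch follows; it assumes the input contains real HTML entities, which the samples in the question (&345;, &#hut;) are not, so the sample text here is made up.

```python
import html

# Made-up input with valid named and numeric character references.
escaped = 'fish &amp; chips &lt;b&gt; &#38; more'
plain = html.unescape(escaped)
print(plain)
# fish & chips <b> & more
```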
You can use negated character classes to do this:
import re
st='''\
cat&345;
bat &#hut;'''
for line in st.splitlines():
    print line
    print re.sub(r'([^&]*)&[^;]*;',r'\1;',line)
I want to read a word html file and grab any words which contain letters of a name, but not print them if the words are longer than the name.
# compiling the regular expression:
keyword = re.compile(r"^[(rR)|(yY)|(aA)|(nN)]{5}$/")
if keyword.search (line):
    print line,
I am grabbing the words with this, but I don't seem to be limiting the size properly.
It seems you are looking for keyword.match() instead of keyword.search(). You should read this part of the Python documentation, which discusses the difference between match and search.
Also, your regular expression seems completely off: [ and ] delimit a set of characters, so you can't put groups inside them or apply logic around the groups. As written, your expression will also match (, ) and |. You may try the following:
keyword = re.compile(r"^[rRyYaAnN]{5}$")
Your RE "^[(rR)|(yY)|(aA)|(nN)]{5}$/" will never never never produce a match in any string on earth and elsewhere, I think, because of the '/' character after '$'
See the results of the RE without this '/':
import re
pat = re.compile("^[(rR)|(yY)|(aA)|(nN)]{5}$")
for ch in ('arrrN','Aar)N','()|Ny','NNNNN',
           'marrrN','12Aar)NUUU','NNNNN!'):
    print ch.ljust(15),pat.search(ch)
result
arrrN           <_sre.SRE_Match object at 0x011C8EC8>
Aar)N           <_sre.SRE_Match object at 0x011C8EC8>
()|Ny           <_sre.SRE_Match object at 0x011C8EC8>
NNNNN           <_sre.SRE_Match object at 0x011C8EC8>
marrrN          None
12Aar)NUUU      None
NNNNN!          None
My advice: think of [.....] in a RE as representing ONE character at ONE position. So every character between the brackets is one of the options for that single character.
Moreover, as said by Adrien Plisson, between brackets [......] a lot of special characters lose their special meaning. Hence '(', ')' and '|' don't define groups and OR; they just stand for themselves, as further options along with the letters 'aArRyYnN'.
"^[rRyYaAnN]{1,5}$" will match only strings such as 'r', 'ar', 'YNa', 'YYnA', 'Nanny'
If you want to match the same words anywhere in a text, you will need "[rRyYaAnN]{1,5}"
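A sketch of the unanchored pattern on a made-up sentence. Without anchors it also picks up fragments of longer words, so the second pattern adds \b word boundaries; that addition is my assumption about what is wanted, not something stated above.

```python
import re

text = 'Nanny ran a yarn near Ryan, anyway'

# Unanchored: also returns pieces of "near" and "anyway".
loose = re.findall(r'[rRyYaAnN]{1,5}', text)
# With \b on both sides, only whole words made of these letters qualify.
whole = re.findall(r'\b[rRyYaAnN]{1,5}\b', text)

print(loose)
print(whole)
# ['Nanny', 'ran', 'a', 'yarn', 'n', 'ar', 'Ryan', 'any', 'ay']
# ['Nanny', 'ran', 'a', 'yarn', 'Ryan']
```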