Python regex findall numbers and dots

I'm using re.findall() to extract some version numbers from an HTML file:
>>> import re
>>> text = "<table><td>Test0.2.1.zip</td><td>Test0.2.1</td></table> Test0.2.1"
>>> re.findall("Test([\.0-9]*)", text)
['0.2.1.', '0.2.1', '0.2.1']
but I would like to only get the ones that do not end in a dot.
The filename might not always be .zip so I can't just stick .zip in the regex.
I wanna end up with:
['0.2.1', '0.2.1']
Can anyone suggest a better regex to use? :)

re.findall(r"Test([0-9.]*[0-9]+)", text)
or, a bit shorter:
re.findall(r"Test([\d.]*\d+)", text)
By the way - you do not need to escape the dot in a character class. Inside [] the . has no special meaning, it just matches a literal dot. Escaping it has no effect.
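To see both points in action, here is a minimal sketch using the sample text from the question. Note that every occurrence is still matched; the one from Test0.2.1.zip simply comes back without its trailing dot.

```python
import re

text = "<table><td>Test0.2.1.zip</td><td>Test0.2.1</td></table> Test0.2.1"

# Requiring a final digit forces the engine to backtrack off a trailing dot.
versions = re.findall(r"Test([0-9.]*[0-9]+)", text)
print(versions)

# Inside a character class, "\." and "." match the same thing.
escaped = re.findall(r"Test([\.0-9]*)", text)
unescaped = re.findall(r"Test([.0-9]*)", text)
print(escaped == unescaped)
```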

Related

Python: Extract values after decimal using regex

I am given a string containing a number, for example "44.87" or "44.8796". I want to extract everything after the decimal point (.). I tried to use a regex in Python but was not successful. I am new to Python 3.
import re
s = "44.123"
re.findall(".","44.86")
Something like s.split('.')[1] should work
If you would like to use regex try:
import re
s = "44.123"
regex_pattern = r"(?<=\.).*"
matched_string = re.findall(regex_pattern, s)
(?<=...) is a positive lookbehind: the text before the match must be the specified character, but it is not included in the result
\. is an escaped period
.* means "match everything after the period"
This online regex tool is a helpful way to test your regex as you build it. You can confirm this solution there! :)
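For comparison, here is a rough sketch of two other ways to get the fractional part, assuming the input has at most one dot. The variable names are just for illustration.

```python
import re

s = "44.123"

# Plain string methods: partition splits on the first dot only and
# returns an empty tail when there is no dot at all.
_, _, fraction = s.partition(".")
print(fraction)

# Regex: capture the digits that follow the dot.
match = re.search(r"\.(\d+)", s)
fraction_re = match.group(1) if match else ""
print(fraction_re)
```

Unlike s.split('.')[1], neither variant raises an IndexError when the string has no decimal point.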

How do I do Python re.search substrings with multi-character wildcard?

I'm trying to extract a substring from a string in Python.
The front end to be trimmed is static and easy to implement, but the rear end has a counter that can run from "_0" to "_9999".
With my current code, the counter still gets included in the substring.
import re
text = "fastq_runid_0dc971f49c42ffb1412caee485f8421a1f9a26ed_0.fastq"
print(text)
substring= re.search('runid_(.*)_*.fas', text).group(0)
print(substring)
Returns
0dc971f49c42ffb1412caee485f8421a1f9a26ed_0.fas
Alternatively,
substring= re.search(r"(?<=runid_).*?(?=_*.fastq)", text).group(0)
returns
0dc971f49c42ffb1412caee485f8421a1f9a26ed_0
Works better but the counter "_0" is still added.
How do I make a robust trim that trims the multi-character counter?
Your regex (?<=runid_).*?(?=_*.fastq) has a small problem. _* means zero or more underscores, so the underscore is optional and nothing in the lookahead accounts for the counter digits. The lookahead therefore first succeeds right before .fastq, and .*? has to consume _0 to get there, which is why _0 appears in your result. I think you meant _.*, and you should also escape the . just before fastq. The updated regex becomes:
(?<=runid_).+(?=_\d{1,4}\.fas)
Your updated python code,
import re
text = "fastq_runid_0dc971f49c42ffb1412caee485f8421a1f9a26ed_0.fastq"
print(text)
substring = re.search(r'(?<=runid_).+(?=_\d{1,4}\.fas)', text).group(0)
print(substring)
Prints,
0dc971f49c42ffb1412caee485f8421a1f9a26ed
Alternatively, you can use a simpler regex without the lookbehind and take the text from the first capture group:
runid_([^_]+)(?=_\d{1,4}\.fas)
Your Python code, taking the text from group(1) instead of group(0):
import re
text = "fastq_runid_0dc971f49c42ffb1412caee485f8421a1f9a26ed_0.fastq"
print(text)
substring = re.search(r'runid_([^_]+)(?=_\d{1,4}\.fas)', text).group(1)
print(substring)
In this case too it prints,
0dc971f49c42ffb1412caee485f8421a1f9a26ed
You don't need a lookbehind or a lookahead to achieve that.
\d{1,4} means at least 1 and at most 4 digits; anything else won't match.
fastq_runid_(.+)_\d{1,4}\.fastq
https://regex101.com/r/VneElM/1
import re
text = "fastq_runid_0dc971f49c42ffb1412caee485f8421a1f9a26ed_999.fastq"
print(text)
substring = re.search(r'fastq_runid_(\w+)_(\d+)\.fastq', text)
print(substring.group(1), substring.group(2))
group(1) will give what you want, group(2) will give the counter.
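Putting the capture-group idea into a reusable form, here is a minimal sketch. The helper name split_runid and the named groups are my own, and the pattern assumes file names always look like fastq_runid_<id>_<counter>.fastq as in the question.

```python
import re

# Hypothetical helper; assumes names of the form
# fastq_runid_<id>_<counter>.fastq with a 1 to 4 digit counter.
RUNID_RE = re.compile(r"fastq_runid_(?P<runid>\w+)_(?P<counter>\d{1,4})\.fastq")

def split_runid(filename):
    """Return (runid, counter) or None if the name does not fit the pattern."""
    m = RUNID_RE.fullmatch(filename)
    if m is None:
        return None
    return m.group("runid"), int(m.group("counter"))

print(split_runid("fastq_runid_0dc971f49c42ffb1412caee485f8421a1f9a26ed_0.fastq"))
print(split_runid("fastq_runid_0dc971f49c42ffb1412caee485f8421a1f9a26ed_9999.fastq"))
```

Using fullmatch means a name with extra text before or after the pattern is rejected rather than partially matched.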

Python regex numbers and underscores

I'm trying to get a list of files from a directory whose file names follow this pattern:
PREFIX_YYYY_MM_DD.dat
For example
FOO_2016_03_23.dat
Can't seem to get the right regex. I've tried the following:
pattern = re.compile(r'(\d{4})_(\d{2})_(\d{2}).dat')
>>> []
pattern = re.compile(r'*(\d{4})_(\d{2})_(\d{2}).dat')
>>> sre_constants.error: nothing to repeat
Regex is certainly a weak point for me. Can anyone explain where I'm going wrong?
To get the files, I'm doing:
files = [f for f in os.listdir(directory) if pattern.match(f)]
PS, how would I allow for .dat and .DAT (case insensitive file extension)?
Thanks
You have two issues with your expression:
re.compile(r'(\d{4})_(\d{2})_(\d{2}).dat')
The first one, as a previous comment stated, is that the . right before dat should be escaped with a backslash (\). Otherwise the regex engine treats it as a special character, because in a regex . means "any character".
Besides that, your expression doesn't handle an uppercase extension. You should add a group with dat and DAT as the possible choices.
With both changes made, it should look like:
re.compile(r'(\d{4})_(\d{2})_(\d{2})\.(?:dat|DAT)')
As an extra note, I added ?: at the beginning of the group to make it non-capturing, so it doesn't show up in the match results.
Use pattern.search() instead of pattern.match().
pattern.match() always matches from the start of the string (which includes the PREFIX).
pattern.search() searches anywhere within the string.
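The difference is easy to see with the file name from the question. This is just a small sketch, using the asker's date pattern with the dot escaped.

```python
import re

pattern = re.compile(r"(\d{4})_(\d{2})_(\d{2})\.dat")
name = "FOO_2016_03_23.dat"

# match() only tries at position 0, where the string has "FOO_", not a digit.
at_start = pattern.match(name)
print(at_start)

# search() scans forward until the pattern fits somewhere in the string.
anywhere = pattern.search(name)
print(anywhere.groups())
```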
Does this do what you want?
>>> import re
>>> pattern = r'\A[a-z]+_\d{4}_\d{2}_\d{2}\.dat\Z'
>>> string = 'FOO_2016_03_23.dat'
>>> re.search(pattern, string, re.IGNORECASE)
<_sre.SRE_Match object; span=(0, 18), match='FOO_2016_03_23.dat'>
>>>
It appears to match the format of the string you gave as an example.
The following should match for what you requested.
[^_]+[_]\d{4}[_]\d{2}[_]\d{2}[\.]\w+
I recommend using https://regex101.com/ (for python regular expressions) or http://regexr.com/ (for javascript regular expressions) in the future if you want to validate your regular expressions.
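To tie the answers together, here is a rough sketch of the filtering step with a case-insensitive extension. The list of names is made up and stands in for os.listdir(directory); fullmatch is my choice here so that names with extra trailing text are rejected.

```python
import re

# Hypothetical file names standing in for os.listdir(directory).
names = [
    "FOO_2016_03_23.dat",
    "BAR_2017_11_02.DAT",
    "FOO_2016_03_23.dat.bak",
    "notes.txt",
]

# re.IGNORECASE lets ".dat" also match ".DAT".
pattern = re.compile(r"[^_]+_\d{4}_\d{2}_\d{2}\.dat", re.IGNORECASE)

# fullmatch() requires the whole name to fit the pattern.
files = [n for n in names if pattern.fullmatch(n)]
print(files)
```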

Regex in python, repeated fragment finding

I'm trying to use a regex to find elements in a text such as abs=abs, 1=1, etc.
I wrote it this way:
opis="Some text abs=abs sfsdvc"
wyn=re.search('([\w]*)=\1',opis)
print(wyn.group(0))
This finds nothing, but when I tried the same pattern on websites like www.regexr.com it worked correctly.
Am I doing something wrong with Python's re?
You must specify the regex as a raw string, r'..'
>>> import re
>>> opis="Some text abs=abs sfsdvc"
>>> wyn=re.search(r'([\w]*)=\1',opis)
>>> print(wyn.group(0))
abs=abs
From re documentation
Raw string notation (r"text") keeps regular expressions sane. Without it, every backslash ('\') in a regular expression would have to be prefixed with another one to escape it. For example, the two following lines of code are functionally identical:
Meaning, if you are not planning to use a raw string, every \ in the pattern must be escaped:
>>> opis="Some text abs=abs sfsdvc"
>>> wyn=re.search('([\\w]*)=\\1',opis)
>>> print(wyn.group(0))
abs=abs
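The reason the original pattern fails can be checked directly. In this small sketch the variable names are mine; the key fact is that "\1" in a normal string literal is an octal escape for the character with code 1.

```python
import re

opis = "Some text abs=abs sfsdvc"

# In a normal string literal, "\1" is the single character chr(1),
# so the regex engine never sees a backreference.
is_control_char = "\1" == chr(1)
raw_length = len(r"\1")  # backslash plus "1"
print(is_control_char, raw_length)

# Non-raw pattern: looks for "=" followed by chr(1), which is not in the text.
non_raw = re.search("(\\w*)=\1", opis)
print(non_raw)

# Raw pattern: \1 reaches the regex engine as a backreference to group 1.
raw = re.search(r"(\w*)=\1", opis)
print(raw.group(0))
```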
Change your regex to:
re.search(r'(\w+)=\1', opis).group()
Note the + in place of *. You don't really need a character class here; the [ and ] are redundant. It's also better to use \w+ if you don't want to match a lone "=" (an equals sign with nothing on either side).
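Here is a quick sketch of the lone "=" case, with a made-up input string:

```python
import re

lonely = "a = b"

# \w* can match the empty string, so "=" alone satisfies (\w*)=\1.
star = re.search(r"(\w*)=\1", lonely)
print(repr(star.group(0)))

# \w+ needs at least one word character on each side of "=".
plus = re.search(r"(\w+)=\1", lonely)
print(plus)
```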

python regex and replace

I am trying to learn python and regex at the same time and I am having some trouble in finding how to match till end of string and make a replacement on the fly.
So, I have a string like so:
ss="this_is_my_awesome_string/mysuperid=687y98jhAlsji"
What I'd want is to first find 687y98jhAlsji (I do not know this content beforehand) and then replace it with myreplacedstuff like so:
ss="this_is_my_awesome_string/mysuperid=myreplacedstuff"
Ideally, I'd want to do a regex and replace by first finding the contents after mysuperid= (till the end of string) and then perform a .replace or .sub if this makes sense.
I would appreciate any guidance on this.
You can try this:
re.sub(r'[^=]+$', 'myreplacedstuff', ss)
The idea is to use a character class that excludes the delimiter (here =) and to anchor the pattern with $
explanation:
[^=] is a character class and means all characters that are not =
[^=]+ one or more characters from this class
$ end of the string
Since the regex engine works from the left to the right, only characters that are not an = at the end of the string are matched.
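As a minimal sketch of this approach on the string from the question, plus one edge case worth knowing about (the second input is made up):

```python
import re

ss = "this_is_my_awesome_string/mysuperid=687y98jhAlsji"

# Replace the run of non-"=" characters that ends the string.
result = re.sub(r"[^=]+$", "myreplacedstuff", ss)
print(result)

# With no "=" in the input, the whole string counts as the tail.
no_delim = re.sub(r"[^=]+$", "myreplacedstuff", "no_delimiter_here")
print(no_delim)
```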
You can use regular expressions:
>>> import re
>>> mymatch = re.search(r'mysuperid=(.*)', ss)
>>> ss.replace(mymatch.group(1), 'replacing_stuff')
'this_is_my_awesome_string/mysuperid=replacing_stuff'
You should probably use @Casimir's answer though. It looks cleaner, and I'm not that good at regex :p.
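One thing to watch with the search-then-replace version: str.replace substitutes every occurrence of the captured text, not just the one after mysuperid=. Here is a small sketch with a made-up string where the id also appears earlier, and one possible alternative that keeps the prefix through a capture group.

```python
import re

# Hypothetical input where the id value also appears earlier in the string.
ss = "abc_string/mysuperid=abc"

m = re.search(r"mysuperid=(.*)", ss)
naive = ss.replace(m.group(1), "X")  # replaces every "abc"
print(naive)

# Keep the prefix via group 1 and replace only what follows it.
fixed = re.sub(r"(mysuperid=).*", r"\g<1>X", ss)
print(fixed)
```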
