python regex with repeating subpattern - python

I am wondering if there is a 'smart' way (one regex expression) to extract IDs from the following paragraph:
...
imgList = '9/optimized/1260089_fpx.tif,0/optimized/1260090_fpx.tif';
...
The result should be a list containing 1260089 and 1260090. There may be up to 10 IDs.
I need something like:
re.findall('imgList = (some expression)', string)
Any ideas?

The best approach is a single regex that finds all the numbers, using re.findall:
>>> imgList = '9/optimized/1260089_fpx.tif,0/optimized/1260090_fpx.tif'
>>> import re
>>> re.findall('optimized/([0-9]*)_fpx', imgList)
['1260089', '1260090']
You could of course make the regex stronger, but if the data is as you indicated, this should suffice.

import re
s = '9/optimized/1260089_fpx.tif,0/optimized/1260090_fpx.tif'
print(re.findall(r'(\d+)_fpx\.tif', s))

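If the optimized/ and _fpx parts are not guaranteed and the ID is between 7 and 10 digits long,
you could do something like
import re
re.findall(r'\d{7,10}', imgList)
This finds every run of 7 to 10 digits in the string, so numbers with fewer than 7 digits are excluded. Note that a run of more than 10 digits is not skipped: its first 10 digits are matched, so add lookarounds such as (?<!\d) and (?!\d) if longer numbers can occur.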

import re
imgList = '9/optimized/1260089_fpx.tif,0/optimized/1260090_fpx.tif'
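re.findall(r'[0-9]{7}', imgList)
['1260089', '1260090']
This only works for the situation shown, where every ID has exactly seven digits. (A repeated capturing group such as ([0-9]){7} would make findall return only the last digit of each match.)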
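Since the question says the imgList line sits inside a larger paragraph, one way to avoid picking up unrelated numbers is to isolate the imgList value first and then run findall on that value only. A minimal sketch, assuming the value is always single-quoted; the surrounding text in page and the helper name extract_ids are made up for illustration:

```python
import re

# Hypothetical surrounding text; only the imgList line comes from the question.
page = """
var width = 1234567;
imgList = '9/optimized/1260089_fpx.tif,0/optimized/1260090_fpx.tif';
var other = 'x/optimized/7654321_fpx.tif';
"""

def extract_ids(text):
    # Step 1: grab the quoted value assigned to imgList.
    m = re.search(r"imgList\s*=\s*'([^']*)'", text)
    if m is None:
        return []
    # Step 2: pull every ID out of that value only.
    return re.findall(r"/(\d+)_fpx\.tif", m.group(1))

ids = extract_ids(page)
print(ids)
```

Doing it in two steps keeps each pattern simple, because a single capturing group that repeats inside one regex only keeps its last match.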

Related

I have a string "hello\n1hello123\n2yahoo". Want to split it with \n[integer value]

I have a string in python.
str1 = "hello\n1hello123\n2yahoo"
I would like to split this with \n[integer value] to get a list that looks like:
[hello, hello123, yahoo]
Can anyone please help?
As someone who goes too far in avoiding regular expressions (avoiding them whenever possible rather than simply avoiding them when they are inappropriate), I would lean towards splitting on \n and processing the resulting list element by element:
from string import digits
result = [x.lstrip(digits) for x in str1.split("\n")]
If you are less regex-averse than I, and as recommended in the comments,
from re import split
from string import digits
results = split(f'\n[{digits}]*', str1)
Using the regular expression package:
import re
str1 = "hello\n1hello123\n2yahoo"
print(re.split(r"\n\d+", str1))
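To compare the two styles of answer side by side, here is a small sketch that runs a split-then-strip version and a regex version on the sample string from the question. The variable names without_regex and with_regex are only for illustration:

```python
import re
from string import digits

str1 = "hello\n1hello123\n2yahoo"

# Split on newlines, then strip leading digits from each piece.
without_regex = [part.lstrip(digits) for part in str1.split("\n")]

# Split on a newline followed by one or more digits.
with_regex = re.split(r"\n\d+", str1)

print(without_regex)
print(with_regex)
```

Both give the same list for this input. They differ on other inputs: the first version also splits on a newline that has no digit after it and strips leading digits from the very first piece, while the regex version does neither.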

Python regular expression match number in string

I used a regular expression in Python 2.7 to match the numbers in a string, but my expression fails to match a single-digit number. Here is my code:
import re
import cv2
s = '858 1790 -156.25 2'
re_matchData = re.compile(r'\-?\d{1,10}\.?\d{1,10}')
data = re.findall(re_matchData, s)
print data
and then print:
['858', '1790', '-156.25']
but when I change expression from
re_matchData = re.compile(r'\-?\d{1,10}\.?\d{1,10}')
to
re_matchData = re.compile(r'\-?\d{0,10}\.?\d{1,10}')
then print:
['858', '1790', '-156.25', '2']
Is there some difference between \d{1,10} and \d{0,10} that I am confusing?
If I did something wrong, how do I correct it?
Thanks for checking my question!
Try this:
r'\-?\d{1,10}(?:\.\d{1,10})?'
Use (?:...)? to make the fractional part optional.
For r'\-?\d{0,10}\.?\d{1,10}', it is the \.?\d{1,10} part that matched 2.
The first \d{1,10} matches from 1 to 10 digits, and the second \d{1,10} also matches from 1 to 10 digits. In order for them both to match, you need at least 2 digits in your number, with an optional . between them.
You should make the entire fraction optional, not just the dot:
r'\-?\d{1,10}(?:\.\d{1,10})?'
I would rather do as follows:
import re
s = '858 1790 -156.25 2'
re_matchData = re.compile(r'\-?\d{1,10}\.?\d{0,10}')
data = re_matchData.findall(s)
print data
Output:
['858', '1790', '-156.25', '2']
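The patterns discussed in this thread can be compared directly. This is a sketch, not a definitive number parser; the names grouped, original and loose are only labels for the three patterns, and the 'total 5. next' input is a made-up extra case:

```python
import re

s = '858 1790 -156.25 2'

# Integer part required; the ".digits" fraction is optional as one unit.
grouped = re.compile(r'-?\d{1,10}(?:\.\d{1,10})?')
# Pattern from the question: needs at least two digits overall.
original = re.compile(r'-?\d{1,10}\.?\d{1,10}')
# Pattern from the last answer: dot and trailing digits independently optional.
loose = re.compile(r'-?\d{1,10}\.?\d{0,10}')

print(grouped.findall(s))
print(original.findall(s))
print(loose.findall(s))

# Hypothetical extra input: a number followed by a sentence-ending dot.
print(grouped.findall('total 5. next'))
print(loose.findall('total 5. next'))
```

On the sample string, grouped and loose both find all four numbers. On the extra input, loose keeps the trailing dot ('5.') while grouped returns just '5', which is one reason to prefer the optional group.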

How do I extract some string from a long string in Python?

I have a lot of long strings - not all of them have the same length and content, so that's why I can't use indices - and I want to extract a string from all of them. This is what I want to extract:
http://www.someDomainName.com/anyNumber
SomeDomainName doesn't contain any numbers, and anyNumber is different in each long string. The code should extract the desired string from any possible string and should cope with spaces and any other oddities that might appear in the long string (this should be possible with a regex, right?). Could anybody help me with this? Thank you.
Update: I should have said that www. and .com are always the same, and so is someDomainName. But there is another http://www. in the string.
import re
results = re.findall(r'\bhttp://www\.someDomainName\.com/\d+\b', long_string)
>>> import re
>>> pattern = re.compile("(http://www\\.)(\\w*)(\\.com/)(\\d+)")
>>> matches = pattern.search("http://www.someDomainName.com/2134")
>>> if matches:
...     print(matches.group(0))
...     print(matches.group(1))
...     print(matches.group(2))
...     print(matches.group(3))
...     print(matches.group(4))
...
http://www.someDomainName.com/2134
http://www.
someDomainName
.com/
2134
The match above exposes five groups:
Group 0 is the complete string that was matched.
The rest follow the order of the parentheses in the pattern. (So you are looking for the second one: (\\w*).)
If you want, you can capture only the part of the string you are interested in: remove the parentheses from the rest of the pattern and keep just (\\w*).
>>> pattern = re.compile("http://www\\.(\\w*)\\.com/\\d+")
>>> matches = pattern.search("http://www.someDomainName.com/2134")
>>> if matches:
...     print(matches.group(1))
...
someDomainName
In the above example you won't have groups 2, 3 and 4 as in the previous example, because only one group is captured. Group 0 is always available: it is the complete string that matched.
Yeah, your simplest bet is regex. Here's something that will probably get the job done:
import re
matcher = re.compile(r'www\.(.+)\.com/(.+)')
matches = matcher.search(yourstring)
if matches:
    str1, str2 = matches.groups()
If you are sure that there are no dots in SomeDomainName, you can just take the first occurrence of the string ".com/" and take everything from that index on.
This avoids regular expressions, which are harder to maintain.
exp = 'http://www.aejlidjaelidjl.com/alieilael'
print(exp[exp.find('.com/')+5:])
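Given the update in the question (the domain is fixed and the string also contains another http://www. URL), a sketch along these lines pins the pattern to the known domain so the other URL is ignored. The long_string value is invented for illustration:

```python
import re

# Hypothetical long string containing a second, unrelated http://www. URL.
long_string = (
    "junk http://www.other.org/page more text "
    "see http://www.someDomainName.com/2134 and so on"
)

# Domain is fixed per the question's update; only the number varies.
pattern = re.compile(r'http://www\.someDomainName\.com/\d+')

match = pattern.search(long_string)
url = match.group(0) if match else None
print(url)
```

Escaping the dots matters here: an unescaped . would match any character, not only a literal dot.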

How to use regex to parse a number from HTML?

I want to write a simple regular expression in Python that extracts a number from HTML. The HTML sample is as follows:
Your number is <b>123</b>
Now, how can I extract "123", i.e. the contents of the first bold text after the string "Your number is"?
import re
m = re.search(r"Your number is <b>(\d+)</b>",
              "xxx Your number is <b>123</b> fdjsk")
if m:
    print(m.groups()[0])
Given s = "Your number is <b>123</b>" then:
import re
m = re.search(r"\d+", s)
will work and give you
m.group()
'123'
The regular expression looks for 1 or more consecutive digits in your string.
Note that in this specific case we knew that there would be a numeric sequence. Otherwise you would have to test the return value of re.search() to make sure that m is not None; calling m.group() on None raises an AttributeError.
Of course if you are going to process a lot of HTML you want to take a serious look at BeautifulSoup - it's meant for that and much more. The whole idea with BeautifulSoup is to avoid "manual" parsing using string ops or regular expressions.
import re
x = 'Your number is <b>123</b>'
re.search(r'(?<=Your number is )<b>(\d+)</b>', x).group(1)
This searches for the number that follows the 'Your number is' string; group(1) returns just the digits, whereas group(0) would return the whole <b>123</b> match.
import re
print re.search(r'(\d+)', 'Your number is <b>123</b>').group(0)
The simplest way is to just extract the digits:
re.search(r"\d+",text)
val="Your number is <b>123</b>"
Option : 1
m=re.search(r'(<.*?>)(\d+)(<.*?>)',val)
m.group(2)
Option : 2
re.sub(r'([\s\S]+)(<.*?>)(\d+)(<.*?>)',r'\3',val)
import re
found = re.search(r"Your number is <b>(\d+)</b>", "something.... Your number is <b>123</b> something...")
if found:
    print(found.groups()[0])
Here (\d+) is the capturing group. Since there is only one group, groups()[0] (equivalently group(1)) returns it. When there are several groups, use the index of the group you want.
To extract as python list you can use findall
>>> import re
>>> string = 'Your number is <b>123</b>'
>>> pattern = '\d+'
>>> re.findall(pattern,string)
['123']
>>>
You can use the following example to solve your problem:
import re
text = "Your number is <b>123</b>"
match = re.search(r"\d+", text)  # match object for the first run of digits
print("Matched Number:", match.group(0))
print("Starting Index Of Digit:", match.start())
print("Ending Index Of Digit:", match.end())
import re
x = 'Your number is <b>123</b>'
output = re.search('(?<=Your number is )<b>(\d+)</b>',x).group(1)
print(output)
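Several of the snippets in this thread call .group() directly on the result of re.search, which fails when nothing matches. A small sketch that anchors on the label and guards against a missing match; the function name number_after_label is made up for illustration:

```python
import re

def number_after_label(html):
    # Capture the digits inside the first <b>...</b> after the label.
    m = re.search(r'Your number is <b>(\d+)</b>', html)
    # re.search returns None when nothing matches, so guard before .group().
    return m.group(1) if m else None

print(number_after_label('xxx Your number is <b>123</b> fdjsk'))
print(number_after_label('no number here'))
```

Anchoring on the label text, rather than matching the first run of digits anywhere, avoids picking up an unrelated number that appears earlier in the page.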

Number of regex matches

I'm using the finditer function in the re module to match some things and everything is working.
Now I need to find out how many matches I've got. Is it possible without looping through the iterator twice? (one to find out the count and then the real iteration)
Some code:
imageMatches = re.finditer("<img src\=\"(?P<path>[-/\w\.]+)\"", response[2])
# <Here I need to get the number of matches>
for imageMatch in imageMatches:
    doStuff
Everything works, I just need to get the number of matches before the loop.
If you know you will want all the matches, you could use the re.findall function. It will return a list of all the matches. Then you can just do len(result) for the number of matches.
If you always need to know the length, and you just need the content of the match rather than the other info, you might as well use re.findall. Otherwise, if you only need the length sometimes, you can use e.g.
matches = re.finditer(...)
...
matches = tuple(matches)
to store the iteration of the matches in a reusable tuple. Then just do len(matches).
Another option, if you just need to know the total count after doing whatever with the match objects, is to use
matches = enumerate(re.finditer(...))
which will return an (index, match) pair for each of the original matches. So then you can just store the first element of each tuple in some variable. Because enumerate starts at 0, the total count is the last index plus one.
But if you need the length first of all, and you need match objects as opposed to just the strings, you should just do
matches = tuple(re.finditer(...))
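As a sketch of the tuple approach, using the pattern from the question (the html string is a made-up stand-in for response[2]):

```python
import re

# Hypothetical HTML snippet standing in for response[2] from the question.
html = '<img src="/a/b.png"> text <img src="/c/d-e.jpg">'

# Materialise the iterator once so it can be counted and then reused.
matches = tuple(re.finditer(r'<img src="(?P<path>[-/\w.]+)"', html))

count = len(matches)
paths = [m.group('path') for m in matches]
print(count, paths)
```

The count is known before the loop, and the match objects are still available afterwards, at the cost of holding all of them in memory at once.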
#An example for counting matched groups
import re
pattern = re.compile(r'(\w+).(\d+).(\w+).(\w+)', re.IGNORECASE)
search_str = "My 11 Char String"
res = re.match(pattern, search_str)
print(len(res.groups())) # len = 4
print (res.group(1) ) #My
print (res.group(2) ) #11
print (res.group(3) ) #Char
print (res.group(4) ) #String
If you find you need to stick with finditer(), you can simply use a counter while you iterate through the iterator.
Example:
>>> from re import *
>>> pattern = compile(r'.ython')
>>> string = 'i like python jython and dython (whatever that is)'
>>> iterator = finditer(pattern, string)
>>> count = 0
>>> for match in iterator:
...     count += 1
...
>>> count
3
If you need the features of finditer() (a match object for each non-overlapping match), use this method.
I know this is a little old, but here is a concise function for counting regex matches.
import re
def regex_cnt(string, pattern):
    return len(re.findall(pattern, string))
string = 'abc123'
regex_cnt(string, '[0-9]')  # 3
For those moments when you really want to avoid building lists:
import re
import operator
from functools import reduce
count = reduce(operator.add, (1 for _ in re.finditer(my_pattern, my_string)), 0)  # initial 0 avoids a TypeError when there are no matches
Sometimes you might need to operate on huge strings. This might help.
If you are using the finditer method, the best way to count the matches is to initialize a counter and increment it with each match.
