How to use regex to parse a number from HTML? - python

I want to write a simple regular expression in Python that extracts a number from HTML. The HTML sample is as follows:
Your number is <b>123</b>
Now, how can I extract "123", i.e. the contents of the first bold text after the string "Your number is"?

import re
m = re.search(r"Your number is <b>(\d+)</b>",
              "xxx Your number is <b>123</b> fdjsk")
if m:
    print(m.groups()[0])

Given s = "Your number is <b>123</b>" then:
import re
m = re.search(r"\d+", s)
will work and give you
m.group()
'123'
The regular expression looks for 1 or more consecutive digits in your string.
Note that in this specific case we knew there would be a numeric sequence. Otherwise you would have to test the return value of re.search() to make sure it is not None, because calling m.group() on None raises an AttributeError.
Of course if you are going to process a lot of HTML you want to take a serious look at BeautifulSoup - it's meant for that and much more. The whole idea with BeautifulSoup is to avoid "manual" parsing using string ops or regular expressions.
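The None check mentioned above can be wrapped in a small helper. This is only a minimal sketch: the names NUMBER_RE and extract_number are hypothetical, and it assumes the number always sits in a <b> tag right after "Your number is".

```python
import re

# Hypothetical helper (not from the original answers): returns the number
# as an int, or None when the phrase does not occur in the input.
NUMBER_RE = re.compile(r"Your number is\s*<b>(\d+)</b>")

def extract_number(html):
    m = NUMBER_RE.search(html)
    if m is None:  # re.search() returns None when nothing matches
        return None
    return int(m.group(1))

print(extract_number("xxx Your number is <b>123</b> fdjsk"))  # 123
print(extract_number("no number here"))                       # None
```

Returning None instead of raising keeps the caller in charge of what a missing number means.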

import re
x = 'Your number is <b>123</b>'
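re.search(r'(?<=Your number is )<b>(\d+)</b>', x).group(1)
This searches for the number that follows the 'Your number is' string. Note that group(0) would return the whole match, '<b>123</b>', while group(1) returns only the digits.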

import re
print(re.search(r'(\d+)', 'Your number is <b>123</b>').group(0))

The simplest way is to just extract the digits:
re.search(r"\d+",text)

val="Your number is <b>123</b>"
Option 1:
m = re.search(r'(<.*?>)(\d+)(<.*?>)', val)
m.group(2)
Option 2:
re.sub(r'([\s\S]+)(<.*?>)(\d+)(<.*?>)', r'\3', val)

import re
found = re.search(r"Your number is <b>(\d+)</b>", "something.... Your number is <b>123</b> something...")
if found:
    print(found.group(1))
Here (\d+) is the capturing group. Since there is only one group, group(1) returns it (group(0) is the whole match). When there are several groups, pass the index of the group you want.
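To make the group numbering concrete, here is a small sketch with two capturing groups. The pattern is made up for illustration and is not from the original answers.

```python
import re

# Illustrative pattern with two capturing groups.
m = re.search(r"(\w+) is <b>(\d+)</b>", "Your number is <b>123</b>")

whole = m.group(0)       # the entire matched text
first = m.group(1)       # first parenthesised group
second = m.group(2)      # second parenthesised group
all_groups = m.groups()  # tuple of all groups, group 0 not included

print(whole)       # number is <b>123</b>
print(first)       # number
print(second)      # 123
print(all_groups)  # ('number', '123')
```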

To extract the numbers as a Python list, you can use findall:
>>> import re
>>> string = 'Your number is <b>123</b>'
>>> pattern = r'\d+'
>>> re.findall(pattern,string)
['123']
>>>
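findall becomes more useful when the HTML holds several numbers. The sketch below assumes, for illustration only, that every number of interest is wrapped in its own <b> tag; the sample string is invented.

```python
import re

# Invented sample with two bold numbers.
html = "Your numbers are <b>123</b> and <b>45</b>"

# With exactly one capturing group, findall returns the group contents.
numbers = [int(n) for n in re.findall(r"<b>(\d+)</b>", html)]
print(numbers)  # [123, 45]
```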

You can use the following example to solve your problem:
import re
text = "Your number is <b>123</b>"
match = re.search(r"\d+", text)  # match object for the first run of digits
print("Matched number:", match.group(0))
print("Starting Index Of Digit:", match.start())
print("Ending Index Of Digit:", match.end())

import re
x = 'Your number is <b>123</b>'
output = re.search(r'(?<=Your number is )<b>(\d+)</b>', x).group(1)
print(output)

Related

Best way to convert string to integer in Python

I have a spreadsheet with text values like A067, A002, A104. What is the most efficient way to convert them to integers? Right now I am doing the following:
s = 'A067'
s = s.replace('A', '')
n = int(s)
print(n)
Depending on your data, the following might be suitable:
import string
print(int('A067'.strip(string.ascii_letters)))
Python's str.strip() method takes a string of characters to remove from the start and end of a string. Passing string.ascii_letters removes any leading and trailing letters.
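Since strip() only works on the two ends of the string, letters in the middle survive. A quick sketch with made-up sample values shows where this approach stops being enough:

```python
import string

# Made-up sample values; only the first comes from the question.
samples = ['A067', 'AB067X', 'A0B67']
stripped = [s.strip(string.ascii_letters) for s in samples]
print(stripped)  # ['067', '067', '0B67']
```

The last value would make int() raise a ValueError, so this approach fits only when the letters are confined to the ends.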
If the only non-number part of the input will be the first letter, the fastest way will probably be to slice the string:
s = 'A067'
n = int(s[1:])
print(n)
If you expect to find more than one number per string, though, the regex-based answers will most likely be easier to work with.
You could use regular expressions to find numbers.
import re
s = 'A067'
s = re.findall(r'\d+', s)  # finds all runs of digits in the string
n = int(s[0])  # takes the first number; raises IndexError if there are none, so check the list first
print(n)
Here's some example output of findall with different strings
>>> a = re.findall(r'\d+', 'A067')
>>> a
['067']
>>> a = re.findall(r'\d+', 'A067 B67')
>>> a
['067', '67']
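The "simple check" for the no-number case could look like the sketch below. The helper name first_int and the 'N/A' sample cell are assumptions added for illustration.

```python
import re

def first_int(value, default=None):
    """Return the first run of digits in value as an int, or default."""
    digits = re.findall(r'\d+', value)
    return int(digits[0]) if digits else default

# 'N/A' is an invented cell that contains no digits.
cells = ['A067', 'A002', 'A104', 'N/A']
print([first_int(c) for c in cells])  # [67, 2, 104, None]
```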
You can use a named group with the search method of a compiled regex from the re module.
import re
line = 'A067'
regex = re.compile(r"(?P<numbers>\d+)")
matcher = regex.search(line)
if matcher:
    numbers = int(matcher.groupdict()["numbers"])  # the digits captured by the named group
import string
s = 'A067'
print(int(s.strip(string.ascii_letters)))

How to read only number from a specific line using python script

How can I read only the number from a specific line using a Python script? For example, from
"1009 run test jobs" I should read only the number "1009" instead of "1009 run test jobs".
Or, if your number always comes first: int(line.split()[0])
A simple regexp should do:
import re
match = re.match(r"(\d+)", "1009 run test jobs")
if match:
    number = match.group()
https://docs.python.org/3/library/re.html
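As a sketch of how the re.match approach might be packaged, the hypothetical helper below (leading_number is not a name from the answers) returns an int and tolerates lines that do not start with a number.

```python
import re

def leading_number(line):
    """Return the integer at the start of line, or None if there is none."""
    # re.match only matches at the beginning of the string.
    match = re.match(r"\s*(\d+)", line)
    return int(match.group(1)) if match else None

print(leading_number("1009 run test jobs"))  # 1009
print(leading_number("run 1009 test jobs"))  # None
```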
Use a regular expression:
>>> import re
>>> x = "1009 run test jobs"
>>> re.sub("[^0-9]", "", x)
'1009'
>>> re.sub(r"\D", "", x)  # better way
'1009'
Or a simple check for whether there are numbers in the string:
[int(s) for s in text.split() if s.isdigit()]
Where text is your string of text.
Pretty sure there is a "more pythonic" way, but this works for me:
s = 'teststri3k2k3s21k'
outs = ''
for i in s:
    try:
        numbr = int(i)  # raises ValueError for non-digit characters
        outs += i
    except ValueError:
        pass
print(outs)
If the number is always at the beginning of your string, you might consider something like outstring = instring[0:3] (adjust the end index to the length of the number).
You can do it with a regular expression. That's very easy:
import re
regularExpression = r"[^\d-]*(-?[0-9]+).*"
line = "some text -123 some text"
m = re.search(regularExpression, line)
if m:
    print(m.groups()[0])
This regular expression extracts the first number in a text. It treats '-' as part of the number. If you don't want that, change the regular expression to this one: r"[^\d-]*([0-9]+).*"
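The effect of including or leaving out the sign can also be seen with findall and a simpler pattern. This sketch uses a different, shorter regex than the answer above, and the extra "45" in the sample line is added for illustration.

```python
import re

line = "some text -123 some text 45"

signed = [int(n) for n in re.findall(r"-?\d+", line)]   # keeps a leading '-'
unsigned = [int(n) for n in re.findall(r"\d+", line)]   # ignores the sign

print(signed)    # [-123, 45]
print(unsigned)  # [123, 45]
```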

Python - Max index value of a string

I have a list of strings and I would like to extract "000000_5.612230" from:
A = '/calibration/test_min000000_5.612230.jpeg'
As the length of the strings may vary, I tried tracking the position of the "n" of "min". I tried to get the right index with:
print sorted(A, key=len).index('n')
But I got "11", which corresponds to the "n" of "calibration". I would like to know how to get the maximum index value in the string.
It is difficult to answer since you don't specify which part of the filename remains constant and which is subject to change. Is it always a jpeg? Is the number always the last part? Is it always preceded by '_min'?
In any case, I would suggest using a regex instead:
import re
A = '/calibration/test_min000000_5.612230.jpeg'
p = re.compile(r'.*min([_\d\.]*)\.jpeg')
value = p.search(A).group(1)
print(value)
Output:
000000_5.612230
Note that this snippet assumes that a match is always found. If the filename doesn't contain the pattern, p.search(...) will return None and an exception will be raised, so you should check for that case.
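A version with that check might look like the sketch below. STAMP_RE and extract_stamp are hypothetical names, the pattern is slightly simplified, and it assumes the file always ends in ".jpeg".

```python
import re

# Simplified variant of the pattern above, anchored at the end of the path.
STAMP_RE = re.compile(r'min([_\d.]+)\.jpeg$')

def extract_stamp(path):
    """Return the part between 'min' and '.jpeg', or None if absent."""
    m = STAMP_RE.search(path)
    return m.group(1) if m else None

print(extract_stamp('/calibration/test_min000000_5.612230.jpeg'))  # 000000_5.612230
print(extract_stamp('/calibration/readme.txt'))                    # None
```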
You can use the re module to do that, for example:
import re
A = '/calibration/test_min000000_5.612230.jpeg'
text = re.findall(r'\d.*\d', A)
Now text is a list. If you print it, the output will look like this: ['000000_5.612230']
So to extract the value, index into the list (or loop over it):
import re
A = '/calibration/test_min000000_5.612230.jpeg'
text = re.findall(r'\d.*\d', A)
print(text[0])
String slicing seems like a good solution for this
>>> A = '/calibration/test_min000000_5.612230.jpeg'
>>> start = A.index('min') + len('min')
>>> end = A.index('.jpeg')
>>> A[start:end]
'000000_5.612230'
Avoids having to import re
Try this (if extension is always '.jpeg'):
A.split('test_min')[1][:-5]
If your string is regular at the end, you can use negative indices to slice the string:
>>> a = '/calibration/test_min000000_5.612230.jpeg'
>>> a[-20:-5]
'000000_5.612230'
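To answer the literal question about the highest index of a character, str.rindex() returns the position of its last occurrence. The sketch below assumes no "n" appears after "min" (true for this filename and for the ".jpeg" extension).

```python
A = '/calibration/test_min000000_5.612230.jpeg'

last_n = A.rindex('n')                # index of the last 'n' in the string
value = A[last_n + 1:-len('.jpeg')]   # text between that 'n' and the extension

print(last_n)  # 20
print(value)   # 000000_5.612230
```

Like index(), rindex() raises ValueError when the character is missing; rfind() returns -1 instead.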

python regex with repeating subpattern

I am wondering if there is a 'smart' way (one regex expression) to extract IDs from the following paragraph:
...
imgList = '9/optimized/1260089_fpx.tif,0/optimized/1260090_fpx.tif';
...
The result should be a list containing 1260089 and 1260090. The count of the IDs might be up to 10.
I need something like:
re.findall('imgList = (some expression)', string)
Any ideas?
The best approach is a single regex that finds all the numbers, using re.findall:
>>> imgList = '9/optimized/1260089_fpx.tif,0/optimized/1260090_fpx.tif'
>>> import re
>>> re.findall('optimized/([0-9]*)_fpx', imgList)
['1260089', '1260090']
You could of course make the regex stronger, but if the data is as you indicated, this should suffice.
import re
s = '9/optimized/1260089_fpx.tif,0/optimized/1260090_fpx.tif'
print(re.findall(r'(\d+)_fpx\.tif', s))
If the optimized/ and _fpx parts are not guaranteed and the ID is between 7 and 10 digits long,
you could do something like
import re
re.findall(r'\d{7,10}', imgList)
This finds runs of 7 to 10 digits in the string, so IDs with fewer than 7 digits are excluded. Note that a run longer than 10 digits is not skipped: its first 10 digits are matched.
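One caveat: a bare \d{7,10} also matches the first 10 digits of a longer run. If that matters, lookarounds can enforce both limits. This is a sketch; the 12-digit ID in the sample string is invented to show the difference.

```python
import re

# The second ID is deliberately 12 digits long.
text = '9/optimized/1260089_fpx.tif,0/optimized/123456789012_fpx.tif'

loose = re.findall(r'\d{7,10}', text)
# (?<!\d) and (?!\d) require that no digit sits directly before or after.
strict = re.findall(r'(?<!\d)\d{7,10}(?!\d)', text)

print(loose)   # ['1260089', '1234567890']
print(strict)  # ['1260089']
```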
import re
imgList = '9/optimized/1260089_fpx.tif,0/optimized/1260090_fpx.tif'
re.findall(r'[0-9]{7}', imgList)
['1260089', '1260090']
This only works for your situation, where every ID has exactly 7 digits.

How do I extract some string from a long string in Python?

I have a lot of long strings - not all of them have the same length and content, so that's why I can't use indices - and I want to extract a string from all of them. This is what I want to extract:
http://www.someDomainName.com/anyNumber
SomeDomainName doesn't contain any numbers and anyNumber is different in each long string. The code should extract the desired string from any long string and should take into account spaces and any other odd content that might appear in it. That should be possible with regex, right? Could anybody help me with this? Thank you.
Update: I should have said that www. and .com are always the same, and so is someDomainName. But there's another http://www. in the string.
import re
results = re.findall(r'\bhttp://www\.someDomainName\.com/\d+\b', long_string)
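As a quick check of that pattern against the situation from the update (a second http://www. URL in the same string), here is a sketch with an invented long_string:

```python
import re

# Invented input: one unrelated URL, irregular spacing, then the wanted URL.
long_string = ("see http://www.other.org/1 and   "
               "http://www.someDomainName.com/42, more text")

results = re.findall(r'\bhttp://www\.someDomainName\.com/\d+\b', long_string)
print(results)  # ['http://www.someDomainName.com/42']
```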
>>> import re
>>> pattern = re.compile("(http://www\\.)(\\w*)(\\.com/)(\\d+)")
>>> matches = pattern.search("http://www.someDomainName.com/2134")
>>> if matches:
...     print(matches.group(0))
...     print(matches.group(1))
...     print(matches.group(2))
...     print(matches.group(3))
...     print(matches.group(4))
...
http://www.someDomainName.com/2134
http://www.
someDomainName
.com/
2134
In the above pattern, there are 5 groups available:
Group 0 is the complete string that is matched.
The rest are numbered in the order of the opening brackets you see, so the one you are looking for is the second one, (\\w*).
If you want, you can capture only the part of the string you are interested in: remove the brackets from the parts of the pattern that you don't need and keep only (\\w*).
>>> pattern = re.compile("http://www\\.(\\w*)\\.com/\\d+")
>>> matches = pattern.search("http://www.someDomainName.com/2134")
>>> if matches:
...     print(matches.group(1))
...
someDomainName
In the above example, groups 2, 3 and 4 from the previous example don't exist, as we have captured only one group. Group 0 is always available: it is the complete string that matched.
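If you would rather keep the brackets for readability, non-capturing groups, written (?:...), are another option: they group without taking a number. This sketch is a variation on the pattern above, not part of the original answer.

```python
import re

# (?:...) groups are not numbered, so only the domain and the number count.
pattern = re.compile(r"(?:http://www\.)(\w*)(?:\.com/)(\d+)")
m = pattern.search("http://www.someDomainName.com/2134")

print(m.group(0))  # http://www.someDomainName.com/2134
print(m.groups())  # ('someDomainName', '2134')
```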
Yeah, your simplest bet is regex. Here's something that will probably get the job done:
import re
matcher = re.compile(r'www\.(.+)\.com/(.+)')
matches = matcher.search(yourstring)
if matches:
    str1, str2 = matches.groups()
If you are sure that there are no dots in SomeDomainName, you can just take the first occurrence of the string ".com/" and take everything after it.
This avoids the use of regexes, which are harder to maintain.
exp = 'http://www.aejlidjaelidjl.com/alieilael'
print(exp[exp.find('.com/') + 5:])
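One thing to watch with find(): it returns -1 when ".com/" is missing, and the slice then silently returns the wrong text. A sketch of a safer variant uses str.partition(); the helper name after_marker is hypothetical.

```python
def after_marker(text, marker='.com/'):
    """Return everything after the first marker, or None if it is absent."""
    head, sep, tail = text.partition(marker)
    return tail if sep else None  # sep is '' when the marker was not found

print(after_marker('http://www.aejlidjaelidjl.com/alieilael'))  # alieilael
print(after_marker('no marker here'))                           # None
```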
