I'm trying to pull a number out of a copy of an HTML page that I fetched with urllib.request.
I've tried a few different regex patterns but keep getting None as the output, so I'm clearly not writing the pattern correctly, and I can't get it to work.
Below is a small part of the HTML I have in the string:
</ul>\n \n <p>* * * * *</p>\n -->\n \n <b>DistroWatch database summary</b><br/>\n <ul>\n <li>Number of all distributions in the database: 926<br/>\n <li>Number of <a href="search.php?status=Active">
I'm trying to get just the 926 out of the string. My code is below, and I can't figure out what I'm doing wrong:
import urllib.request
import re
page = urllib.request.urlopen('http://distrowatch.com/weekly.php?issue=current')
#print(page.read())
print(page.read())
pageString = str(page.read())
#print(pageString)
DistroCount = re.search('^all distributions</a> in the database: ....<br/>\n$', pageString)
print(DistroCount)
Any help, pointers or resource suggestions would be much appreciated.
You can use BeautifulSoup to convert HTML to text, and then apply a simple regex to extract a number after a hardcoded string:
import urllib.request, re
from bs4 import BeautifulSoup
page = urllib.request.urlopen('http://distrowatch.com/weekly.php?issue=current')
html = page.read()
soup = BeautifulSoup(html, 'lxml')
text = soup.get_text()
m = re.search(r'all distributions in the database:\s*(\d+)', text)
if m:
    print(m.group(1))
    # => 926
Here,
soup.get_text() converts HTML to plain text and keeps it in the text variable
The all distributions in the database:\s*(\d+) regex matches all distributions in the database:, then zero or more whitespace chars and then captures into Group 1 any one or more digits (with (\d+))
I think your problem is that you are reading the whole document into a single string but using "^" at the beginning of your regex and "$" at the end, so the regex can only match if the pattern spans the entire string.
Either drop ^ and $ (and \n as well…), or process your document line by line.
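To illustrate the point about anchors, here is a minimal sketch on a shortened, hypothetical copy of the HTML from the question (the variable names are made up for this example). Without the re.MULTILINE flag, ^ only matches at the very start of the string, so the anchored search returns None, while an unanchored pattern with a capture group finds the number:

```python
import re

# Shortened stand-in for the page contents (assumption: real newlines, not escaped ones)
page_string = (
    '</ul>\n<b>DistroWatch database summary</b><br/>\n<ul>\n'
    '<li>Number of all distributions in the database: 926<br/>\n'
)

# Anchored: ^ must match at the start of the whole string, which begins with "</ul>"
anchored = re.search(r'^all distributions in the database: ...<br/>\n$', page_string)

# Unanchored: search anywhere and capture the digits
unanchored = re.search(r'all distributions in the database:\s*(\d+)', page_string)

print(anchored)             # None
print(unanchored.group(1))  # 926
```

Note that str() applied to the bytes returned by page.read() produces a literal backslash followed by n rather than a newline, which is another reason to avoid matching on \n here.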
I am trying to use a regular expression to extract phone numbers from web pages. The problem I am facing is that unwanted ids and other elements of the web page get matched too. If anyone can suggest some improvements, it would be really helpful. Below are the code and the regular expression I am using in Python:
from urllib2 import urlopen as uReq
uClient = uReq(url)
page_html = uClient.read()
print re.findall(r"(\(?\d{3}\D{0,3}\d{3}\D{0,3}\d{4}).*?",page_html)
For most websites, the script picks up some page element values and is only sometimes accurate. Please suggest some modifications to the expression:
re.findall(r"(\(?\d{3}\D{0,3}\d{3}\D{0,3}\d{4}).*?",page_html)
My output for different URLs looks like this:
http://www.fraitagengineering.com/index.html
['(877) 424-4752']
http://hunterhawk.com/
['1481240672', '1481240643', '1479852632', '1478013441', '1481054486', '1481054560', '1481054598', '1481054588', '1476820246', '1481054521', '1481054540', '1476819829', '1481240830', '1479855986', '1479855990', '1479855994', '1479855895', '1476819760', '1476741750', '1476741750', '1476820517', '1479862863', '1476982247', '1481058326', '1481240672', '1481240830', '1513106590', '1481240643', '1479855986', '1479855990', '1479855994', '1479855895', '1479852632', '1478013441', '1715282331', '1041873852', '1736722557', '1525761106', '1481054486', '1476819760', '1481054560', '1476741750', '1481054598', '1476741750', '1481054588', '1476820246', '1481054521', '1476820517', '1479862863', '1481054540', '1476982247', '1476819829', '1481058326', '(925) 798-4950', '2093796260']
http://www.lbjewelrydesign.com/
['213-629-1823', '213-629-1823']
I want just phone numbers in the (000) 000-0000 (note that I have added a space after the parenthesis), (000)-000-0000 or 000-000-0000 format. Any suggestions appreciated. Please note that I have already referred to this link: Find phone numbers in python script
I need improvement in regex for my specific needs.
The following regular expression can be used to match the samples that you presented and other similar numbers:
(\([0-9]{3}\)[\s-]?|[0-9]{3}-)[0-9]{3}-[0-9]{4}
The following example script can be used to test positive and negative cases and to experiment with the regular expression:
import re

# Formats that should be accepted
positiveExamples = [
    '(000) 000-0000',
    '(000)-000-0000',
    '(000)000-0000',
    '000-000-0000'
]

# Formats that should be rejected
negativeExamples = [
    '000 000-0000',
    '000-000 0000',
    '000 000 0000',
    '000000-0000',
    '000-0000000',
    '0000000000'
]

reObj = re.compile(r"(\([0-9]{3}\)[\s-]?|[0-9]{3}-)[0-9]{3}-[0-9]{4}")

for example in positiveExamples:
    print('Asserting positive example: %s' % example)
    assert reObj.match(example)

for example in negativeExamples:
    print('Asserting negative example: %s' % example)
    assert reObj.match(example) is None
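One thing to watch when plugging the regular expression above into re.findall, as in the question: when a pattern contains a single capturing group, findall returns the contents of that group (here only the area-code prefix) rather than the whole match. A minimal sketch, assuming a non-capturing (?:...) variant of the same expression and a made-up sample_text string:

```python
import re

# Same expression as above, but with a non-capturing group so that
# findall returns the whole phone number instead of just the prefix
phone_re = re.compile(r"(?:\([0-9]{3}\)[\s-]?|[0-9]{3}-)[0-9]{3}-[0-9]{4}")

# Hypothetical text; the last number imitates the id-like values from the question
sample_text = "Call (925) 798-4950 or 213-629-1823. Ref id 1481240672."

numbers = phone_re.findall(sample_text)
print(numbers)  # ['(925) 798-4950', '213-629-1823']
```

The ten-digit id is skipped because the expression requires a hyphen before the last four digits.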
You can avoid searching inside ids, other attributes, or HTML markup altogether if you search only the plain text of the web page. You can get that text by running the page content through the BeautifulSoup HTML parser:
from urllib2 import urlopen as uReq
from bs4 import BeautifulSoup
page_text = BeautifulSoup(uReq(url), "html.parser").get_text()
Then, as Jake mentioned in comments, you can make your regular expression more reliable:
Find phone numbers in python script
I am trying to scrape a specific section of a web page and eventually calculate word frequency, but I am finding it difficult to get the entire text. As far as I understand from looking at the HTML code, my script omits the parts of that section that sit on a new line but have no <br> tag.
My code:
import urllib
from lxml import html as LH
import lxml
import requests
scripturl="http://www.springfieldspringfield.co.uk/view_episode_scripts.php?tv-show=the-sopranos&episode=s06e21"
scripthtml=urllib.urlopen(scripturl).read()
scripthtml=requests.get(scripturl)
tree = LH.fromstring(scripthtml.content)
script=tree.xpath('//div[@class="scrolling-script-container"]/text()')
print script
print type(script)
This is the output:
["\n\n\n\n \t\t\t ( radio clicks, \r music plays ) \r \r Disc jockey: \r
New York's classic rock \r q104.", '3.', '
\r \r Good morning.', " \r I'm jim kerr.",
' \r \r Coming up \r
When I iterate over the result, only the phrases that follow the \r and are followed by a comma or double comma are printed.
for res in script:
    print res
The output is:
q104.
3.
Good morning.
I'm jim kerr.
I am not confined to lxml, but because I am rather new, I am less familiar with other methods.
An lxml element has both a text and a tail attribute. You are searching for text, but if there is an HTML element embedded in the element (br, for example), text only covers the content up to that first child element; whatever follows the child is stored in the child's tail.
Try:
script = tree.xpath('//div[@class="scrolling-script-container"]')
print(" ".join(filter(None, (script[0].text, script[0].tail))))
This was bothering me, so I wrote out a solution:
import requests
import lxml
from lxml import etree
from io import StringIO

parser = etree.HTMLParser()
base_url = "http://www.springfieldspringfield.co.uk/view_episode_scripts.php?tv-show=the-sopranos&episode=s06e21"
resp = requests.get(base_url)
root = etree.parse(StringIO(resp.text), parser)
script = root.xpath('//div[@class="scrolling-script-container"]')

text_list = []
for elem in script:
    print(elem.attrib)
    if hasattr(elem, 'text'):
        text_list.append(elem.text)
    if hasattr(elem, 'tail'):
        text_list.append(elem.tail)

for elem in text_list:
    # only gets the first block of text before
    # it encounters a br tag
    print(elem)

for elem in script:
    # prints everything
    for sib in elem.iter():
        print(sib.attrib)
        if hasattr(sib, 'text'):
            print(sib.text)
        if hasattr(sib, 'tail'):
            print(sib.tail)
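The text/tail split is easier to see on a tiny example. The sketch below is only an illustration: it uses the standard library's xml.etree.ElementTree, which follows the same text/tail model as lxml, on a made-up, well-formed snippet rather than the real page. The class name "script" and the variable names are assumptions for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical snippet: text separated by <br/> elements inside one div
div = ET.fromstring(
    '<div class="script">first line<br/>second line<br/>third line</div>'
)

# .text holds only what comes before the first child element
first_block = div.text

# Text after each <br/> lives in that child's .tail
tails = [br.tail for br in div]

# itertext() walks text and tails in document order
all_text = " ".join(piece.strip() for piece in div.itertext())

print(first_block)  # first line
print(tails)        # ['second line', 'third line']
print(all_text)     # first line second line third line
```

lxml elements offer the same itertext() method, so the same loop should carry over to the parsed page.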
For a class, I have an exercise where I need to count the number of images on any given web page. I know that every image starts with <img, so I am using a regexp to try to locate them. But I keep getting a count of one, which I know is wrong. What is wrong with my code?
import sys
import urllib
import urllib.request
import re

img_pat = re.compile('<img.*>',re.I)

def get_img_cnt(url):
    try:
        w = urllib.request.urlopen(url)
    except IOError:
        sys.stderr.write("Couldn't connect to %s " % url)
        sys.exit(1)
    contents = str(w.read())
    img_num = len(img_pat.findall(contents))
    return (img_num)

print (get_img_cnt('http://www.americascup.com/en/schedules/races'))
Don't ever use regex for parsing HTML; use an HTML parser such as lxml or BeautifulSoup. Here's a working example of how to get the img tag count using BeautifulSoup and requests:
from bs4 import BeautifulSoup
import requests

def get_img_cnt(url):
    response = requests.get(url)
    # Name the parser explicitly so the result does not depend on what is installed
    soup = BeautifulSoup(response.content, 'html.parser')
    return len(soup.find_all('img'))

print(get_img_cnt('http://www.americascup.com/en/schedules/races'))
Here's a working example using lxml and requests:
from lxml import etree
import requests

def get_img_cnt(url):
    response = requests.get(url)
    parser = etree.HTMLParser()
    root = etree.fromstring(response.content, parser=parser)
    return int(root.xpath('count(//img)'))

print(get_img_cnt('http://www.americascup.com/en/schedules/races'))
Both snippets print 106.
Also see:
Python Regex - Parsing HTML
Python regular expression for HTML parsing (BeautifulSoup)
Hope that helps.
Ahhh regular expressions.
Your regex pattern <img.*> says "Find me something that starts with <img and stuff and make sure it ends with >".
Regular expressions are greedy, though; it'll fill that .* with literally everything it can while leaving a single > character somewhere afterwards to satisfy the pattern. In this case, it would go all the way to the end, </html>, and say "look! I found a > right there!"
You should come up with the right count by making .* non-greedy, like this:
<img.*?>
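The difference between the greedy and non-greedy patterns can be seen on a small snippet. This is only a sketch on a made-up html string, not the page from the question:

```python
import re

# Hypothetical one-line document with two images
html = '<p><img src="a.png"> text <img src="b.png"></p>'

# Greedy: .* runs to the last > on the line, so everything collapses into one match
greedy = re.findall(r'<img.*>', html, re.I)

# Non-greedy: .*? stops at the first > after each <img
lazy = re.findall(r'<img.*?>', html, re.I)

print(len(greedy))  # 1
print(len(lazy))    # 2
print(lazy)         # ['<img src="a.png">', '<img src="b.png">']
```

A count of one, as reported in the question, is what the greedy version produces when the whole page ends up on a single line.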
Your regular expression is greedy, so it matches much more than you want. I suggest using an HTML parser.
img_pat = re.compile('<img.*?>',re.I) will do the trick if you must do it the regex way. The ? makes it non-greedy.
A good website for checking what your regex matches on the fly: http://www.pyregex.com/
Learn more about regexes: http://docs.python.org/2/library/re.html
I wrote a simple script that just takes a webpage and extracts the contents of it to a tokenized list. However, I'm running into an issue where when I convert the BeautifulSoup object to a String, the UTF-8 characters for ",', etc. won't convert. Instead, they remain in the unicode format.
I'm defining the source as UTF-8 when I create the BeautifulSoup object, and I've even tried running a unicode conversion separately, but nothing works. Does anyone have any idea why this is happening?
from urllib2 import urlopen
from bs4 import BeautifulSoup
import nltk, re, pprint
url = "http://www.bloomberg.com/news/print/2013-07-05/softbank-s-21-6-billion-bid-for- sprint-approved-by-u-s-.html"
raw = urlopen(url).read()
soup = BeautifulSoup(raw, fromEncoding="UTF-8")
result = soup.find_all(id="story_content")
str_result = str(result)
notag = re.sub("<.*?>", " ", str_result)
output = nltk.word_tokenize(notag)
print(output)
The characters you're having trouble with aren't " (U+0022) and ' (U+0027), they're curly quotes “ (U+201C) and ” (U+201D) and ’ (U+2019). Convert those to their straight versions first, and you should get the results you're expecting:
raw = urlopen(url).read()
original = raw.decode('utf-8')
replacement = original.replace(u'\u201c', u'"').replace(u'\u201d', u'"').replace(u'\u2019', u"'")
soup = BeautifulSoup(replacement) # Don't need fromEncoding if we're passing in Unicode
That should get the quote characters into the form you're expecting.
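If more characters need mapping, chained replace calls get unwieldy. A possible alternative in Python 3 is str.translate with a mapping of code points; this is a sketch, and QUOTE_MAP and straighten_quotes are names invented for this example. The left single quote U+2018 is an addition not mentioned above.

```python
# Map curly quotes to their straight equivalents (keys are Unicode code points)
QUOTE_MAP = {
    0x201C: '"',  # left double quotation mark
    0x201D: '"',  # right double quotation mark
    0x2018: "'",  # left single quotation mark (added here as an assumption)
    0x2019: "'",  # right single quotation mark
}

def straighten_quotes(text):
    """Return text with curly quotes replaced by straight ones."""
    return text.translate(QUOTE_MAP)

sample = '\u201cHi\u201d it\u2019s'
print(straighten_quotes(sample))  # "Hi" it's
```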
I am trying to request a web page via urllib2 and parse it using a regex.
Here is my code:
def Get(url):
    request = urllib2.Request(url)
    page = urlOpener.open(request)
    return page.read()
page = Get(myurl)
#page = "<html>.....</html>" #local string for test
pattern = re.compile(r'^\s*(<tr>$\s*<td height="25.*?</tr>)$', re.M | re.I | re.DOTALL)
for task in pattern.findall(taskListPage):
If I use a local string (the same as Get(myurl)'s result) for testing, the pattern works, but if I use Get(myurl), the pattern does not work.
I would be grateful if someone could tell me why.
Valid reservations about using regex on html aside, try this regex instead:
(<tr>\s*<td height="25.*?</tr>)
You were only finding matches that ended at the end of input ($), and you had problematic terms at the front of the regex.
This match is brittle - let's hope the web guy doesn't change the height of the rows...
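One possible reason a pattern works on a pasted local string but not on the fetched page is line endings. This is an assumption, since the thread does not show the raw response. In Python, $ with re.M matches only just before a \n, so a \r\n line ending breaks the <tr>$ part. A sketch with made-up page strings:

```python
import re

# Original pattern from the question and the relaxed one suggested above
strict = re.compile(r'^\s*(<tr>$\s*<td height="25.*?</tr>)$', re.M | re.I | re.DOTALL)
relaxed = re.compile(r'(<tr>\s*<td height="25.*?</tr>)', re.I | re.DOTALL)

# Hypothetical page content with Unix and with Windows-style line endings
unix_page = '<table>\n<tr>\n<td height="25">task</td>\n</tr>\n</table>\n'
dos_page = unix_page.replace('\n', '\r\n')

print(len(strict.findall(unix_page)))   # 1
print(len(strict.findall(dos_page)))    # 0, because \r sits between <tr> and the newline
print(len(relaxed.findall(dos_page)))   # 1
```

Dropping the anchors, as in the suggested regex, sidesteps the issue because \s* absorbs both \r and \n.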