Using re.findall() in Python for Web Crawling - python

I am trying to teach myself Python by writing a very simple web crawler with it.
The code for it is here:
#!/usr/bin/python
import sys, getopt, time, urllib, re

LINK_INDEX = 1
links = [sys.argv[len(sys.argv) - 1]]
visited = []
politeness = 10
maxpages = 20

def print_usage():
    print "USAGE:\n./crawl [-politeness <seconds>] [-maxpages <pages>] seed_url"

def parse_args():
    # code for parsing arguments (works fine so didn't need to be included here)
    pass

def crawl():
    global links, visited
    url = links.pop()
    visited.append(url)
    print "\ncurrent url: %s" % url
    response = urllib.urlopen(url)
    html = response.read()
    html = html.lower()
    raw_links = re.findall(r'<a href="[\w\.-]+"', html)
    print "found: %d" % len(raw_links)
    for raw_link in raw_links:
        temp = raw_link.split('"')
        if temp[LINK_INDEX] not in visited and temp[LINK_INDEX] not in links:
            links.append(temp[LINK_INDEX])
    print "\nunvisited:"
    for link in links:
        print link
    print "\nvisited:"
    for link in visited:
        print link

parse_args()
while len(visited) < maxpages and len(links) > 0:
    crawl()
    time.sleep(politeness)
print "politeness = %d, maxpages = %d" % (politeness, maxpages)
I created a small test network of about 10 pages in the same working directory that all link together in various ways, and the crawler seems to work fine on it, but when I send it out onto the actual internet by itself, it is unable to parse links from the files it gets.
It is able to get the HTML code fine, because I can print that out, but it seems that the re.findall() part is not doing what it is supposed to, because the links list never gets populated. Have I maybe written my regex wrong? It worked fine to find strings like <a href="test02.html" and then parse the link from that, but for some reason it isn't working for actual web pages. Perhaps the http part is throwing it off?
I've never used regex with Python before, so I'm pretty sure that this is the problem. Can anyone give me any idea how to express the pattern I am looking for better? Thanks!

The problem is with your regex. There are a whole bunch of ways I could write a valid HTML anchor that your regex wouldn't match. For example, there could be extra whitespace, or line breaks in it, and there are other attributes that could exist that you haven't taken into account. Also, you take no account of different case. For example:
<a  href = "foo">foo</a>
<A HREF="foo">foo</A>
<a class="bar" href="foo">foo</a>
None of these would be matched by your regex.
You probably want something more like this:
<a[^>]*href="(.*?)"
This will match an anchor tag start, followed by any characters other than > (so that we're still matching inside the tag). This might be things like a class or id attribute. The value of the href attribute is then captured in a capture group, which you can extract by
match.group(1)
The match for the href value is also non-greedy. This means it will match the smallest match possible. This is because otherwise if you have other tags on the same line, you'll match beyond what you want to.
Finally, you'll need to add the re.I flag to match in a case insensitive way.
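For example, a minimal sketch of how that might look (the html string here is invented):
import re

html = '<A class="bar" HREF="http://example.com/page.html">foo</A>'

# [^>]* lets other attributes appear before href; (.*?) captures the URL
# non-greedily; re.I makes the whole pattern case-insensitive.
for match in re.finditer(r'<a[^>]*href="(.*?)"', html, re.I):
    print match.group(1)   # prints: http://example.com/page.html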

Your regexp doesn't match all valid values for the href attribute, such as paths with slashes, and so on. Using [^"]+ (anything other than the closing double quote) instead of [\w\.-]+ would help, but it doesn't really matter because… you should not parse HTML with regexps to begin with.
Lev already mentioned BeautifulSoup; you could also look at lxml. It will work better than any hand-crafted regexp you could write.
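For illustration, a rough sketch of the BeautifulSoup approach, using the old BeautifulSoup 3 import that appears elsewhere on this page (the URL is just a placeholder):
import urllib
from BeautifulSoup import BeautifulSoup   # BeautifulSoup 3, as in the other examples here

html = urllib.urlopen('http://example.com/').read()
soup = BeautifulSoup(html)

# findAll('a', href=True) returns every <a> tag that has an href attribute,
# regardless of case, whitespace, line breaks or extra attributes.
for anchor in soup.findAll('a', href=True):
    print anchor['href']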

You probably want this:
raw_links = re.findall(r'<a href="(.+?)"', html)
Use the brackets to indicate what you want returned, otherwise you get the whole match including the <a href=... bit. Now you get everything until the closing quote mark, due to the use of a non-greedy +? operator.
A more discriminating filter might be:
raw_links = re.findall(r'<a href="([^">]+?)"', html)
This matches anything except a quote or a closing angle bracket.
These simple REs will match URLs that have been commented out, URL-like literal strings inside bits of JavaScript, and so on, so be careful about how you use the results!
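A quick made-up illustration of what the capture group changes:
import re

html = '<a href="page1.html">one</a> <a href="page2.html">two</a>'

# Without parentheses, findall() returns the whole match, including the <a href=" part.
print re.findall(r'<a href=".+?"', html)
# ['<a href="page1.html"', '<a href="page2.html"']

# With parentheses, it returns only the captured href values.
print re.findall(r'<a href="(.+?)"', html)
# ['page1.html', 'page2.html']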

Related

Checking Text for The Presence of a Large Set of Keywords

Suppose that I want to check a webpage for the presence of an arbitrarily large number of keywords. How would I go about doing that?
I've tested the xpath selector if response.xpath('//*[text()[contains(.,"red") or contains(.,"blue") or contains(.,"green")]]'): and it works as expected. The actual set of keywords that I'm interested in checking for is too large to conveniently enter by hand, as above. What I'm interested in is a way to automate that process by generating my selector based on the contents of a file filled with keywords.
Starting from a text file with each keyword on its own line, how could I open that file and use it to check whether the keywords it contains appear in the text elements of a given xpath?
I used the threads Xpath contains value A or value B and XPATH Multiple Element Filters to come up with my manual entry solution, but haven't found anything that addresses automation.
Clarification
I'm not interested in just checking to see whether a given xpath contains any of the keywords provided in my list. I also want to use their presence as a precondition for scraping content from the page. The manual system that I've tested works as follows:
item_info = ItemLoader(item=info_categories(), response=response)
if response.xpath('//*[text()[contains(.,"red") or contains(.,"blue") or contains(.,"green")]]'):
    item_info.add_xpath('title', './/some/x/path/text()')
    item_info.add_xpath('description', './/some/other/x/path/text()')
    return item_info.load_item()
While @alecxe's solution allows me to check the text of a page against a keyword set, switching from 'print' to 'if' and attempting to control the information I extract returns SyntaxError: invalid syntax. Can I combine the convenience of reading in keywords from a list with the functionality of entering them manually?
Update—exploring Frederic Bazin's regex solution
Over the past few days I've been working with a regex approach to limiting my parse. My code, which adopts Frederic's proposal with a few modifications to account for errors, is as follows:
item_info = ItemLoader(item=info_categories(), response=response)
keywords = '|'.join(re.escape(word.strip()) for word in open('keys.txt'))
r = re.compile('.*(%s).*' % keywords, re.MULTILINE|re.UNICODE)
if r.match(response.body_as_unicode()):
    item_info.add_xpath('title', './/some/x/path/text()')
    item_info.add_xpath('description', './/some/other/x/path/text()')
    return item_info.load_item()
This code runs without errors, but Scrapy reports 0 items crawled and 0 items scraped, so something is clearly going wrong.
I've attempted to debug by running this from the Scrapy shell. My results there suggest that the keywords and r steps are both behaving. If I define and call keywords using the method above for a .txt file containing the words red, blue, and green, I receive in response 'red|blue|green'. Defining and calling r as above gives me <_sre.SRE_Pattern object at 0x17bc980>, which I believe is the expected response. When I run r.match(response.body_as_unicode()), however, I receive no response, even on pages that I know contain one or more of my keywords.
Does anyone have thoughts as to what I'm missing here? As I understand it, whenever one of my keywords appears in the response.body, a match should be triggered and Scrapy should proceed to extract information from that response using the xpaths I've defined. Clearly I'm mistaken, but I'm not sure how or why.
Solution?
I think I may have this problem figured out at last. My current conclusion is that the difficulty was caused by performing r.match on the response.body_as_unicode. The Python re documentation says of match:
If zero or more characters at the beginning of string match the regular expression pattern, return a corresponding MatchObject instance. Return None if the string does not match the pattern; note that this is different from a zero-length match.
Note that even in MULTILINE mode, re.match() will only match at the beginning of the string and not at the beginning of each line.
That behaviour was not appropriate to my situation. I'm interested in identifying and scraping information from pages that contain my keywords anywhere within them, not those that feature one of my keywords as the first item on the page. To accomplish that task, I needed re.search, which scans through a string until it finds a match for the regex pattern generated by compile and returns a MatchObject, or else returns None when no match for the pattern.
My current (working!) code follows below. Note that in addition to the switch from match to search I've added a little bit to my definition of keywords to limit matches to whole words.
item_info = ItemLoader(item=info_categories(), response=response)
keywords = '|'.join(r"\b" + re.escape(word.strip()) + r"\b" for word in open('keys.txt'))
r = re.compile('.*(%s).*' % keywords, re.MULTILINE|re.UNICODE)
if r.search(response.body_as_unicode()):
    item_info.add_xpath('title', './/some/x/path/text()')
    item_info.add_xpath('description', './/some/other/x/path/text()')
    return item_info.load_item()
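A quick way to see the difference between the two calls (the keyword pattern here is made up):
import re

r = re.compile(r'\bred\b')
text = u'the dress is red'

print r.match(text)    # None - match() only tries the very start of the string
print r.search(text)   # <_sre.SRE_Match object ...> - search() scans the whole string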
Regex is probably the fastest way to run the tests on a large number of pages.
import re
keywords = '|'.join(re.escape(word.strip()) for word in open('keywords.txt'))
r = re.compile('.*(%s).*' % keywords, re.MULTILINE|re.UNICODE)
if r.match(response.body_as_unicode()):
Generating an XPath expression from multiple keywords could work, but it adds the extra CPU load (typically ~100 ms) of parsing the page as XML before running the XPath.
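For what it's worth, a rough sketch of generating such an XPath expression from the keyword file, assuming it runs inside the Scrapy callback and reusing the keys.txt name from the question (hypothetical code, and it assumes the keywords contain no double quotes):
# Build '//*[text()[contains(.,"red") or contains(.,"blue") or ...]]' from the file.
with open('keys.txt') as f:
    conditions = ' or '.join('contains(.,"%s")' % word.strip() for word in f if word.strip())
xpath_expr = '//*[text()[%s]]' % conditions

if response.xpath(xpath_expr):
    pass  # at least one keyword appears somewhere in the page text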
You can also check if a keyword is inside the response.body:
source = response.body
with open('input.txt') as f:
    for word in f:
        print word, word.strip() in source
Or, using any():
with open('input.txt') as f:
    print any(word.strip() in source for word in f)

extracting facebook page from html using regex

I am trying to extract the address of a website's Facebook page by running a regular expression search on its HTML.
Usually the link appears as
<a href="http://www.facebook.com/somepage">Facebook</a>
but sometimes the address will be http://www.facebook.com/some.other
and sometimes with numbers
at the moment the regex that I have is
'(facebook.com)\S\w+'
but it won't catch the last 2 possibilities.
What is it called when I want the regex to search for something but not fetch it? (For instance, I want the regex to match the www.facebook.com part but not have that part in the result, only the part that comes after it.)
Note: I use Python with re and urllib2.
It seems to me your main issue is that you don't understand enough regex.
fb_re = re.compile(r'www.facebook.com([^"]+)')
then simply:
results = fb_re.findall(url)
Why this works:
In regular expressions, the part in the parentheses () is what is captured. You were putting the www.facebook.com part in the parentheses, and so it was not capturing anything else.
Here I used a character set [] to match anything in it, used the ^ operator to negate that (meaning anything not in the set), and then gave it the " character, so it will match anything that comes after www.facebook.com until it reaches a " and then stop.
Note: this catches Facebook links which are embedded in markup. If the Facebook link is simply on the page in plain text, you can use:
fb_re = re.compile(r'www.facebook.com(\S+)')
which means to grab any non-white-space character, so it will stop once it runs out of white-space.
if you are worried about links ending in periods, you can simply add:
fb_re = re.compile(r'www.facebook.com(\S+)\.\s')
which tells it to search for the same thing as above, but stop when it gets to the end of a sentence: a . followed by whitespace such as a space or a newline. This way it will still grab links like /some.other, but when you have something like /some.other. it will drop the trailing .
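A small made-up example of the capturing behaviour described above:
import re

html = 'Like us: <a href="http://www.facebook.com/some.other">Facebook</a>'

# [^"]+ captures everything after www.facebook.com up to the closing double quote.
fb_re = re.compile(r'www.facebook.com([^"]+)')
print fb_re.findall(html)   # ['/some.other']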
If I assume correctly, the URL is always in double quotes, right?
re.findall(r'"http://www.facebook.com(.+?)"',url)
Overall, trying to parse HTML with regex is a bad idea. I suggest you use an HTML parser like lxml.html to find the links and then use urlparse:
>>> from urlparse import urlparse # in 3.x use from urllib.parse import urlparse
>>> url = 'http://www.facebook.com/some.other'
>>> parse_object = urlparse(url)
>>> parse_object.netloc
'www.facebook.com'
>>> parse_object.path
'/some.other'
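A hedged sketch of that combination (the HTML snippet is invented and lxml must be installed):
import lxml.html
from urlparse import urlparse   # in 3.x: from urllib.parse import urlparse

html = '<p>Find us on <a href="http://www.facebook.com/some.other">Facebook</a></p>'
doc = lxml.html.fromstring(html)

# Pull out every href with XPath, then let urlparse split it reliably.
for href in doc.xpath('//a/@href'):
    parsed = urlparse(href)
    if parsed.netloc.endswith('facebook.com'):
        print parsed.path   # prints: /some.other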

python match image tags from large content string using regular expressions

I'm really a noob with regular expressions; I tried to do this on my own, but I couldn't understand from the manuals how to approach it. I'm trying to find all img tags in a given piece of content. I wrote the below, but it's returning None:
content = i.content[0].value
prog = re.compile(r'^<img')
result = prog.match(content)
print result
any suggestions?
Multipurpose solution:
image_re = re.compile(r"""
    (?P<img_tag><img)\s+     # tag starts
    [^>]*?                   # other attributes
    src=                     # start of src attribute
    (?P<quote>["'])?         # optional opening quote
    (?P<image>[^"'>]+)       # image file name
    (?(quote)(?P=quote))     # closing quote
    [^>]*?                   # other attributes
    >                        # end of tag
    """, re.IGNORECASE|re.VERBOSE)  # re.VERBOSE lets you define the regex in a readable format with comments

image_tags = []
for match in image_re.finditer(content):
    image_tags.append(match.group("img_tag"))

# print found image tags
for image_tag in image_tags:
    print image_tag
As you can see in the regex definition, it contains
(?P<group_name>regex)
It allows you to access found groups by group_name, and not by number. It is for readability. So, if you want to show all src attributes of img tags, then just write:
for match in image_re.finditer(content):
    image_tags.append(match.group("image"))
After this, the image_tags list will contain the src values of the image tags.
Also, if you need to parse HTML, there are tools that were designed exactly for such purposes, for example lxml, which uses XPath expressions.
I don't know Python but assuming it uses normal Perl compatible regular expressions...
You probably want to look for "<img[^>]+>" which is: "<img", followed by anything that is not ">", followed by ">". Each match should give you a complete image tag.
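In Python that could look something like this (the content string is made up):
import re

content = 'text <IMG src="a.png" alt="A"> more text <img class="b" src="b.jpg">'

# "<img", then anything that is not ">", then ">", matched case-insensitively.
print re.findall(r'<img[^>]+>', content, re.IGNORECASE)
# ['<IMG src="a.png" alt="A">', '<img class="b" src="b.jpg">']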

replace URLs in text with links to URLs

Using Python I want to replace all URLs in a body of text with links to those URLs, like what Gmail does.
Can this be done in a one liner regular expression?
Edit: by body of text I just meant plain text - no HTML
You can load the document up with a DOM/HTML parsing library ( see html5lib ), grab all text nodes, match them against a regular expression and replace the text nodes with a regex replacement of the URI with anchors around it using a PCRE such as:
/(https?:[;\/?\\#&=+$,\[\]A-Za-z0-9\-_\.\!\~\*\'\(\)%][\;\/\?\:\#\&\=\+\$\,\[\]A-Za-z0-9\-_\.\!\~\*\'\(\)%#]*|[KZ]:\\*.*\w+)/g
I'm quite sure you can scour around and find some sort of utility that does this; I can't think of any off the top of my head though.
Edit: Try using the answers here: How do I get python-markdown to additionally "urlify" links when formatting plain text?
import re
urlfinder = re.compile("([0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}|(((news|telnet|nttp|file|http|ftp|https)://)|(www|ftp)[-A-Za-z0-9]*\\.)[-A-Za-z0-9\\.]+)(:[0-9]*)?/[-A-Za-z0-9_\\$\\.\\+\\!\\*\\(\\),;:#&=\\?/~\\#\\%]*[^]'\\.}>\\),\\\"]")
def urlify2(value):
    return urlfinder.sub(r'<a href="\1">\1</a>', value)
call urlify2 on a string and I think that's it if you aren't dealing with a DOM object.
I hunted around a lot, tried these solutions and was not happy with their readability or features, so I rolled the following:
import re

_urlfinderregex = re.compile(r'http([^\.\s]+\.[^\.\s]*)+[^\.\s]{2,}')

def linkify(text, maxlinklength):
    def replacewithlink(matchobj):
        url = matchobj.group(0)
        text = unicode(url)
        if text.startswith('http://'):
            text = text.replace('http://', '', 1)
        elif text.startswith('https://'):
            text = text.replace('https://', '', 1)
        if text.startswith('www.'):
            text = text.replace('www.', '', 1)
        if len(text) > maxlinklength:
            halflength = maxlinklength / 2
            text = text[0:halflength] + '...' + text[len(text) - halflength:]
        return '<a class="comurl" href="' + url + '" target="_blank" rel="nofollow">' + text + '<img class="imglink" src="/images/linkout.png"></a>'

    if text != None and text != '':
        return _urlfinderregex.sub(replacewithlink, text)
    else:
        return ''
You'll need to get a link out image, but that's pretty easy. This is specifically for user submitted text like comments which I assume is usually what people are dealing with.
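Hypothetical usage, with an arbitrary URL and length limit:
print linkify(u'More details at http://www.example.com/some/long/path/page.html', 20)
# Prints the text with the URL wrapped in an <a class="comurl" ...> anchor and the
# visible link text shortened to roughly 20 characters.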
/\w+:\/\/[^\s]+/
When you say "body of text" do you mean a plain text file, or body text in an HTML document? If you want the HTML document, you will want to use Beautiful Soup to parse it; then, search through the body text and insert the tags.
Matching the actual URLs is probably best done with the urlparse module. Full discussion here: How do you validate a URL with a regular expression in Python?
Gmail is a lot more open when it comes to URLs, but it is not always right either. For example, it will make www.a.b into a hyperlink as well as http://a.b, but it often fails because of wrapped text and uncommon (but valid) URL characters.
See Appendix A, Collected BNF for URI (RFC 2396), for the syntax, and use that to build a reasonable regular expression that also considers what surrounds the URL. You'd be well advised to consider a couple of scenarios where URLs might end up.

Regex Matching Error

I am new to Python (I don't have any programming training either), so please keep that in mind as I ask my question.
I am trying to search a retrieved webpage and find all links using a specified pattern. I have done this successfully in other scripts, but I am getting an error that says
raise error, v # invalid expression
sre_constants.error: multiple repeat
I have to admit I do not know why, but again, I am new to Python and regular expressions. However, even when I don't use patterns and instead use a specific link (just to test the matching), I do not believe I return any matches (nothing is sent to the window when I print match.group(0)). The link I tested is commented out below.
Any ideas? It usually is easier for me to learn by example, but any advice you can give is greatly appreciated!
Brock
import urllib2
from BeautifulSoup import BeautifulSoup
import re
url = "http://forums.epicgames.com/archive/index.php?f-356-p-164.html"
page = urllib2.urlopen(url).read()
soup = BeautifulSoup(page)
pattern = r'(.?+) <i>((.?+) replies)'
#pattern = r'href="http://forums.epicgames.com/archive/index.php?t-622233.html">Gears of War 2: Horde Gameplay</a> <i>(20 replies)'
for match in re.finditer(pattern, page, re.S):
    print match.group(0)
That means your regular expression has an error.
(.?+)</a> <i>((.?+)
What does ?+ mean? Both ? and + are metacharacters that do not make sense right next to each other. Maybe you forgot to escape the '?' or something.
You need to escape the literal '?' and the literal '(' and ')' that you are trying to match.
Also, instead of '?+', I think you're looking for the non-greedy matching provided by '+?'.
See the Python re documentation for more.
For your case, try this:
pattern = r' (.+?) <i>\((.+?) replies\)'
As you're discovering, parsing arbitrary HTML is not easy to do correctly. That's what packages like Beautiful Soup are for. Note that you're calling it in your script but then not using the results. Refer to its documentation for examples of how to make your task a lot easier!
import urllib2
import re
from BeautifulSoup import BeautifulSoup
url = "http://forums.epicgames.com/archive/index.php?f-356-p-164.html"
page = urllib2.urlopen(url).read()
soup = BeautifulSoup(page)
# Get all the links
links = [str(match) for match in soup('a')]
s = r'<a href="[^"]+">(.+?)</a>'
r = re.compile(s)
for link in links:
    m = r.match(link)
    if m:
        print m.groups(1)[0]
To extend on what others wrote:
.? means "one or zero of any character"
.+ means "one or more of any character"
As you can hopefully see, combining the two makes no sense; they are different and contradictory "repeat" characters. So, your error about "multiple repeats" is because you combined those two "repeat" characters in your regular expression. To fix it, just decide which one you actually meant to use, and delete the other.
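A quick demonstration of the two cases:
import re

re.compile(r'(.+?)')      # fine: "+" is the repeat, the trailing "?" just makes it non-greedy

try:
    re.compile(r'(.?+)')  # "?" and "+" are two repeats in a row
except re.error as e:
    print e               # multiple repeat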
