My problem is that I want to match URLs in HTML code that look like href='example.com' (or with "), but I only want to extract the actual URL. I tried matching the whole attribute and then using array magic to pull out the URL, but since the regex match is greedy, whenever there is more than one legitimate match I also get lots of spurious ones that start at one ' and end at another URL's '. What regex will suit my needs?
I would recommend NOT using regex to parse HTML. Your life will be much easier if you use something like BeautifulSoup!
It's as easy as this:
from bs4 import BeautifulSoup

HTML = """<a href="http://example.com/first">firstone</a><a href='http://example.com/second'>Ihaveurls</a>"""
s = BeautifulSoup(HTML, 'html.parser')
for a in s.find_all('a', href=True):
    print("My URL:", a['href'])
In case you want to solve it using a regular expression instead of other Python libraries, here is a solution.
import re

html = '<a href="http://example.com">example</a>'
pattern = r'href=\"(.*?)\"|href=\'(.*?)\''
multiple_match_links = re.findall(pattern, html)
if len(multiple_match_links) == 0:
    print("No Link Found")
else:
    print([x for x in list(multiple_match_links[0]) if len(x) > 0][0])
I'm very new to coding and I've tried to write a script that imports the current price of Litecoin from coinmarketcap. However, I can't get it to work; it prints an empty list.
import urllib
import re
htmlfile = urllib.urlopen('https://coinmarketcap.com/currencies/litecoin/')
htmltext = htmlfile.read()
regex = 'span class="text-large2" data-currency-value="">$304.08</span>'
pattern = re.compile(regex)
price = re.findall(pattern, htmltext)
print(price)
Out comes "[]". The problem is probably minor, but I'm very appreciative of any help.
Regular expressions are generally not the best tool for processing HTML. I suggest looking at something like BeautifulSoup.
For example:
import urllib
import bs4
f = urllib.urlopen("https://coinmarketcap.com/currencies/litecoin/")
soup = bs4.BeautifulSoup(f)
print(soup.find("", {"data-currency-value": True}).text)
This currently prints "299.97".
This probably does not perform as well as using a re for this simple case. However, see Using regular expressions to parse HTML: why not?
You need to change your regex and add a group in parentheses to capture the value.
To match something like <span class="text-large2" data-currency-value>300.59</span>, you need this regex:
regex = 'span class="text-large2" data-currency-value>(.*?)</span>'
The (.*?) group is used to capture the number.
You get:
['300.59']
I am trying to teach myself Python by writing a very simple web crawler with it.
The code for it is here:
#!/usr/bin/python
import sys, getopt, time, urllib, re

LINK_INDEX = 1
links = [sys.argv[len(sys.argv) - 1]]
visited = []
politeness = 10
maxpages = 20

def print_usage():
    print "USAGE:\n./crawl [-politeness <seconds>] [-maxpages <pages>] seed_url"

def parse_args():
    # code for parsing arguments (works fine so didn't need to be included here)
    pass

def crawl():
    global links, visited
    url = links.pop()
    visited.append(url)
    print "\ncurrent url: %s" % url
    response = urllib.urlopen(url)
    html = response.read()
    html = html.lower()
    raw_links = re.findall(r'<a href="[\w\.-]+"', html)
    print "found: %d" % len(raw_links)
    for raw_link in raw_links:
        temp = raw_link.split('"')
        if temp[LINK_INDEX] not in visited and temp[LINK_INDEX] not in links:
            links.append(temp[LINK_INDEX])
    print "\nunvisited:"
    for link in links:
        print link
    print "\nvisited:"
    for link in visited:
        print link

parse_args()
while len(visited) < maxpages and len(links) > 0:
    crawl()
    time.sleep(politeness)
print "politeness = %d, maxpages = %d" % (politeness, maxpages)
I created a small test network in the same working directory of about 10 pages that all link together in various ways, and it seems to work fine, but when I send it out onto the actual internet by itself, it is unable to parse links from files it gets.
It is able to get the html code fine, because I can print that out, but it seems that the re.findall() part is not doing what it is supposed to, because the links list never gets populated. Have I maybe written my regex wrong? It worked fine to find strings like <a href="test02.html" and then parse the link from that, but for some reason, it isn't working for actual web pages. It might be the http part perhaps that is throwing it off?
I've never used regex with Python before, so I'm pretty sure that this is the problem. Can anyone give me any idea how to express the pattern I am looking for better? Thanks!
The problem is with your regex. There are a whole bunch of ways I could write a valid HTML anchor that your regex wouldn't match. For example, there could be extra whitespace, or line breaks in it, and there are other attributes that could exist that you haven't taken into account. Also, you take no account of different case. For example:
<a href = "foo">foo</a>
<A HREF="foo">foo</A>
<a class="bar" href="foo">foo</a>
None of these would be matched by your regex.
You probably want something more like this:
<a[^>]*href="(.*?)"
This will match an anchor tag start, followed by any characters other than > (so that we're still matching inside the tag). This might be things like a class or id attribute. The value of the href attribute is then captured in a capture group, which you can extract by
match.group(1)
The match for the href value is also non-greedy. This means it will match the smallest match possible. This is because otherwise if you have other tags on the same line, you'll match beyond what you want to.
Finally, you'll need to add the re.I flag to match in a case insensitive way.
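Putting those pieces together, here's a minimal sketch (the sample HTML is made up):

import re

html = '<A CLASS="bar"\n   HREF="http://example.com/page">foo</A>'
# [^>]* skips over other attributes and whitespace inside the tag,
# (.*?) captures the href value non-greedily, and re.I ignores case.
for m in re.finditer(r'<a[^>]*href="(.*?)"', html, re.I):
    print(m.group(1))  # http://example.com/page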
Your regexp doesn't match all valid values for the href attribute, such as paths with slashes, and so on. Using [^"]+ (anything other than the closing double quote) instead of [\w\.-]+ would help, but it doesn't really matter, because… you should not parse HTML with regexps to begin with.
Lev already mentioned BeautifulSoup; you could also look at lxml. It will work better than any hand-crafted regexp you could write.
You probably want this:
raw_links = re.findall(r'<a href="(.+?)"', html)
Use the parentheses to indicate the group you want returned, otherwise you get the whole match including the <a href=... bit. Now you get everything up to the closing quote mark, thanks to the non-greedy +? operator.
A more discriminating filter might be:
raw_links = re.findall(r'<a href="([^">]+?)"', html)
this matches anything except a quote or a closing angle bracket.
These simple REs will also match URLs that have been commented out, URL-like literal strings inside bits of JavaScript, and so on. So be careful about using the results!
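For instance, here's the commented-out case in action:

import re

html = '<!-- <a href="http://old.example.com/">stale link</a> -->'
# The regex has no idea the anchor sits inside an HTML comment.
print(re.findall(r'<a href="([^">]+?)"', html))  # ['http://old.example.com/']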
I am looking for a regular expression pattern that matches URLs in HTML that aren't wrapped in an 'a' tag, in order to then wrap them in an 'a' tag (i.e. highlight all non-highlighted links).
Input is simple HTML with 'a', 'b', 'i', 'br', 'p' and 'img' tags allowed. No other HTML tags should appear in the input, but the tags above can appear in any combination.
So the pattern should skip all URLs that are part of existing 'a' tags, and match all other links that are plain text, not wrapped in 'a' tags, and thus not yet highlighted as hyperlinks. It would be good if the pattern matched URLs beginning with http://, https:// or www., and ending with .net, .com or .org if the URL doesn't begin with http://, https:// or www.
I've tried something like '(?!<[aA][^>]+>)http://[a-zA-Z0-9._-]+(?!</[aA]>)' to match a simpler case than the one described above, but it seems that this task is not so obvious.
Thanks much for any help.
You could use BeautifulSoup or similar to exclude all urls that are already part of links.
Then you can match the plain text with one of the URL regular expressions that are already out there (google "url regular expression"; which one you want depends on how fancy you want to get).
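As a sketch of that two-step approach, assuming bs4 is available and using a deliberately simplistic URL pattern (a real one would be more involved):

import re
from bs4 import BeautifulSoup

URL_RE = re.compile(r'(?:https?://|www\.)\S+')

def highlight_links(html):
    soup = BeautifulSoup(html, 'html.parser')
    for text in soup.find_all(string=True):
        if text.find_parent('a'):
            continue  # already inside a link, leave it alone
        wrapped = URL_RE.sub(r'<a href="\g<0>">\g<0></a>', text)
        if wrapped != text:
            text.replace_with(BeautifulSoup(wrapped, 'html.parser'))
    return str(soup)

print(highlight_links('<p>See www.example.com and <a href="http://x.org">x</a></p>'))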
Parsing HTML with a single regex is almost impossible by definition, since regexes don't have state and so can't track nested tags.
Build/Use a real parser instead. Maybe BeautifulSoup or html5lib.
This code below uses BeautifulSoup to extract all links from the page:
from BeautifulSoup import BeautifulSoup
from urllib2 import urlopen
url = 'http://stackoverflow.com/questions/1296778/'
stream = urlopen(url)
soup = BeautifulSoup(stream)
for link in soup.findAll('a'):
    if link.has_key('href'):
        print unicode(link.string), '->', link['href']
Similarly you could find all text using soup.findAll(text=True) and search for urls there.
Searching for URLs is also very complex - you wouldn't believe what's allowed in a URL. A simple search shows thousands of example patterns, but none matches the spec exactly. You should try what works best for you.
Thanks guys! Below is my solution:
import re
from django.utils.html import urlize  # Yes, I am using Django's urlize to do all the dirty work :)

def urlize_html(value):
    """
    Urlizes text containing simple HTML tags.
    """
    A_IMG_REGEX = r'(<[aA][^>]+>[^<]+</[aA]>|<[iI][mM][gG][^>]+>)'
    a_img_re = re.compile(A_IMG_REGEX)
    TAG_REGEX = r'(<[a-zA-Z]+[^>]+>|</[a-zA-Z]>)'
    tag_re = re.compile(TAG_REGEX)

    def process(s, p, f):
        # Split on the pattern, keep matched chunks as-is, and run f over the rest.
        return "".join([c if p.match(c) else f(c) for c in p.split(s)])

    def process_urlize(s):
        return process(s, tag_re, urlize)

    return process(value, a_img_re, process_urlize)
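A quick usage sketch (the exact output markup depends on the Django version providing urlize):

print(urlize_html('Hi <b>there</b>, see www.example.com and <a href="http://known.org">known</a>'))
# The bare www.example.com gets wrapped in an <a> tag, while the
# existing link and the <b> tag pass through untouched.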
I am new to Python (I don't have any programming training either), so please keep that in mind as I ask my question.
I am trying to search a retrieved webpage and find all links using a specified pattern. I have done this successfully in other scripts, but I am getting an error that says
raise error, v # invalid expression
sre_constants.error: multiple repeat
I have to admit I do not know why, but again, I am new to Python and regular expressions. However, even when I don't use patterns and use a specific link (just to test the matching), I do not believe I get any matches (nothing is printed when I print match.group(0)). The link I tested is commented out below.
Any ideas? It usually is easier for me to learn by example, but any advice you can give is greatly appreciated!
Brock
import urllib2
from BeautifulSoup import BeautifulSoup
import re

url = "http://forums.epicgames.com/archive/index.php?f-356-p-164.html"
page = urllib2.urlopen(url).read()
soup = BeautifulSoup(page)

pattern = r'(.?+)</a> <i>((.?+) replies)'
#pattern = r'href="http://forums.epicgames.com/archive/index.php?t-622233.html">Gears of War 2: Horde Gameplay</a> <i>(20 replies)'
for match in re.finditer(pattern, page, re.S):
    print match.group(0)
That means your regular expression has an error.
(.?+)</a> <i>((.?+)
What does ?+ mean? Both ? and + are metacharacters that do not make sense right next to each other. Maybe you forgot to escape the '?' or something.
You need to escape the literal '?' and the literal '(' and ')' that you are trying to match.
Also, instead of '?+', I think you're looking for the non-greedy matching provided by '+?'.
More documentation here.
For your case, try this:
pattern = r'(.+?)</a> <i>\((.+?) replies\)'
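A quick check of that pattern against made-up HTML shaped like the forum's markup (note that group 1 also swallows the opening <a ...> tag, since nothing anchors the start of the match):

import re

page = '<a href="index.php?t-1.html">Gears of War 2</a> <i>(20 replies)</i>'
pattern = r'(.+?)</a> <i>\((.+?) replies\)'
m = re.search(pattern, page, re.S)
print(m.group(2))  # 20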
As you're discovering, parsing arbitrary HTML is not easy to do correctly. That's what packages like Beautiful Soup do. Note that you're calling it in your script but then not using the results. Refer to its documentation here for examples of how to make your task a lot easier!
import urllib2
import re
from BeautifulSoup import BeautifulSoup

url = "http://forums.epicgames.com/archive/index.php?f-356-p-164.html"
page = urllib2.urlopen(url).read()
soup = BeautifulSoup(page)

# Get all the links
links = [str(match) for match in soup('a')]

s = r'<a href="(.+?)">'
r = re.compile(s)
for link in links:
    m = r.match(link)
    if m:
        print m.groups(1)[0]
To expand on what others wrote:
.? means "one or zero of any character"
.+ means "one or more of any character"
As you can hopefully see, combining the two makes no sense; they are different and contradictory "repeat" operators. So your "multiple repeat" error comes from combining those two "repeat" operators in your regular expression. To fix it, just decide which one you actually meant to use, and delete the other.
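You can reproduce the error in isolation; note that Python 3.11 added possessive quantifiers to re, so on 3.11+ the pattern actually compiles and the error only appears on older versions:

import re

try:
    re.compile(r'(.?+)')  # '?' immediately followed by '+'
    print('compiled: this interpreter treats ?+ as a possessive quantifier')
except re.error as exc:
    print('re.error:', exc)  # on older versions: multiple repeat at position 3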
I want a regular expression to extract the title from a HTML page. Currently I have this:
title = re.search('<title>.*</title>', html, re.IGNORECASE).group()
if title:
    title = title.replace('<title>', '').replace('</title>', '')
Is there a regular expression to extract just the contents of <title> so I don't have to remove the tags?
Use ( ) in the regexp and group(1) in Python to retrieve the captured string (re.search will return None if it doesn't find the result, so don't use group() directly):
title_search = re.search('<title>(.*)</title>', html, re.IGNORECASE)
if title_search:
    title = title_search.group(1)
Note that starting in Python 3.8, and the introduction of assignment expressions (PEP 572) (:= operator), it's possible to improve a bit on Krzysztof Krasoń's solution by capturing the match result directly within the if condition as a variable and re-use it in the condition's body:
# pattern = '<title>(.*)</title>'
# text = '<title>hello</title>'
if match := re.search(pattern, text, re.IGNORECASE):
    title = match.group(1)
    # hello
Try using capturing groups:
title = re.search('<title>(.*)</title>', html, re.IGNORECASE).group(1)
May I recommend Beautiful Soup? It is a very good lib for parsing your whole HTML document.

soup = BeautifulSoup(html_doc)
titleName = soup.title.string
Try:
title = re.search('<title>(.*)</title>', html, re.IGNORECASE).group(1)
re.search('<title>(.*)</title>', s, re.IGNORECASE).group(1)
The provided pieces of code do not cope with the no-match case: re.search returns None, and calling .group() on None raises an AttributeError.
May I suggest
getattr(re.search(r"<title>(.*)</title>", s, re.IGNORECASE), 'groups', lambda:[u""])()[0]
This returns the first match, or an empty string by default if the pattern was not found.
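For example, with a made-up document that has no title:

import re

s = '<html><head></head><body>no title here</body></html>'
title = getattr(re.search(r"<title>(.*)</title>", s, re.IGNORECASE), 'groups', lambda: [u""])()[0]
print(repr(title))  # '' - no AttributeError despite the missing match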
I'd think this should suffice:
#!python
import re
pattern = re.compile(r'<title>([^<]*)</title>', re.MULTILINE|re.IGNORECASE)
pattern.search(text)
... assuming that your text (HTML) is in a variable named "text."
This also assumes that there are no other HTML tags which can be legally embedded inside of an HTML TITLE tag and there exists no way to legally embed any other < character within such a container/block.
However ...
Don't use regular expressions for HTML parsing in Python. Use an HTML parser! (Unless you're going to write a full parser, which would be a lot of extra and redundant work when various HTML, SGML and XML parsers are already in the standard libraries.)
If you're handling "real world" tag soup HTML (which is frequently non-conforming to any SGML/XML validator) then use the BeautifulSoup package. It isn't in the standard libraries (yet) but is widely recommended for this purpose.
Another option is: lxml ... which is written for properly structured (standards conformant) HTML. But it has an option to fallback to using BeautifulSoup as a parser: ElementSoup.
The currently top-voted answer by Krzysztof Krasoń fails with <title>a</title><title>b</title> (the greedy .* spans from the first opening tag to the last closing one). Also, it ignores title tags that cross line boundaries (e.g., tags broken across lines to keep line lengths down). Finally, it fails with <title >a</title> (which is valid HTML: white space inside XML/HTML tags).
I therefore propose the following improvement:
import re
def search_title(html):
    m = re.search(r"<title\s*>(.*?)</title\s*>", html, re.IGNORECASE | re.DOTALL)
    return m.group(1) if m else None
Test cases:
print(search_title("<title >with spaces in tags</title >"))
print(search_title("<title\n>with newline in tags</title\n>"))
print(search_title("<title>first of two titles</title><title>second title</title>"))
print(search_title("<title>with newline\n in title</title\n>"))
Output:
with spaces in tags
with newline in tags
first of two titles
with newline
in title
Ultimately, I go along with others in recommending an HTML parser - among other reasons, to handle non-standard use of HTML tags.
I needed something to match package-0.0.1 (name, version) but wanted to reject an invalid version such as 0.0.010.
See regex101 example.
import re
RE_IDENTIFIER = re.compile(r'^([a-z]+)-((?:(?:0|[1-9](?:[0-9]+)?)\.){2}(?:0|[1-9](?:[0-9]+)?))$')
example = 'hello-0.0.1'
if match := RE_IDENTIFIER.search(example):
    name, version = match.groups()
    print(f'Name: {name}')
    print(f'Version: {version}')
else:
    raise ValueError(f'Invalid identifier {example}')
Output:
Name: hello
Version: 0.0.1
Is there a particular reason why no one suggested using lookahead and lookbehind? I got here trying to do the exact same thing, and (?<=<title>).+(?=</title>) works great. It will only match what's between the tags, so you don't have to deal with extracting a group at all.
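A minimal sketch of the lookaround approach:

import re

html = '<title>Hello World</title>'
m = re.search(r'(?<=<title>).+(?=</title>)', html, re.IGNORECASE)
if m:
    print(m.group())  # Hello World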