Regular Expression to grab relative URL in crawled Javascript - python

I have a crawler setup with Scrapy and am trying to process links. The problem is the links are embedded in Javascript and I am struggling to create a regular expression. Here are 3 samples of what I am trying to process:
javascript:openInIFrame('main', 'setup.phtml%3f.op%3d3800%26.who%3dAAAAAAAAAAAA%26.menuItemRefNo=118')
javascript:window.open('overview.phtml?&.who=AAAAAAAAAAAA&.id=2', '43425235', 'menubar=no,toolbar=no,location=no,resizable=yes,maximize=yes');
javascript:openInIFrame('main', "page.phtml%3f.op%3d1499%26.who%3dAAAAAAAAAAAA%26.ifmod%3dtest&.menuItemRefNo=7")
The resulting relative URL for each would be between the single/double quotes:
setup.phtml%3f.op%3d3800%26.who%3dAAAAAAAAAAAA%26.menuItemRefNo=118
overview.phtml?&.who=AAAAAAAAAAAA&.id=2
page.phtml%3f.op%3d1499%26.who%3dAAAAAAAAAAAA%26.ifmod%3dtest&.menuItemRefNo=7
I have tried variations of '(.*?)' and (["'])(?:(?=(\\?))\2.)*?\1 but cannot seem to get it right. What am I missing here?

Maybe try something like this:
['"].*phtml.*['"]
http://regex101.com/r/lX6xX8/1

Try this
import re
# Match either call form, then capture everything between the next pair of quotes.
url_regex = re.compile(r"(?:javascript:openInIFrame\('main',|javascript:window\.open\()\s*(?:'|\")([^'\"]+)(?:'|\")")
samples = [
    "javascript:openInIFrame('main', 'setup.phtml%3f.op%3d3800%26.who%3dAAAAAAAAAAAA%26.menuItemRefNo=118')",
    "javascript:window.open('overview.phtml?&.who=AAAAAAAAAAAA&.id=2', '43425235', 'menubar=no,toolbar=no,location=no,resizable=yes,maximize=yes');",
    "javascript:openInIFrame('main', \"page.phtml%3f.op%3d1499%26.who%3dAAAAAAAAAAAA%26.ifmod%3dtest&.menuItemRefNo=7\")"
]
for sample in samples:
    md = url_regex.search(sample)
    if md:
        print(md.group(1))
    else:
        print('NO MATCH')
For me, this outputs:
setup.phtml%3f.op%3d3800%26.who%3dAAAAAAAAAAAA%26.menuItemRefNo=118
overview.phtml?&.who=AAAAAAAAAAAA&.id=2
page.phtml%3f.op%3d1499%26.who%3dAAAAAAAAAAAA%26.ifmod%3dtest&.menuItemRefNo=7
The trick is the ([^'\"]+). This captures any sequence of one or more characters, so long as the character is not a double or single quote. So basically, everything up to the end of the URL string, which is precisely the URL. Note that the \" is only necessary because the regex itself is delimited with "
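A looser variant, sketched under the assumption that the target is always a quoted path containing .phtml: instead of listing each JavaScript function, match any quoted string that contains .phtml and use a backreference so the closing quote must be the same character as the opening one. The names PHTML_RE and extract_relative_url are illustrative, not part of the answer above.

```python
import re

# Group 1: the opening quote; group 2: the quoted text, which must contain ".phtml".
# \1 requires the closing quote to be the same character as the opening quote.
PHTML_RE = re.compile(r"""(['"])([^'"]*\.phtml[^'"]*)\1""")

def extract_relative_url(js_href):
    """Return the first quoted .phtml path found in js_href, or None."""
    m = PHTML_RE.search(js_href)
    return m.group(2) if m else None

samples = [
    "javascript:openInIFrame('main', 'setup.phtml%3f.op%3d3800%26.who%3dAAAAAAAAAAAA%26.menuItemRefNo=118')",
    "javascript:window.open('overview.phtml?&.who=AAAAAAAAAAAA&.id=2', '43425235', 'menubar=no,toolbar=no,location=no,resizable=yes,maximize=yes');",
    'javascript:openInIFrame(\'main\', "page.phtml%3f.op%3d1499%26.who%3dAAAAAAAAAAAA%26.ifmod%3dtest&.menuItemRefNo=7")',
]

for sample in samples:
    print(extract_relative_url(sample))
```

Quoted arguments such as 'main' are skipped because they contain no .phtml, so the function name never has to appear in the pattern.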

Related

How to use Regex to extract a string from a specific string until a specific symbol in python?

Question
Assume that I have a string like this:
example_text = 'b\'\\x08\\x13"\\\\https://www.example.com/link_1.html\\xd2\\x01`https://www.example.com/link_2.html\''
Expectation
And I want to only extract the first url, which is
output = "https://www.example.com/link_1.html"
I think using a regex that matches a URL starting at "https" and ending at '\' would be a good solution.
If so, how can I write the regex pattern?
I tried something like this:
re.findall("https://([^\\\\)]+)", example_text)
output = ['www.example.com/link_1.html', 'www.example.com/link_2.html']
But then, I need to add "https://" back and choose the first item in the return.
Is there any other solution?
You need to tweak your regex a bit.
What you were doing before:
https://([^\\\\)]+) this matches your link but only captures the part after https:// since you used the capturing token after that.
Updated Regex:
(https://[^\\\\)]+) this matches the link and also captures the whole URL, because the group now wraps the entire pattern. (The : and / characters do not need escaping in Python; writing \: or \/ in a non-raw string literal only triggers invalid-escape warnings.)
In Code:
import re
text = 'b\'\\x08\\x13"\\\\https://www.example.com/link_1.html\\xd2\\x01`https://www.example.com/link_2.html\''
print(re.findall("(https://[^\\\\)]+)", text))
Output:
['https://www.example.com/link_1.html', "https://www.example.com/link_2.html'"]
You could also use (https://([^\\\\)]+)\\.html) to get each link both with and without https:// as a tuple; note that the second group then stops before the .html suffix. (Anchoring on .html also avoids the trailing ' that gets picked up on the last link.)
If you want only the first one, simply do output[0].
Try:
match = re.search(r"https://[^\\']+", example_text)
url = match.group()
print(url)
output:
https://www.example.com/link_1.html
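The difference between the attempts above comes down to how re.findall treats groups: with one capturing group it returns only the group, and with no group it returns the whole match. A minimal sketch on the question's example_text (the variable names are illustrative):

```python
import re

example_text = 'b\'\\x08\\x13"\\\\https://www.example.com/link_1.html\\xd2\\x01`https://www.example.com/link_2.html\''

# One capturing group: findall returns only what the group matched.
group_only = re.findall(r"https://([^\\']+)", example_text)

# No capturing group: findall returns each whole match, scheme included.
whole_match = re.findall(r"https://[^\\']+", example_text)

# re.search stops at the first match, so no indexing into a list is needed.
first = re.search(r"https://[^\\']+", example_text)
first_url = first.group() if first else None

print(group_only)
print(whole_match)
print(first_url)
```

Excluding both the backslash and the single quote in the character class keeps the trailing ' out of the last URL.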

extract URL from string in python

I want to extract a full URL from a string.
My code is:
import re
data = "ahahahttp://www.google.com/a.jpg>hhdhd"
print re.match(r'(ftp|http)://.*\.(jpg|png)$', data)
Output:
None
Expected Output
http://www.google.com/a.jpg
I found so many questions on StackOverflow, but none worked for me.
I have seen many posts and this is not a duplicate. Please help me! Thanks.
You were close!
Try this instead:
r'(ftp|http)://.*\.(jpg|png)'
You can visualize this here.
I would also make this non-greedy like this:
r'(ftp|http)://.*?\.(jpg|png)'
You can visualize this greedy vs. non-greedy behavior here and here.
By default, .* will match as much text as possible, but you want to match as little text as possible.
Your $ anchors the match at the end of the line, but the end of the URL is not the end of the line, in your example.
Another problem is that you're using re.match() and not re.search(). Using re.match() starts the match at the beginning of the string, and re.search() searches anywhere in the string. See here for more information.
You should use search instead of match.
import re
data = "ahahahttp://www.google.com/a.jpg>hhdhd"
# re.search scans the whole string; re.match would only try position 0.
url = re.search(r'(ftp|http)://.*\.(jpg|png)', data)
if url:
    print(url.group(0))
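The points made in these answers (match versus search, greedy versus non-greedy) can be checked side by side. This is a sketch only: the second URL appended to the question's data is an assumption added to make the greedy behaviour visible, and the groups are written as non-capturing so that findall returns whole URLs.

```python
import re

URL_RE_GREEDY = re.compile(r'(?:ftp|http)://.*\.(?:jpg|png)')
URL_RE_LAZY = re.compile(r'(?:ftp|http)://.*?\.(?:jpg|png)')

data = "ahahahttp://www.google.com/a.jpg>hhdhd"
two_urls = data + " http://example.com/b.png"  # hypothetical second URL

# match() only tries at position 0, where the text is "ahaha...", so it fails.
at_start = URL_RE_LAZY.match(data)

# search() scans forward until the pattern fits somewhere.
found = URL_RE_LAZY.search(data).group(0)

# Greedy .* runs to the last ".jpg"/".png" on the line; lazy .*? stops at the first.
greedy = URL_RE_GREEDY.search(two_urls).group(0)
lazy_all = URL_RE_LAZY.findall(two_urls)

print(at_start)
print(found)
print(greedy)
print(lazy_all)
```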
Find the start of the URL with find('http://') (or find('ftp://')) and the end with find('.jpg') (or find('.png')), then take the substring:
data = "ahahahttp://www.google.com/a.jpg>hhdhd"
start = data.find('http://')
kk = data[start:]
end = kk.find('.jpg')
# +4 keeps the ".jpg" extension itself in the result
print(kk[0:end+4])

Using re.findall() in Python for Web Crawling

I am trying to teach myself Python by writing a very simple web crawler with it.
The code for it is here:
#!/usr/bin/python
import sys, getopt, time, urllib, re

LINK_INDEX = 1
links = [sys.argv[len(sys.argv) - 1]]
visited = []
politeness = 10
maxpages = 20

def print_usage():
    print "USAGE:\n./crawl [-politeness <seconds>] [-maxpages <pages>] seed_url"

def parse_args():
    #code for parsing arguments (works fine so didnt need to be included here)

def crawl():
    global links, visited
    url = links.pop()
    visited.append(url)
    print "\ncurrent url: %s" % url
    response = urllib.urlopen(url)
    html = response.read()
    html = html.lower()
    raw_links = re.findall(r'<a href="[\w\.-]+"', html)
    print "found: %d" % len(raw_links)
    for raw_link in raw_links:
        temp = raw_link.split('"')
        if temp[LINK_INDEX] not in visited and temp[LINK_INDEX] not in links:
            links.append(temp[LINK_INDEX])
    print "\nunvisited:"
    for link in links:
        print link
    print "\nvisited:"
    for link in visited:
        print link

parse_args()
while len(visited) < maxpages and len(links) > 0:
    crawl()
    time.sleep(politeness)
print "politeness = %d, maxpages = %d" % (politeness, maxpages)
I created a small test network in the same working directory of about 10 pages that all link together in various ways, and it seems to work fine, but when I send it out onto the actual internet by itself, it is unable to parse links from files it gets.
It is able to get the html code fine, because I can print that out, but it seems that the re.findall() part is not doing what it is supposed to, because the links list never gets populated. Have I maybe written my regex wrong? It worked fine to find strings like <a href="test02.html" and then parse the link from that, but for some reason, it isn't working for actual web pages. It might be the http part perhaps that is throwing it off?
I've never used regex with Python before so I'm pretty sure that this is the problem. Can anyone give me any idea how express the pattern I am looking for better? Thanks!
The problem is with your regex. There are a whole bunch of ways I could write a valid HTML anchor that your regex wouldn't match. For example, there could be extra whitespace or line breaks in it, and there are other attributes that could exist that you haven't taken into account. Also, you take no account of different case. For example:
<a  href="foo">foo</a>
<A HREF="foo">foo</a>
<a class="bar" href="foo">foo</a>
None of these would be matched by your regex.
You probably want something more like this:
<a[^>]*href="(.*?)"
This will match an anchor tag start, followed by any characters other than > (so that we're still matching inside the tag). This might be things like a class or id attribute. The value of the href attribute is then captured in a capture group, which you can extract by
match.group(1)
The match for the href value is also non-greedy. This means it will match the smallest match possible. This is because otherwise if you have other tags on the same line, you'll match beyond what you want to.
Finally, you'll need to add the re.I flag to match in a case insensitive way.
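As a rough sketch of that suggestion, the pattern can be compiled with re.I and run through findall, which returns just the captured href values. The sample HTML is made up for illustration, and \s is added after <a here (an assumption beyond the pattern above) so that tags such as <abbr> are not picked up.

```python
import re

# <a, whitespace, any other attributes (anything except ">"), then href="...".
# re.I makes <A HREF=...> match as well; .*? stops at the first closing quote.
ANCHOR_RE = re.compile(r'<a\s[^>]*href="(.*?)"', re.I)

html = (
    '<a href="a.html">x</a> '
    '<A CLASS="bar" HREF="http://example.com/b/c.html">y</A>\n'
    '<a\n  id="z" href="d.html">z</a>'
)

links = ANCHOR_RE.findall(html)
print(links)
```

Because [^>] also matches newlines, an anchor whose attributes are split across lines is still found without needing re.S.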
Your regexp doesn't match all valid values for the href attribute, such as paths with slashes, and so on. Using [^"]+ (anything other than the closing double quote) instead of [\w\.-]+ would help, but it doesn't matter because... you should not parse HTML with regexps to begin with.
Lev already mentioned BeautifulSoup; you could also look at lxml. It will work better than any hand-crafted regexp you could write.
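To illustrate the parser-based approach without installing anything, here is a minimal sketch that swaps in Python 3's standard-library html.parser in place of BeautifulSoup or lxml. The class and function names (LinkCollector, extract_links) are illustrative.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href value of every <a> start tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_links(html):
    parser = LinkCollector()
    parser.feed(html)
    return parser.links

sample = (
    '<A HREF="x.html">x</A>'
    '<!-- <a href="hidden.html">h</a> -->'
    "<a class=bar href='y/z.html'>y</a>"
)
print(extract_links(sample))
```

Unlike the regex versions, this handles upper-case tags, single-quoted or unquoted attributes, and anchors inside HTML comments (which are ignored) without any extra pattern work.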
You probably want this:
raw_links = re.findall(r'<a href="(.+?)"', html)
Use the brackets to indicate what you want returned, otherwise you get the whole match including the <a href=... bit. Now you get everything until the closing quote mark, due to the use of a non-greedy +? operator.
A more discriminating filter might be:
raw_links = re.findall(r'<a href="([^">]+?)"', html)
this matches anything except a quote and a terminating bracket.
These simple REs will also match URLs that have been commented out, URL-like literal strings inside bits of JavaScript, etc. So be careful about using the results!

extracting facebook page from html using regex

I am trying to get the address of a facebook page of websites using regular expression search on the html
usually the link appears as
Facebook
but sometimes the address will be http://www.facebook.com/some.other
and sometimes with numbers.
At the moment the regex that I have is
'(facebook.com)\S\w+'
but it won't catch the last 2 possibilities.
What is it called when I want the regex to match something but not return it? (For instance, I want the regex to match the www.facebook.com part but not have that part in the result, only the part that comes after it.)
Note: I use Python with re and urllib2.
It seems to me your main issue is that you don't know enough regex yet.
fb_re = re.compile(r'www\.facebook\.com([^"]+)')
Then simply:
results = fb_re.findall(url)
Why this works:
In regular expressions, the part in parentheses () is what is captured. You were putting the www.facebook.com part in the parentheses, so nothing else was being returned.
Here I used a character set [] with the ^ operator to negate it, which means "anything not in the set", and then gave it the " character, so it will match anything that comes after www.facebook.com until it reaches a " and then stop.
Note: this catches Facebook links which are embedded in attributes. If the Facebook link is simply on the page in plain text you can use:
fb_re = re.compile(r'www\.facebook\.com(\S+)')
which grabs any run of non-whitespace characters, so it will stop at the first whitespace character.
If you are worried about links ending in periods, you can use:
fb_re = re.compile(r'www\.facebook\.com(\S+)\.\s')
which searches for the same thing as above, but stops at the end of a sentence: a . followed by whitespace such as a space or newline. This way it will still grab links like /some.other, but when you have things like /some.other. it will drop the last .
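On the question of matching something without returning it: besides putting only the wanted part in a capturing group, the feature is called a lookbehind assertion, written (?<=...). A small sketch, with made-up sample HTML and an illustrative pattern name:

```python
import re

# (?<=facebook\.com/) asserts that the text just before the match is
# "facebook.com/" without making it part of the match. Python requires
# lookbehind patterns to have a fixed width, which this one does.
FB_PATH_RE = re.compile(r'(?<=facebook\.com/)[^"\s<>]+')

html = (
    '<a href="http://www.facebook.com/some.other">Facebook</a> '
    '<a href="https://facebook.com/12345">FB</a>'
)

paths = FB_PATH_RE.findall(html)
print(paths)
```

Since the pattern has no capturing group, findall returns the whole match, which here is only the part after the domain.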
If I assume correctly, the URL is always in double quotes, right?
re.findall(r'"http://www\.facebook\.com(.+?)"',url)
Overall, trying to parse html with regex is a bad idea. I suggest you use an html parser like lxml.html to find the links and then use urlparse
>>> from urlparse import urlparse # in 3.x use from urllib.parse import urlparse
>>> url = 'http://www.facebook.com/some.other'
>>> parse_object = urlparse(url)
>>> parse_object.netloc
'www.facebook.com'
>>> parse_object.path
'/some.other'

Regex Matching Error

I am new to Python (I dont have any programming training either), so please keep that in mind as I ask my question.
I am trying to search a retrieved webpage and find all links using a specified pattern. I have done this successfully in other scripts, but I am getting an error that says
raise error, v # invalid expression
sre_constants.error: multiple repeat
I have to admit I do not know why, but again, I am new to Python and Regular Expressions. However, even when I don't use patterns and use a specific link (just to test the matching), I do not believe I get any matches (nothing is sent to the window when I print match.group(0)). The link I tested is commented out below.
Any ideas? It usually is easier for me to learn by example, but any advice you can give is greatly appreciated!
Brock
import urllib2
from BeautifulSoup import BeautifulSoup
import re
url = "http://forums.epicgames.com/archive/index.php?f-356-p-164.html"
page = urllib2.urlopen(url).read()
soup = BeautifulSoup(page)
pattern = r'(.?+) <i>((.?+) replies)'
#pattern = r'href="http://forums.epicgames.com/archive/index.php?t-622233.html">Gears of War 2: Horde Gameplay</a> <i>(20 replies)'
for match in re.finditer(pattern, page, re.S):
    print match(0)
That means your regular expression has an error.
(.?+)</a> <i>((.?+)
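What does ?+ mean? Both ? and + are metacharacters that do not make sense right next to each other. Maybe you forgot to escape the '?' or something. (Python 3.11 and later parse ?+ as a possessive quantifier, so the "multiple repeat" error only appears on older versions.)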
You need to escape the literal '?' and the literal '(' and ')' that you are trying to match.
Also, instead of '?+', I think you're looking for the non-greedy matching provided by '+?'.
More documentation here.
For your case, try this:
pattern = r' (.+?) <i>\((.+?) replies\)'
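Here is a small sketch of both fixes on the line that is commented out in the question. The href="..."> and </a> parts of the pattern are an assumption about what the page markup looks like; the point is the escaped parentheses and the non-greedy +?. It also shows why the literal test link found nothing: used as a pattern, its ? and ( ) are treated as metacharacters unless passed through re.escape.

```python
import re

line = ('<a href="http://forums.epicgames.com/archive/index.php?t-622233.html">'
        'Gears of War 2: Horde Gameplay</a> <i>(20 replies)')

# \( and \) match literal parentheses; +? makes each group as short as possible.
pattern = re.compile(r'href="(.+?)">(.+?)</a> <i>\((\d+) replies\)')
m = pattern.search(line)
link, title, replies = m.groups()

# The raw text used as a pattern does not match itself: "php?t" means
# "ph, optional p, t", and "(20 replies)" is read as a group.
unescaped_matches = re.search(line, line) is not None
escaped_matches = re.search(re.escape(line), line) is not None

print(link)
print(title)
print(replies)
print(unescaped_matches, escaped_matches)
```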
As you're discovering, parsing arbitrary HTML is not easy to do correctly. That's what packages like Beautiful Soup do. Note, you're calling it in your script but then not using the results. Refer to its documentation here for examples of how to make your task a lot easier!
import urllib2
import re
from BeautifulSoup import BeautifulSoup
url = "http://forums.epicgames.com/archive/index.php?f-356-p-164.html"
page = urllib2.urlopen(url).read()
soup = BeautifulSoup(page)
# Get all the links
links = [str(match) for match in soup('a')]
s = r'(.+?)'
r = re.compile(s)
for link in links:
    m = r.match(link)
    if m:
        print m.groups(1)[0]
To extend on what others wrote:
.? means "zero or one of any character"
.+ means "one or more of any character"
As you can hopefully see, combining the two makes no sense; they are different and contradictory "repeat" operators. So, your error about "multiple repeat" is because you stacked those two "repeat" operators in your regular expression. To fix it, just decide which one you actually meant to use, and delete the other.
