extracting facebook page from html using regex - python

I am trying to get the address of a facebook page of websites using regular expression search on the html
usually the link appears as
Facebook
but sometimes the address will be http://www.facebook.com/some.other
and sometimes with numbers
at the moment the regex that I have is
'(facebook.com)\S\w+'
but it won't catch the last 2 possibilities
what is it called when I want the regex to search for something but not fetch it? (for instance, I want the regex to match the www.facebook.com part but not have that part in the result, only the part that comes after it)
note I use python with re and urllib2

It seems to me your main issue is that you don't yet know enough regex.
fb_re = re.compile(r'www\.facebook\.com([^"]+)')
then simply:
results = fb_re.findall(url)
why this works:
In regular expressions, the part in parentheses () is what gets captured. You were putting the facebook.com part in the parentheses, so nothing else was being returned.
Here I used a character set [] and negated it with the ^ operator, which means "anything not in the set", and gave it the " character. So it will match everything that comes after www.facebook.com until it reaches a " and then stop.
Note: this catches Facebook links that are embedded in an attribute. If the Facebook link is simply on the page in plain text, you can use:
fb_re = re.compile(r'www\.facebook\.com(\S+)')
which grabs any run of non-white-space characters, so it will stop at the first white-space character.
If you are worried about links ending in periods, you can use:
fb_re = re.compile(r'www\.facebook\.com(\S+)\.\s')
which tells it to search for the same as above, but stop at the end of a sentence: a . followed by any white-space, such as a space or a newline. This way it will still grab links like /some.other, but for something like /some.other. it will drop the trailing period. Note that this version only matches links that are followed by a period and white-space.
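To answer the "match it but don't fetch it" part of the question directly: the two usual names are a capturing group (only the parenthesized part is returned) and a lookbehind assertion (the prefix must be present but is not part of the match). Here is a minimal sketch of both, assuming Python 3 and a made-up HTML snippet that is not taken from any real page:

```python
import re

# Hypothetical HTML snippet for illustration only.
html = (
    '<a href="http://www.facebook.com/some.other">Facebook</a> '
    '<a href="http://www.facebook.com/page123">Facebook</a>'
)

# Capturing group: the host is matched, but only the group is returned.
with_group = re.findall(r'facebook\.com/([^"\s]+)', html)

# Lookbehind: the host must precede the match but is not included in it.
with_lookbehind = re.findall(r'(?<=facebook\.com/)[^"\s]+', html)

print(with_group)
print(with_lookbehind)
```

Both calls return the same list here. Python's re module only accepts fixed-width lookbehinds, so the capturing group is usually the more flexible choice.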

If I assume correctly, the URL is always in double quotes, right?
re.findall(r'"http://www.facebook.com(.+?)"',url)
Overall, trying to parse html with regex is a bad idea. I suggest you use an html parser like lxml.html to find the links and then use urlparse
>>> from urlparse import urlparse # in 3.x use from urllib.parse import urlparse
>>> url = 'http://www.facebook.com/some.other'
>>> parse_object = urlparse(url)
>>> parse_object.netloc
'www.facebook.com'
>>> parse_object.path
'/some.other'
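The parser-plus-urlparse approach could look roughly like the sketch below. It assumes Python 3 and swaps in the standard library's html.parser for lxml.html so that it runs without extra installs; the LinkCollector class, the facebook_paths helper, and the sample markup are all made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # The parser lower-cases tag and attribute names for us.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def facebook_paths(html):
    """Return the path of every link that points at facebook.com."""
    collector = LinkCollector()
    collector.feed(html)
    paths = []
    for href in collector.hrefs:
        parts = urlparse(href)
        # Accept facebook.com with or without a subdomain such as www.
        if parts.netloc == "facebook.com" or parts.netloc.endswith(".facebook.com"):
            paths.append(parts.path)
    return paths


# Hypothetical sample markup.
sample = (
    '<p><a class="social" href="http://www.facebook.com/some.other">Facebook</a>'
    '<a href="http://example.com/about">About</a></p>'
)
print(facebook_paths(sample))
```

Because the host check happens on the parsed URL, attribute order, extra attributes, and letter case in the tag no longer matter.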

Related

Capturing data from a web page using PYTHON

I want to capture texts from the below link and save it.
http://forecast.weather.gov/product.php?site=NWS&issuedby=FWD&product=RR5&format=CI&version=44&glossary=0
I need to save only the text after .A; I do not need the other text on the page. Moreover, there are 50 different links at the top of the page, and I want to get the data from all of them.
I have written the code below, but it returns nothing. How can I get specifically the part that I need?
import urllib
import re
htmlfile=urllib.urlopen("http://forecast.weather.gov/product.php?site=NWS&issuedby=FWD&product=RR5&format=CI&version=1&glossary=0")
htmltext=htmlfile.read()
regex='<pre class="glossaryProduct">(.+?)</pre>'
pattern=re.compile(regex)
out=re.findall(pattern, htmltext)
print (out)
I also used the following that returns all the content of the page:
import urllib
file1 = urllib.urlopen('http://forecast.weather.gov/product.php?site=NWS&issuedby=FWD&product=RR5&format=txt&version=1&glossary=0')
s1 = file1.read()
print(s1)
Can you help me to do so?
Your regex is not capturing anything because your content starts with a newline, and you did not enable your . to include newlines. If you change your compile line to
pattern=re.compile(regex,re.S)
It should work.
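A small sketch of the difference the flag makes, assuming Python 3 and a short made-up string standing in for the downloaded page (the record contents are invented, and reading "the text after .A" as "lines that begin with .A" is also an assumption):

```python
import re

# Hypothetical stand-in for the page; the <pre> content starts on a new line.
html = '<pre class="glossaryProduct">\n.A FWD 0101\n.A DAL 0202\n</pre>'

regex = r'<pre class="glossaryProduct">(.+?)</pre>'

# Without re.S the dot cannot cross the newline, so nothing matches.
without_flag = re.findall(regex, html)

# With re.S (also spelled re.DOTALL) the dot matches newlines too.
with_flag = re.findall(regex, html, re.S)

# Keep only the lines that begin with ".A".
records = [line for line in with_flag[0].splitlines() if line.startswith(".A")]

print(without_flag)
print(records)
```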
Also you may want to look at:
https://regex101.com
It shows you exactly what your regex is doing. When I enabled the s (single line) flag there, the pattern started matching exactly as it should.

Regex to capture url until a certain character

With a url such as
https://search.yahoo.com/search?p=Fetty+Wap&fr=fp-tts&
I am using
pat = re.compile('<a href="(https?://.*?)".*',re.DOTALL)
as a search pattern.
I want to pick any url like the yahoo url above, but I want to capture the url up to the literal ? in the actual url.
In other words I want to extract the url up to ?, knowing that all the urls I'm parsing don't have the ? character. In such a case I need to capture all of the url.
The above regex works and extracts the url but goes to the end of the url. How can I get it to stop at the first ? it encounters, and keep going to the end if it doesn't encounter a ?
Regex is really the wrong tool for the job. Doing a basic string split will get you exactly what you want.
def beforeQuestionMrk(inputStr):
    # Everything before the first "?" (the whole string if there is none)
    return inputStr.split("?")[0]
url = "https://search.yahoo.com/sometext"
url2 = "https://search.yahoo.com/search?p=Fetty+Wap&fr=fp-tts&"
print(beforeQuestionMrk(url))
print(beforeQuestionMrk(url2))
#https://search.yahoo.com/sometext
#https://search.yahoo.com/search
If you really wanted to use regex, I suppose you could do the following (note that the match includes the ? itself):
import re
def getBeforeQuestRegex(inputStr):
    return re.search(r"(.+?\?|.+)", inputStr).group(0)
print(getBeforeQuestRegex("https://search.yahoo.com/search?p=Fetty+Wap&fr=fp-tts&"))
print(getBeforeQuestRegex("https://search.yahoo.com/sometext"))
#https://search.yahoo.com/search?
#https://search.yahoo.com/sometext
Bobble bubble's solution above worked very well for me:
"You can try like this by use of a negated class: ]*?href="(http[^"?]+)" (from bobble bubble's answer).
url looks like this
https://search.yahoo.com/search?p=Justin+Bieber&fr=fp-tts&fr2=p:fp,m:tn,ct:all......
or it could be something like this
https://www.yahoo.com/style/5-joyful-bob-ross-tees-202237009.html
The objective was to extract the full URL if there was no literal ? in it, but if there was one, to stop just before the literal ?.
That was Bobble bubble's answer, and it works very cleanly and does what I wanted. Again, thank you to everyone for participating in this discussion; I really appreciate it.
I agree with the other answer that using a regexp here is not a solution, especially because there may be any number of attributes between the opening of the <a> tag and the href attribute, and there can be a newline in between too.
But, to answer the initial question:
'*', '+', and '?' qualifiers are all greedy - they match as much text as possible
that's why there are non-greedy versions of them:
'*?', '+?' and '??'
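A quick sketch of how the greedy and non-greedy forms behave on the URL from the question, assuming Python 3. The surrounding tag and its class attribute are made up for illustration, and the last pattern is a variant of the negated-class idea rather than a quote of anyone's answer:

```python
import re

# Hypothetical tag wrapped around the URL from the question.
tag = '<a href="https://search.yahoo.com/search?p=Fetty+Wap&fr=fp-tts&" class="x">'

# Greedy: .* runs to the LAST double quote in the string.
greedy = re.search(r'href="(.*)"', tag).group(1)

# Non-greedy: .*? stops at the FIRST double quote.
lazy = re.search(r'href="(.*?)"', tag).group(1)

# Negated class: stop at the first " or ?, whichever comes first.
before_query = re.search(r'href="(https?://[^"?]+)', tag).group(1)

print(greedy)
print(lazy)
print(before_query)
```

The negated class also handles URLs without a ?, because it simply runs until the closing quote in that case.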

Using re.findall() in Python for Web Crawling

I am trying to teach myself Python by writing a very simple web crawler with it.
The code for it is here:
#!/usr/bin/python
import sys, getopt, time, urllib, re
LINK_INDEX = 1
links = [sys.argv[len(sys.argv) - 1]]
visited = []
politeness = 10
maxpages = 20
def print_usage():
    print "USAGE:\n./crawl [-politeness <seconds>] [-maxpages <pages>] seed_url"
def parse_args():
    #code for parsing arguments (works fine so didnt need to be included here)
    pass
def crawl():
    global links, visited
    url = links.pop()
    visited.append(url)
    print "\ncurrent url: %s" % url
    response = urllib.urlopen(url)
    html = response.read()
    html = html.lower()
    raw_links = re.findall(r'<a href="[\w\.-]+"', html)
    print "found: %d" % len(raw_links)
    for raw_link in raw_links:
        temp = raw_link.split('"')
        if temp[LINK_INDEX] not in visited and temp[LINK_INDEX] not in links:
            links.append(temp[LINK_INDEX])
    print "\nunvisited:"
    for link in links:
        print link
    print "\nvisited:"
    for link in visited:
        print link
parse_args()
while len(visited) < maxpages and len(links) > 0:
    crawl()
    time.sleep(politeness)
print "politeness = %d, maxpages = %d" % (politeness, maxpages)
I created a small test network in the same working directory of about 10 pages that all link together in various ways, and it seems to work fine, but when I send it out onto the actual internet by itself, it is unable to parse links from files it gets.
It is able to get the html code fine, because I can print that out, but it seems that the re.findall() part is not doing what it is supposed to, because the links list never gets populated. Have I maybe written my regex wrong? It worked fine to find strings like <a href="test02.html" and then parse the link from that, but for some reason, it isn't working for actual web pages. It might be the http part perhaps that is throwing it off?
I've never used regex with Python before, so I'm pretty sure that this is the problem. Can anyone give me an idea of how to express the pattern I am looking for better? Thanks!
The problem is with your regex. There are a whole bunch of ways I could write a valid HTML anchor that your regex wouldn't match. For example, there could be extra whitespace, or line breaks in it, and there are other attributes that could exist that you haven't taken into account. Also, you take no account of different case. For example:
foo
foo
<a class="bar" href="foo">foo</a>
None of these would be matched by your regex.
You probably want something more like this:
<a[^>]*href="(.*?)"
This will match an anchor tag start, followed by any characters other than > (so that we're still matching inside the tag). This might be things like a class or id attribute. The value of the href attribute is then captured in a capture group, which you can extract by
match.group(1)
The match for the href value is also non-greedy. This means it will match the smallest match possible. This is because otherwise if you have other tags on the same line, you'll match beyond what you want to.
Finally, you'll need to add the re.I flag to match in a case insensitive way.
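Put together, a rough sketch of that pattern in use might look like this, assuming Python 3 and a few made-up anchors that cover the cases described above (upper case, an extra attribute, a line break inside the tag):

```python
import re

# Hypothetical anchors for illustration only.
html = '''<A HREF="first.html">one</A>
<a class="bar" href="http://example.com/a/b">two</a>
<a
   href="third.html">three</a>'''

# [^>]* also matches newlines, so a line break inside the tag is fine.
anchor_re = re.compile(r'<a[^>]*href="(.*?)"', re.I)
links = anchor_re.findall(html)

# The pattern from the question, applied to lower-cased HTML as in the crawler.
old = re.findall(r'<a href="[\w\.-]+"', html.lower())

print(links)
print(old)
```

The original pattern only finds the first anchor, and it returns the whole matched text rather than just the URL because it has no capture group.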
Your regexp doesn't match all valid values for the href attribute, such as paths with slashes, and so on. Using [^"]+ (anything different from the closing double quote) instead of [\w\.-]+ would help, but it doesn't matter because... you should not parse HTML with regexps to begin with.
Lev already mentioned BeautifulSoup; you could also look at lxml. It will work better than any hand-crafted regexp you could write.
You probably want this:
raw_links = re.findall(r'<a href="(.+?)"', html)
Use the parentheses to indicate what you want returned; otherwise you get the whole match, including the <a href=... bit. Now you get everything up to the closing quote mark, due to the use of the non-greedy +? operator.
A more discriminating filter might be:
raw_links = re.findall(r'<a href="([^">]+?)"', html)
This matches anything except a quote mark or a closing angle bracket.
These simple REs will also match URLs that have been commented out, URL-like literal strings inside bits of JavaScript, etc. So be careful about using the results!

Extracting a string from a txt file

So I'm just experimenting, trying to parse through the web using Python, and I thought I would try to make a script that would search for my favorite links to watch shows online. I'm now trying to have my program search through sidereel.com for a good link to my desired show and return the links to me. I know that the site saves the links in the following format:
'watch-freeseries.mu' then some long string that I need to ignore, followed by '14792088'
So what I need to do is find this string in the txt file of the site and return only the 8 digits at the end of the string. I'm not sure how I can get to the numbers, and I need them because they are the link number. Any help would be much appreciated.
You could use a regular expression to do this fairly easily.
>>> import re
>>> text = "watch-freeseries.mu=lklsflamflkasfmsaldfasmf14792088"
>>> expr = re.compile(r"watch\-freeseries\.mu.*?(\d{8})")
>>> expr.findall(text)
['14792088']
A breakdown of the expression:
watch\-freeseries\.mu - Match the start of the expected expression. Escape any possible special characters by preceding them with \.
.*? - Match any characters. . means any character and * means zero or more repetitions of it. The ? makes the match non-greedy, so it takes as few characters as possible and will not run across into the next URL if two or more URLs show up in the same string.
(\d{8}) - Match and capture the first run of 8 consecutive digits that follows
Note: If you're trying to parse links out of a webpage there are easier ways. I've seen many recommendations on StackOverflow for the BeautifulSoup package in particular. I've never used it myself so YMMV.

Regex Matching Error

I am new to Python (I dont have any programming training either), so please keep that in mind as I ask my question.
I am trying to search a retrieved webpage and find all links using a specified pattern. I have done this successfully in other scripts, but I am getting an error that says
raise error, v # invalid expression
sre_constants.error: multiple repeat
I have to admit I do not know why, but again, I am new to Python and regular expressions. However, even when I don't use patterns and use a specific link (just to test the matching), I do not believe I get any matches (nothing is sent to the window when I print match.group(0)). The link I tested is commented out below.
Any ideas? It usually is easier for me to learn by example, but any advice you can give is greatly appreciated!
Brock
import urllib2
from BeautifulSoup import BeautifulSoup
import re
url = "http://forums.epicgames.com/archive/index.php?f-356-p-164.html"
page = urllib2.urlopen(url).read()
soup = BeautifulSoup(page)
pattern = r'(.?+) <i>((.?+) replies)'
#pattern = r'href="http://forums.epicgames.com/archive/index.php?t-622233.html">Gears of War 2: Horde Gameplay</a> <i>(20 replies)'
for match in re.finditer(pattern, page, re.S):
    print match(0)
That means your regular expression has an error.
(.?+)</a> <i>((.?+)
What does ?+ mean? Both ? and + are metacharacters that do not make sense right next to each other. Maybe you forgot to escape the '?' or something.
You need to escape the literal '?' and the literal '(' and ')' that you are trying to match.
Also, instead of '?+', I think you're looking for the non-greedy matching provided by '+?'.
More documentation here.
For your case, try this:
pattern = r' (.+?) <i>\((.+?) replies\)'
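As a rough check of the escaped, non-greedy version, here is a sketch assuming Python 3. The sample line is built from the commented-out test link in the question, with a closing </i> added as an assumption, and the href and </a> parts of the pattern are likewise assumed from that commented-out line:

```python
import re

# Sample line based on the commented-out link in the question;
# the trailing </i> is an assumption about the real markup.
line = ('<a href="http://forums.epicgames.com/archive/index.php?t-622233.html">'
        'Gears of War 2: Horde Gameplay</a> <i>(20 replies)</i>')

# +? is the non-greedy repeat; \( and \) match literal parentheses.
pattern = re.compile(r'href="(.+?)">(.+?)</a> <i>\((.+?) replies\)', re.S)

results = [match.groups() for match in pattern.finditer(line)]
for url, title, replies in results:
    print(url, title, replies)
```

Each match yields the link, the thread title, and the reply count as separate groups, which match.group(1) through match.group(3) would also return individually.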
As you're discovering, parsing arbitrary HTML is not easy to do correctly. That's what packages like Beautiful Soup do. Note, you're calling it in your script but then not using the results. Refer to its documentation here for examples of how to make your task a lot easier!
import urllib2
import re
from BeautifulSoup import BeautifulSoup
url = "http://forums.epicgames.com/archive/index.php?f-356-p-164.html"
page = urllib2.urlopen(url).read()
soup = BeautifulSoup(page)
# Get all the links
links = [str(match) for match in soup('a')]
s = r'(.+?)'
r = re.compile(s)
for link in links:
    m = r.match(link)
    if m:
        print m.groups(1)[0]
To extend on what others wrote:
.? means "one or zero of any character"
.+ means "one or more of any character"
As you can hopefully see, combining the two makes no sense; they are different and contradictory "repeat" characters. So, your "multiple repeat" error is because you combined those two "repeat" characters in your regular expression. To fix it, just decide which one you actually meant to use, and delete the other.
