I am writing a simple Python program which grabs a webpage and finds all the URL links in it. However, when I try to index the starting and ending delimiters (") of each href link, the ending delimiter always comes out indexed wrong.
# open a url and find all the links in it
import urllib2
url=urllib2.urlopen('right.html')
urlinfo = url.info()
urlcontent = url.read()
bodystart = urlcontent.index('<body')
print 'body starts at',bodystart
bodycontent = urlcontent[bodystart:].lower()
print bodycontent
linklist = []
n = bodycontent.index('<a href=')
while n:
    print n
    bodycontent = bodycontent[n:]
    a = bodycontent.index('"')
    b = bodycontent[(a+1):].index('"')
    print a, b
    linklist.append(bodycontent[(a+1):b])
    n = bodycontent[b:].index('<a href=')
print linklist
I would suggest using an HTML parsing library instead of manually searching the DOM string.
Beautiful Soup is an excellent library for this purpose; see its reference documentation.
With Beautiful Soup, your link-searching code could look like this:
from bs4 import BeautifulSoup
soup = BeautifulSoup(bodycontent, 'html.parser')
linklist = [a.get('href') for a in soup.find_all('a')]
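For reference, a minimal end-to-end sketch of that approach (assuming Python 2, as in the question, and that 'right.html' is an address urllib2 can actually open, e.g. a full http:// or file:// URL):

import urllib2
from bs4 import BeautifulSoup

# fetch the page the same way the original program does
url = urllib2.urlopen('right.html')
soup = BeautifulSoup(url.read(), 'html.parser')

# href=True skips anchor tags that have no href attribute at all
linklist = [a['href'] for a in soup.find_all('a', href=True)]
print linklist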
Okay, so I am doing a course on Python and the assignment asks us to retrieve data from an HTML document.
Here is what I came up with:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import ssl
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
intlist = list()
tot = 0
count = 0
url = input('Enter - ')
html = urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, "html.parser")
tags = soup('span')
for tag in tags:
    n = tag.contents[0]
    n = int(n)
    count += 1
    tot = tot + n
print("Count:", n)
print("Total:", tot)
And this is what happens when I try to access the file (NOTE: the file I am trying to retrieve is stored locally):
What is the cause of this error?
Thanks to anyone who can help.
You're supposed to read the HTML directly into BeautifulSoup. You cannot open a plain local file path with urlopen that easily; it expects a URL.
from bs4 import BeautifulSoup
...
with open('filename.html', 'r') as htmlfile:
    html = htmlfile.read()
soup = BeautifulSoup(html, 'html.parser')
Now it's loaded for you to parse; don't forget to change filename.html to your actual file path.
Edit: There are also other problems with your code (for example, the final print reports n rather than count). Note that soup('span') is shorthand for soup.find_all('span'), so that part does find span elements; please refer to the docs for at least a basic understanding.
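Putting the pieces together, a minimal corrected sketch of the assignment script (assuming, as in the original code, that the numbers to sum live inside <span> tags, and with filename.html standing in for your real path):

from bs4 import BeautifulSoup

# open the local file directly instead of going through urlopen
with open('filename.html', 'r') as htmlfile:
    soup = BeautifulSoup(htmlfile.read(), 'html.parser')

tot = 0
count = 0
for tag in soup('span'):        # shorthand for soup.find_all('span')
    n = int(tag.contents[0])
    count += 1
    tot += n

print("Count:", count)
print("Total:", tot)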
import requests

focus_Search = raw_input("Focus Search ")
url = "https://www.google.com/search?q="
res = requests.get(url + focus_Search)
print("You Just Searched")
res_String = res.text
#Now I must get ALL the sections of code that start with "<a href" and end with "/a>"
I'm trying to scrape all the links from a Google search results page. I could extract each link one at a time, but I'm sure there's a better way to do it.
This creates a list of all links on the search page, reusing some of your code, without getting into BeautifulSoup:
import requests
import lxml.html

focus_Search = input("Focus Search ")
url = "https://www.google.com/search?q="
res = requests.get(url + focus_Search).content

dom = lxml.html.fromstring(res)
# Borrows from cheekybastard in the link below:
# http://stackoverflow.com/questions/1080411/retrieve-links-from-web-page-using-python-and-beautifulsoup
links = [x for x in dom.xpath('//a/@href')]
print(links)
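For comparison, a rough equivalent with BeautifulSoup (a sketch, assuming requests and bs4 are installed) could look like this:

import requests
from bs4 import BeautifulSoup

focus_Search = input("Focus Search ")
url = "https://www.google.com/search?q="

res = requests.get(url + focus_Search)
soup = BeautifulSoup(res.text, 'html.parser')

# collect every href on the results page, skipping anchors without one
links = [a['href'] for a in soup.find_all('a', href=True)]
print(links)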
So I have Beautiful Soup code which visits the main page of a website and scrapes the links there. However, when I get the links in Python, I can't seem to clean up each link (after it's converted to a string) for concatenation with the root URL.
import re
import requests
import bs4

list1 = []

def get_links():
    regex3 = re.compile('/[a-z\-]+/[a-z\-]+')
    response = requests.get('http://noisetrade.com')
    soup = bs4.BeautifulSoup(response.text)
    links = soup.select('div.grid_info a[href]')
    for link in links:
        lk = link.get('href')
        prtLk = regex3.findall(lk)
        list1.append(prtLk)

def visit_pages():
    url1 = str(list1[1])
    print(url1)

get_links()
visit_pages()
This produces the output: "['/stevevantinemusic/unsolicited-material']"
Desired output: "/stevevantinemusic/unsolicited-material"
I have tried .strip(), .replace(), and re.sub/match/etc. I can't seem to isolate the characters [, ', and ], which are the ones I need removed. I had iterated through it with sub-strings, but that feels inefficient. I'm sure I'm missing something obvious.
findall returns a list of results, so you can either write:
for link in links:
    lk = link.get('href')
    urls = regex3.findall(lk)
    if urls:
        prtLk = urls[0]
        list1.append(prtLk)
or, better, use the search method:
for link in links:
    lk = link.get('href')
    m = regex3.search(lk)
    if m:
        prtLk = m.group()
        list1.append(prtLk)
Those brackets were the result of converting a list with one element to a string.
For example:
l = ['text']
str(l)
results in:
"['text']"
Here I use the regexp r'[\[\'\]]' to replace any of the unwanted characters with the empty string:
$ cat pw.py
import re

def visit_pages():
    url1 = "['/stevevantinemusic/unsolicited-material']"
    url1 = re.sub(r'[\[\'\]]', '', url1)
    print(url1)

visit_pages()
$ python pw.py
/stevevantinemusic/unsolicited-material
Here is an example of what I think you are trying to do:
>>> import bs4
>>> with open('noise.html', 'r') as f:
...     lines = f.read()
...
>>> soup = bs4.BeautifulSoup(lines)
>>> root_url = 'http://noisetrade.com'
>>> for link in soup.select('div.grid_info a[href]'):
...     print(root_url + link.get('href'))
...
http://noisetrade.com/stevevantinemusic
http://noisetrade.com/stevevantinemusic/unsolicited-material
http://noisetrade.com/jessicarotter
http://noisetrade.com/jessicarotter/winter-sun
http://noisetrade.com/geographermusic
http://noisetrade.com/geographermusic/live-from-the-el-rey-theatre
http://noisetrade.com/kaleo
http://noisetrade.com/kaleo/all-the-pretty-girls-ep
http://noisetrade.com/aviddancer
http://noisetrade.com/aviddancer/an-introduction
http://noisetrade.com/thinkr
http://noisetrade.com/thinkr/quiet-kids-ep
http://noisetrade.com/timcaffeemusic
http://noisetrade.com/timcaffeemusic/from-conversations
http://noisetrade.com/pearl
http://noisetrade.com/pearl/hello
http://noisetrade.com/staceyrandolmusic
http://noisetrade.com/staceyrandolmusic/fables-noisetrade-sampler
http://noisetrade.com/sleepyholler
http://noisetrade.com/sleepyholler/sleepy-holler
http://noisetrade.com/sarahmcgowanmusic
http://noisetrade.com/sarahmcgowanmusic/indian-summer
http://noisetrade.com/briandunne
http://noisetrade.com/briandunne/songs-from-the-hive
Remember also that bs4 uses its own types (bs4.element.Tag, NavigableString, etc.) rather than plain strings.
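For instance, a quick check (a sketch, reusing the soup object from above) shows the difference:

link = soup.select('div.grid_info a[href]')[0]
print(type(link))              # bs4.element.Tag, not a plain string
print(type(link.get('href')))  # the attribute value itself is an ordinary str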
A good way to debug your scripts would be to place:
for link in links:
    import pdb; pdb.set_trace()  # the script will stop for debugging here
    lk = link.get('href')
    prtLk = regex3.findall(lk)
    list1.append(prtLk)
Anywhere you want to debug.
And then you could do something like this within pdb:
next
l
print(type(lk))
print(links)
dir()
dir(links)
dir(lk)
I'm trying to parse a web page, and this is my code:
from bs4 import BeautifulSoup
import urllib2
openurl = urllib2.urlopen("http://pastebin.com/archive/Python")
read = BeautifulSoup(openurl.read())
soup = BeautifulSoup(openurl)
x = soup.find('ul', {"class": "i_p0"})
sp = soup.findAll('a href')
for x in sp:
    print x
I really wish I could be more specific, but as the title says, it gives me no response. No errors, nothing.
First of all, omit the line read = BeautifulSoup(openurl.read()). Calling openurl.read() exhausts the response, so the later BeautifulSoup(openurl) has nothing left to parse, which is why you get no output.
Also, the line x = soup.find('ul', {"class": "i_p0"}) doesn't actually make any difference, because you reuse the x variable in the loop.
Also, soup.findAll('a href') doesn't find anything: findAll() expects a tag name (optionally plus attributes), not a string like 'a href'.
Also, instead of the old-fashioned findAll(), BeautifulSoup 4 provides find_all().
Here's the code with several alterations:
from bs4 import BeautifulSoup
import urllib2
openurl = urllib2.urlopen("http://pastebin.com/archive/Python")
soup = BeautifulSoup(openurl)
sp = soup.find_all('a')
for x in sp:
    print x['href']
This prints the value of the href attribute for every link on the page.
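If the intention behind x = soup.find('ul', {"class": "i_p0"}) was to restrict the search to that list element, a minimal sketch (assuming such an element exists on the page) would be:

container = soup.find('ul', {"class": "i_p0"})
if container is not None:
    for a in container.find_all('a'):
        print a.get('href')  # only links inside the <ul class="i_p0"> element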
Hope that helps.
I altered a couple of lines in your code and I do get a response; not sure if that is what you want, though.
Here:
openurl = urllib2.urlopen("http://pastebin.com/archive/Python")
soup = BeautifulSoup(openurl.read())  # This is what you need to use for selecting elements
# soup = BeautifulSoup(openurl)  # This is not needed
# x = soup.find('ul', {"class": "i_p0"})  # You don't seem to be making use of this either
sp = soup.findAll('a')
for x in sp:
    print x.get('href')  # This is to get the href
Hope this helps.
I've made a web crawler which gives a link and the text from that link for all pages under a given address. It looks like this:
import urllib
from bs4 import BeautifulSoup
import urlparse
import mechanize

url = ["http://adbnews.com/area51"]

for u in url:
    br = mechanize.Browser()
    urls = [u]
    visited = [u]
    i = 0
    while i < len(urls):
        try:
            br.open(urls[0])
            urls.pop(0)
            for link in br.links():
                levelLinks = []
                linkText = []
                newurl = urlparse.urljoin(link.base_url, link.url)
                b1 = urlparse.urlparse(newurl).hostname
                b2 = urlparse.urlparse(newurl).path
                newurl = "http://" + b1 + b2
                linkTxt = link.text
                linkText.append(linkTxt)
                levelLinks.append(newurl)
                if newurl not in visited and urlparse.urlparse(u).hostname in newurl:
                    urls.append(newurl)
                    visited.append(newurl)
                    #print newurl
                #get Mechanize Links
                for l, lt in zip(levelLinks, linkText):
                    print newurl, "\n", lt, "\n"
        except:
            urls.pop(0)
It gets results like this:
http://www.adbnews.com/area51/contact.html
CONTACT
http://www.adbnews.com/area51/about.html
ABOUT
http://www.adbnews.com/area51/index.html
INDEX
http://www.adbnews.com/area51/1st/
FIRST LEVEL!
http://www.adbnews.com/area51/1st/bling.html
BLING
http://www.adbnews.com/area51/1st/index.html
INDEX
http://adbnews.com/area51/2nd/
2ND LEVEL
And I want to add a counter of some kind which could limit how deep the crawler goes.
I've tried adding, for example, steps = 3 and changing while i<len(urls) to while i<steps:,
but that only goes to the first level even though the number says 3.
Any advice is welcome.
If you want to search to a certain "depth", consider using a recursive function instead of just appending to a list of URLs.
def crawl(url, depth):
    if depth <= 3:
        # Scan page, grab links, title
        for link in links:
            print crawl(link, depth + 1)
        return url + "\n" + title
This allows for easier control of your recursive searching, as well as being faster and less resource heavy :)
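For reference, a more complete sketch along those lines (assuming Python 2 with requests and BeautifulSoup, and keeping the same-host check from the original loop; the names here are illustrative, not part of the original code):

import urlparse

import requests
from bs4 import BeautifulSoup

def crawl(url, depth, max_depth=3, visited=None):
    # recursively collect links, stopping once max_depth is reached
    if visited is None:
        visited = set()
    if depth > max_depth or url in visited:
        return
    visited.add(url)
    try:
        html = requests.get(url).text
    except requests.RequestException:
        return
    soup = BeautifulSoup(html, 'html.parser')
    for a in soup.find_all('a', href=True):
        newurl = urlparse.urljoin(url, a['href'])
        # stay on the same host, as the original while loop does
        if urlparse.urlparse(newurl).hostname == urlparse.urlparse(url).hostname:
            print newurl, "\n", a.get_text(), "\n"
            crawl(newurl, depth + 1, max_depth, visited)

crawl("http://adbnews.com/area51", 1)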