Scrape specific urls from a page and convert them to absolute urls - python

I need some help from you Pythonists: I'm scraping all urls starting with "details.php?" from this page and ignoring all other urls.
Then I need to convert every url I just scraped to an absolute url, so I can scrape them one by one. The absolute urls start with: http://evenementen.uitslagen.nl/2013/marathonrotterdam/details.php?...
I tried using re.findall like this:
import re
import scraperwiki

html = scraperwiki.scrape(url)
if html is not None:
    endofurl = re.findall(r"details\.php\?(.*?)>", html)
This gets me a list, but then I get stuck. Can anybody help me out?

You can use urlparse.urljoin() to create the full urls:
>>> import urlparse
>>> base_url = 'http://evenementen.uitslagen.nl/2013/marathonrotterdam/'
>>> urlparse.urljoin(base_url, 'details.php?whatever')
'http://evenementen.uitslagen.nl/2013/marathonrotterdam/details.php?whatever'
You can use a list comprehension to do this for all of your urls:
full_urls = [urlparse.urljoin(base_url, url) for url in endofurl]
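On Python 3 the same function lives in urllib.parse instead of urlparse; for example:
>>> from urllib.parse import urljoin
>>> urljoin('http://evenementen.uitslagen.nl/2013/marathonrotterdam/', 'details.php?whatever')
'http://evenementen.uitslagen.nl/2013/marathonrotterdam/details.php?whatever'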

If you need the final urls only one at a time and are done with them afterwards, you can use a generator expression instead of building a list:
abs_url = 'http://evenementen.uitslagen.nl/2013/marathonrotterdam/details.php?'
urls = (abs_url + url for url in endofurl)
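To actually fetch each page you can then consume the generator one url at a time; a minimal sketch, assuming you are still using scraperwiki.scrape() from the question and filling in your own parsing step:
import scraperwiki

for full_url in urls:
    detail_html = scraperwiki.scrape(full_url)   # fetch one details page
    # ... parse detail_html here ...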
If you are worried about characters that need escaping in the urls, you can run each one through urllib.quote() (urllib.urlencode() is meant for building a query string from a dict of parameters, not for quoting an existing url).

Ah! My favorite...list comprehensions!
base_url = 'http://evenementen.uitslagen.nl/2013/marathonrotterdam/{0}'
urls = [base_url.format(x) for x in list_of_things_you_scraped]
I'm not a regex genius, so you may need to fiddle with base_url until you get it exactly right.

If you'd like to use lxml.html to parse html; there is .make_links_absolute():
import lxml.html
html = lxml.html.make_links_absolute(html,
    base_href="http://evenementen.uitslagen.nl/2013/marathonrotterdam/")
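If you then want just the details.php? links out of the absolutized html, one possible follow-up (a sketch, not tested against that page) is to parse it and filter the links lxml finds:
doc = lxml.html.fromstring(html)
detail_urls = [href for element, attribute, href, pos in doc.iterlinks()
               if 'details.php?' in href]   # keep only the result-detail links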

Related

How to get URL out of href that is itself a hyperlink?

I'm using Python and lxml to try to scrape this html page. The problem I'm running into is trying to get the URL out of this hyperlink, where all I seem to get back is "Chapter02a". The list item looks like this:
<li><a href="Chapter02a">Examples of Operations</a></li>
I have tried
//ol[@id="ProbList"]/li/a/@href
but that only gives me the text "Chapter02a".
Also:
//ol[@id="ProbList"]/li/a
This returns an lxml.html.HtmlElement object, and none of the properties that I found in the documentation accomplish what I'm trying to do.
from lxml import html
import requests
chapter_req = requests.get('https://www.math.wisc.edu/~mstemper2/Math/Pinter/Chapter02')
chapter_html = html.fromstring(chapter_req.content)
sections = chapter_html.xpath('//ol[@id="ProbList"]/li/a/@href')
print(sections[0])
I want sections to be a list of URLs to the subsections.
The return you are seeing is correct because Chapter02a is a "relative" link to the next section. The full url is not listed because that is not how it is stored in the html.
To get the full urls you can use:
url_base = 'https://www.math.wisc.edu/~mstemper2/Math/Pinter/'
sections = chapter_html.xpath('//ol[@id="ProbList"]/li/a/@href')
section_urls = [url_base + s for s in sections]
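If the hrefs ever come back as something other than bare names (say ./Chapter02A or ../Other), urljoin() is a safer way to build the full urls than plain concatenation; a sketch reusing the sections list above:
from urllib.parse import urljoin   # Python 2: from urlparse import urljoin

section_urls = [urljoin('https://www.math.wisc.edu/~mstemper2/Math/Pinter/Chapter02', s)
                for s in sections]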
You can also do the concatenation directly at the XPATH level to regenerate the URL from the relative link:
from lxml import html
import requests
chapter_req = requests.get('https://www.math.wisc.edu/~mstemper2/Math/Pinter/Chapter02')
chapter_html = html.fromstring(chapter_req.content)
sections = chapter_html.xpath('concat("https://www.math.wisc.edu/~mstemper2/Math/Pinter/",//ol[@id="ProbList"]/li/a/@href)')
print(sections)
output:
https://www.math.wisc.edu/~mstemper2/Math/Pinter/Chapter02A

sanitize && build url

Is it possible, with BeautifulSoup (Python), to extract absolute urls instead of relative urls from a web page?
For example, when I scrape http://bing.com and ask for the href of each a link:
for link in soup.findAll('a'):
It returns relative as well as absolute urls:
http://bing.com/?scope=web&FORM=Z9LH
/maps/?FORM=Z9LH3
/news?FORM=Z9LH4
/explore?FORM=Z9LH5
/profile/history?FORM=Z9LH6
http://fr.msn.com/
http://www.office.com?WT.mc_id=O16_BingHP
Many thanks.
If you want to only match absolute URLs, the simplest way to do it would be to use a CSS selector:
soup.select("a[href^=http]")
Here ^= means "starts with".
If you want to locate all the links and make absolute URLs out of relative URLs, use urljoin():
from urlparse import urljoin
# Python 3: from urllib.parse import urljoin
base_url = "http://bing.com"
for link in soup.find_all("a", href=True):
    absolute_url = urljoin(base_url, link["href"])
    print(absolute_url)
Note that if the URL is already absolute, urljoin() would leave it as is.
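For example, with some of the links from the question:
>>> urljoin("http://bing.com", "/maps/?FORM=Z9LH3")
'http://bing.com/maps/?FORM=Z9LH3'
>>> urljoin("http://bing.com", "http://fr.msn.com/")
'http://fr.msn.com/'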
Use filter() and lambdas.
urlList = filter(lambda aTag: aTag['href'].startswith('http'), soup('a'))
should do the trick.
In short, check whether the 'href' attribute of your links starts with the string 'http'.
If you want to recreate absolute URLs from the relatives ones, you can do this:
urlThatCurrentlyScraping = 'http://bing.com/something/...'
for link in soup('a'):
    if not link['href'].startswith('http'):
        fixedLinkHref = urlThatCurrentlyScraping + link['href']
    else:
        fixedLinkHref = link['href']
    # do something

Searching Large String for file path. Return filepath + filename

I've got a little project where I’m trying to download a series of wallpapers from a web page. I'm new to python.
I'm using the urllib library, which is returning a long string of web page data which includes
<a href="http://website.com/wallpaper/filename.jpg">
I know that every filename I need to download has
'http://website.com/wallpaper/'
How can I search the page source for this portion of text and return the rest of the image link, ending with the ".jpg" extension?
r'http://website.com/wallpaper/ xxxxxx .jpg'
I'm thinking I could write a regular expression where the xxxxxx portion is a wildcard: just check for the path and the .jpg extension, then return the whole string once a match is found.
Am I on the right track?
BeautifulSoup is pretty convenient for this sort of thing.
import re
import urllib3
from bs4 import BeautifulSoup

jpg_regex = re.compile(r'\.jpg$')
site_regex = re.compile(r'website\.com/wallpaper/')

pool = urllib3.PoolManager()
request = pool.request('GET', 'http://your_website.com/')
soup = BeautifulSoup(request.data)

jpg_list = soup.find_all(name='a', attrs={'href': jpg_regex})
site_list = soup.find_all(name='a', attrs={'href': site_regex})

# keep only the hrefs that pass both filters
result_list = [a.get('href') for a in jpg_list if a in site_list]
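If the end goal is to actually download the matched wallpapers, here is a minimal follow-up sketch (it reuses the pool object above; deriving the local filename from the url is just one possible choice):
import os

for href in result_list:
    filename = os.path.basename(href)   # e.g. "filename.jpg"
    image = pool.request('GET', href)   # fetch the image bytes
    with open(filename, 'wb') as f:
        f.write(image.data)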
I think a very basic regex will do.
Like:
(http:\/\/website\.com\/wallpaper\/[\w\d_-]*?\.jpg)
and if you use $1, this will return the whole string.
And if you use
(http:\/\/website\.com\/wallpaper\/([\w\d_-]*?)\.jpg)
then $1 will give the whole string and $2 will give the file name only.
Note: whether the forward slashes need escaping (\/) is language dependent; in Python regex patterns the / does not need to be escaped.
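In Python that might look roughly like this (page_source stands for the html string you already fetched with urllib; the name is just a placeholder):
import re

pattern = re.compile(r'(http://website\.com/wallpaper/([\w\d_-]*?)\.jpg)')
matches = pattern.findall(page_source)   # list of (full_url, file_name) tuples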
Don't use a regular expression against HTML.
Instead, use a HTML parsing library.
BeautifulSoup is a library for parsing HTML and urllib2 is a built-in module for fetching URLs
import urllib2
from bs4 import BeautifulSoup as bs

content = urllib2.urlopen('http://website.com/wallpaper/index.html').read()
html = bs(content)
links = []  # an empty list
for link in html.find_all('a'):
    href = link.get('href')
    if href and '/wallpaper/' in href:   # skip anchors without an href
        links.append(href)
Search for the "http://website.com/wallpaper/" substring in url and then check for ".jpg" in url, as shown below:
domain = "http://website.com/wallpaper/"
url = "your URL"
format = ".jpg"
if domain in url and format in url:
    # do something
    pass

How to filter and take only one download link?

I have this code:
import urllib
from bs4 import BeautifulSoup
url = "http://www.microsoft.com/en-us/download/confirmation.aspx?id=17851"
pageurl = urllib.urlopen(url)
soup = BeautifulSoup(pageurl)
for d in soup.select("p.start-download [href]"):
    print d['href']
When I run this code, it gives me many download links.
How can I take only one of the download links given?
If you use your given code, you will not be able to take hold of the links and use them. Use the following code instead:
import urllib
from bs4 import BeautifulSoup
url = "http://www.microsoft.com/en-us/download/confirmation.aspx?id=17851"
pageurl = urllib.urlopen(url)
soup = BeautifulSoup(pageurl)
urls = []
for d in soup.select("p.start-download [href]"):
    urls.append(d.attrs['href'])
print urls[0]
If you use the above code, then you can use the links themselves in other parts of the program. You may also do this using a list comprehension:
urls = [d['href'] for d in soup.select("p.start-download [href]")]
print urls[0]
You can then iterate over urls to get the url you want, or just use an index to get your link. Either way, this is more flexible than just printing a link. For example, if you did not want the full installation and just wanted some other package, or perhaps the package for XP instead of Vista, 7 and 8 (using your urls here as an example), you could filter the list, as sketched below.
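For instance, a quick filter over the list (the 'x64' substring is purely an illustration, not something taken from the real download page):
# keep only the links whose filename mentions a given keyword
wanted = [u for u in urls if 'x64' in u]
print wanted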
for d in soup.select("p.start-download [href]"):
    print d['href']
    break
This will stop after printing the first link.
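If you only ever need the first match, an equivalent one-liner is next() with a generator expression (it returns None when nothing matches):
first_link = next((d['href'] for d in soup.select("p.start-download [href]")), None)
print first_link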

Python - Easiest way to scrape text from list of URLs using BeautifulSoup

What's the easiest way to scrape just the text from a handful of webpages (using a list of URLs) using BeautifulSoup? Is it even possible?
Best,
Georgina
import urllib2
import BeautifulSoup
import re
Newlines = re.compile(r'[\r\n]\s+')
def getPageText(url):
    # given a url, get page content
    data = urllib2.urlopen(url).read()
    # parse as html structured document
    bs = BeautifulSoup.BeautifulSoup(data, convertEntities=BeautifulSoup.BeautifulSoup.HTML_ENTITIES)
    # kill javascript content
    for s in bs.findAll('script'):
        s.replaceWith('')
    # find body and extract text
    txt = bs.find('body').getText('\n')
    # remove multiple linebreaks and whitespace
    return Newlines.sub('\n', txt)

def main():
    urls = [
        'http://www.stackoverflow.com/questions/5331266/python-easiest-way-to-scrape-text-from-list-of-urls-using-beautifulsoup',
        'http://stackoverflow.com/questions/5330248/how-to-rewrite-a-recursive-function-to-use-a-loop-instead',
    ]
    txt = [getPageText(url) for url in urls]

if __name__ == "__main__":
    main()
It now removes javascript and decodes html entities.
It is perfectly possible. The easiest way is to iterate through the list of URLs, load the content, find the URLs, and add them to the main list. Stop the iteration when enough pages are found (a rough sketch follows the tips below).
Just some tips:
urllib2.urlopen for fetching content
BeautifulSoup: findAll('a') for finding URLs
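A rough sketch of that loop, sticking with the urllib2 / BeautifulSoup 3 style of the answer above (the urls list is only a placeholder):
import urllib2
import BeautifulSoup

urls = ['http://example.com/page1', 'http://example.com/page2']   # your own list of URLs

found_links = []
for url in urls:
    data = urllib2.urlopen(url).read()                                    # fetch one page
    soup = BeautifulSoup.BeautifulSoup(data)                              # parse it
    found_links.extend(a['href'] for a in soup.findAll('a', href=True))   # collect its links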
I know that it is not an answer to your exact question (about BeautifulSoup), but a good idea is to have a look at Scrapy, which seems to fit your needs.
