How can I create a script that manufactures MLA citations? - python

I have a folder full of Windows .URL files. I'd like to translate them into a list of MLA citations for my paper.
Is this a good application of Python? How can I get the page titles? I'm on Windows XP with Python 3.1.1.

This is a fantastic use for Python! The .URL file format has a syntax like this:
[InternetShortcut]
URL=http://www.example.com/
OtherStuff=irrelevant
To parse your .URL files, start with ConfigParser, which will read this and make an InternetShortcut section that you can read the URL from. Once you have a list of URLs, you can then use urllib or urllib2 to load the URL, and use a dumb regex to get the page title (or BeautifulSoup as Alex suggests).
Once you have that, you have a list of URLs and page titles...not enough for a full MLA citation, but should be enough to get you started, no?
Something like this (very rough, coding in the SO window):
from glob import glob
from urllib2 import urlopen
from ConfigParser import ConfigParser
from re import search
# I use RE here, you might consider BeautifulSoup because RE can be stupid
TITLE = r"<title>([^<]+)</title>"
result = []
for file in glob("*.url"):
    # ConfigParser is the class itself here (imported by name above)
    config = ConfigParser()
    config.read(file)
    url = config.get("InternetShortcut", "URL")
    # Get the title
    page = urlopen(url).read()
    try:
        title = search(TITLE, page).groups()[0]
    except AttributeError:
        # search() returned None: no <title> tag matched
        title = "Couldn't find title"
    result.append((url, title))
for url, title in result:
    print "'%s' <%s>" % (title, url)
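Since the question mentions Python 3, here is a minimal Python 3 sketch of just the parsing step. It assumes Python 3.2 or later (for the interpolation=None argument), and read_shortcut_url is a hypothetical helper name, not part of any library. It fetches no page; the sample shortcut is written to a temporary folder purely for illustration.

```python
import configparser
import os
import tempfile


def read_shortcut_url(path):
    """Return the URL stored in a Windows .URL shortcut file (hypothetical helper)."""
    # interpolation=None keeps '%' characters in URLs from being treated as
    # configparser interpolation syntax.
    config = configparser.ConfigParser(interpolation=None)
    config.read(path)
    return config.get("InternetShortcut", "URL")


# Self-contained demo: write a sample shortcut to a temporary folder and read it back.
with tempfile.TemporaryDirectory() as folder:
    sample = os.path.join(folder, "example.url")
    with open(sample, "w") as handle:
        handle.write("[InternetShortcut]\nURL=http://www.example.com/?q=a%20b\n")
    shortcut_url = read_shortcut_url(sample)

print(shortcut_url)
```

In a real run you would loop over glob("*.url") and call the helper on each path.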

Given a file that contains an HTML page, you can parse it to extract its title, and BeautifulSoup is the recommended third-party library for the job. Get the BeautifulSoup version compatible with Python 3.1 here, install it, then:
parse each file's contents into a soup object e.g. with:
from BeautifulSoup import BeautifulSoup
html = open('thefile.html', 'r').read()
soup = BeautifulSoup(html)
get the title tag, if any, and print its string contents (if any):
title = soup.find('title')
if title is None: print('No title!')
else: print('Title: ' + title.string)
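If installing a third-party package is not an option, the same title lookup can be sketched with the standard library's html.parser module instead of BeautifulSoup. This is a swapped-in technique, not what the answer above uses; TitleParser and extract_title are illustrative names, and the sketch is far less forgiving of broken HTML than BeautifulSoup.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text inside the first <title> element (illustrative class)."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.done = True

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)


def extract_title(html):
    """Return the page title, or None if no title text is found."""
    parser = TitleParser()
    parser.feed(html)
    title = "".join(parser.parts).strip()
    return title or None


print(extract_title("<html><head><title>Example Domain</title></head></html>"))
print(extract_title("<html><body>no title here</body></html>"))
```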

Related

Using Beautiful Soup to Scrape content encoded in unicode? [duplicate]

How can I retrieve the links of a webpage and copy the url address of the links using Python?
Here's a short snippet using the SoupStrainer class in BeautifulSoup:
import httplib2
from bs4 import BeautifulSoup, SoupStrainer
http = httplib2.Http()
status, response = http.request('http://www.nytimes.com')
for link in BeautifulSoup(response, parse_only=SoupStrainer('a')):
    if link.has_attr('href'):
        print(link['href'])
The BeautifulSoup documentation is actually quite good, and covers a number of typical scenarios:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
Edit: Note that I used the SoupStrainer class because it's a bit more efficient (memory and speed wise), if you know what you're parsing in advance.
For completeness sake, the BeautifulSoup 4 version, making use of the encoding supplied by the server as well:
from bs4 import BeautifulSoup
import urllib.request
parser = 'html.parser' # or 'lxml' (preferred) or 'html5lib', if installed
resp = urllib.request.urlopen("http://www.gpsbasecamp.com/national-parks")
soup = BeautifulSoup(resp, parser, from_encoding=resp.info().get_param('charset'))
for link in soup.find_all('a', href=True):
    print(link['href'])
or the Python 2 version:
from bs4 import BeautifulSoup
import urllib2
parser = 'html.parser' # or 'lxml' (preferred) or 'html5lib', if installed
resp = urllib2.urlopen("http://www.gpsbasecamp.com/national-parks")
soup = BeautifulSoup(resp, parser, from_encoding=resp.info().getparam('charset'))
for link in soup.find_all('a', href=True):
    print link['href']
and a version using the requests library, which as written will work in both Python 2 and 3:
from bs4 import BeautifulSoup
from bs4.dammit import EncodingDetector
import requests
parser = 'html.parser' # or 'lxml' (preferred) or 'html5lib', if installed
resp = requests.get("http://www.gpsbasecamp.com/national-parks")
http_encoding = resp.encoding if 'charset' in resp.headers.get('content-type', '').lower() else None
html_encoding = EncodingDetector.find_declared_encoding(resp.content, is_html=True)
encoding = html_encoding or http_encoding
soup = BeautifulSoup(resp.content, parser, from_encoding=encoding)
for link in soup.find_all('a', href=True):
    print(link['href'])
The soup.find_all('a', href=True) call finds all <a> elements that have an href attribute; elements without the attribute are skipped.
BeautifulSoup 3 stopped development in March 2012; new projects really should use BeautifulSoup 4, always.
Note that you should leave decoding the HTML from bytes to BeautifulSoup. You can pass BeautifulSoup the character set found in the HTTP response headers to assist in decoding, but that value can be wrong and can conflict with the <meta> information found in the HTML itself, which is why the above uses the BeautifulSoup internal class method EncodingDetector.find_declared_encoding() to make sure that such embedded encoding hints win over a misconfigured server.
With requests, the response.encoding attribute defaults to Latin-1 if the response has a text/* mimetype, even if no character set was returned. This is consistent with the HTTP RFCs but painful when used with HTML parsing, so you should ignore that attribute when no charset is set in the Content-Type header.
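The precedence described above can be written down as a small pure function. This is only a sketch of the decision logic, with no network access; choose_encoding and its parameter names are hypothetical and not part of requests or BeautifulSoup.

```python
def choose_encoding(content_type, http_encoding, html_encoding):
    """Pick the encoding hint to pass on to the HTML parser (hypothetical helper).

    content_type  -- value of the Content-Type response header ('' if absent)
    http_encoding -- encoding reported by the HTTP client library
    html_encoding -- encoding declared inside the HTML itself, or None
    """
    # Only trust the HTTP-level encoding if the server actually sent a charset.
    if "charset" not in content_type.lower():
        http_encoding = None
    # An encoding declared in the document wins over the header.
    return html_encoding or http_encoding


print(choose_encoding("text/html", "ISO-8859-1", None))
print(choose_encoding("text/html; charset=utf-8", "utf-8", None))
print(choose_encoding("text/html; charset=ISO-8859-1", "ISO-8859-1", "utf-8"))
```

When the function returns None, the parser is left to detect the encoding on its own.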
Others have recommended BeautifulSoup, but it's much better to use lxml. Despite its name, it is also for parsing and scraping HTML. It's much, much faster than BeautifulSoup, and it even handles "broken" HTML better than BeautifulSoup (their claim to fame). It has a compatibility API for BeautifulSoup too if you don't want to learn the lxml API.
Ian Bicking agrees.
There's no reason to use BeautifulSoup anymore, unless you're on Google App Engine or something where anything not purely Python isn't allowed.
lxml.html also supports CSS3 selectors so this sort of thing is trivial.
An example with lxml and xpath would look like this:
import urllib
import lxml.html
connection = urllib.urlopen('http://www.nytimes.com')
dom = lxml.html.fromstring(connection.read())
for link in dom.xpath('//a/@href'): # select the url in href for all a tags(links)
    print link
import urllib2
import BeautifulSoup
request = urllib2.Request("http://www.gpsbasecamp.com/national-parks")
response = urllib2.urlopen(request)
soup = BeautifulSoup.BeautifulSoup(response)
for a in soup.findAll('a'):
    # .get() avoids a KeyError on <a> tags that have no href attribute
    if 'national-park' in a.get('href', ''):
        print 'found a url with national-park in the link'
The following code is to retrieve all the links available in a webpage using urllib2 and BeautifulSoup4:
import urllib2
from bs4 import BeautifulSoup
url = urllib2.urlopen("http://www.espncricinfo.com/").read()
soup = BeautifulSoup(url)
for line in soup.find_all('a'):
    print(line.get('href'))
Links can be within a variety of attributes so you could pass a list of those attributes to select.
For example, with src and href attributes (here I am using the starts with ^ operator to specify that either of these attributes values starts with http):
from bs4 import BeautifulSoup as bs
import requests
r = requests.get('https://stackoverflow.com/')
soup = bs(r.content, 'lxml')
links = [item['href'] if item.get('href') is not None else item['src'] for item in soup.select('[href^="http"], [src^="http"]') ]
print(links)
Attribute = value selectors
[attr^=value]
Represents elements with an attribute name of attr whose value is prefixed (preceded) by value.
There are also the commonly used $ (ends with) and * (contains) operators. For a full syntax list see the link above.
BeautifulSoup can use lxml as its parser under the hood. Requests, lxml and list comprehensions make a killer combo.
import requests
import lxml.html
dom = lxml.html.fromstring(requests.get('http://www.nytimes.com').content)
[x for x in dom.xpath('//a/@href') if '//' in x and 'nytimes.com' not in x]
In the list comp, the condition "if '//' in x and 'nytimes.com' not in x" is a simple way to scrub the URL list of the site's 'internal' navigation URLs, etc.
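The substring test above is deliberately crude. A slightly stricter variant, sketched here with the standard library only, compares host names instead; is_external is a hypothetical helper and the sample links are made up for illustration.

```python
from urllib.parse import urlparse


def is_external(link, site_domain):
    """Return True if link is absolute and points outside site_domain."""
    host = urlparse(link).hostname
    if host is None:
        # Relative links such as '/section/world' have no host: internal.
        return False
    return host != site_domain and not host.endswith("." + site_domain)


links = [
    "/section/world",
    "https://www.nytimes.com/section/us",
    "https://example.org/story",
    "//cdn.example.net/lib.js",
]
external = [link for link in links if is_external(link, "nytimes.com")]
print(external)
```

Comparing hosts avoids false negatives when an external URL merely contains the site name somewhere in its path or query string.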
Just for getting the links, without BeautifulSoup and regex:
import urllib2
url="http://www.somewhere.com"
page=urllib2.urlopen(url)
data=page.read().split("</a>")
tag="<a href=\""
endtag="\">"
for item in data:
    if "<a href" in item:
        try:
            ind = item.index(tag)
            item=item[ind+len(tag):]
            end=item.index(endtag)
        except ValueError:
            # index() could not find the tag or its end; skip this chunk
            pass
        else:
            print item[:end]
For more complex operations, of course, BeautifulSoup is still preferred.
This script does what you're looking for, but also resolves the relative links to absolute links.
import urllib
import lxml.html
import urlparse
def get_dom(url):
    connection = urllib.urlopen(url)
    return lxml.html.fromstring(connection.read())
def get_links(url):
    # Pass a list, not a generator: resolve_links() iterates over the links twice
    return resolve_links(list(get_dom(url).xpath('//a/@href')))
def guess_root(links):
    for link in links:
        if link.startswith('http'):
            parsed_link = urlparse.urlparse(link)
            scheme = parsed_link.scheme + '://'
            netloc = parsed_link.netloc
            return scheme + netloc
def resolve_links(links):
    root = guess_root(links)
    for link in links:
        if not link.startswith('http'):
            link = urlparse.urljoin(root, link)
        yield link
for link in get_links('http://www.google.com'):
    print link
To find all the links, we will in this example use the urllib2 module together
with the re module.
*One of the most powerful functions in the re module is "re.findall()".
While re.search() is used to find the first match for a pattern, re.findall() finds all
the matches and returns them as a list of strings, with each string representing one match*
import urllib2
import re
url = "http://www.somewhere.com"  # placeholder; replace with the page to scan
#connect to a URL
website = urllib2.urlopen(url)
#read html code
html = website.read()
#use re.findall to get all the links
#(?:...) is non-capturing, so findall returns plain strings rather than tuples
links = re.findall('"((?:http|ftp)s?://.*?)"', html)
print links
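To make the difference between re.search() and re.findall() concrete, here is a small Python 3 sketch that runs on an inline string instead of a downloaded page. The sample HTML is invented for illustration. It also shows one detail worth knowing: when a pattern contains more than one capturing group, findall returns tuples rather than strings.

```python
import re

html = (
    '<a href="http://example.com/a">A</a> '
    '<a href="https://example.com/b">B</a> '
    '<a href="ftp://example.com/c">C</a>'
)

# (?:...) is a non-capturing group, so this pattern has exactly one capturing group.
pattern = r'"((?:http|ftp)s?://.*?)"'

first = re.search(pattern, html)   # match object for the first hit only
every = re.findall(pattern, html)  # list with one string per hit

# With two capturing groups, findall yields (url, scheme) tuples instead.
pairs = re.findall(r'"((http|ftp)s?://.*?)"', html)

print(first.group(1))
print(every)
print(pairs)
```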
Why not use regular expressions:
import urllib2
import re
url = "http://www.somewhere.com"
page = urllib2.urlopen(url)
page = page.read()
links = re.findall(r"<a.*?\s*href=\"(.*?)\".*?>(.*?)</a>", page)
for link in links:
    print('href: %s, HTML text: %s' % (link[0], link[1]))
Here's an example using @ars's accepted answer and the BeautifulSoup4, requests, and wget modules to handle the downloads.
import requests
import wget
import os
from bs4 import BeautifulSoup, SoupStrainer
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/eeg-mld/eeg_full/'
file_type = '.tar.gz'
response = requests.get(url)
for link in BeautifulSoup(response.content, 'html.parser', parse_only=SoupStrainer('a')):
    if link.has_attr('href'):
        if file_type in link['href']:
            full_path = url + link['href']
            wget.download(full_path)
I found the answer by @Blairg23 working after the following correction (covering the scenario where it failed to work correctly):
for link in BeautifulSoup(response.content, 'html.parser', parse_only=SoupStrainer('a')):
    if link.has_attr('href'):
        if file_type in link['href']:
            full_path = urlparse.urljoin(url, link['href'])  # the urlparse module needs to be imported
            wget.download(full_path)
For Python 3:
urllib.parse.urljoin has to be used in order to obtain the full URL instead.
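A short Python 3 sketch of what urljoin does with the three kinds of href you are likely to meet on such a listing page. The base URL is the one from the answer; the file names and the example.com link are hypothetical.

```python
from urllib.parse import urljoin

base = "https://archive.ics.uci.edu/ml/machine-learning-databases/eeg-mld/eeg_full/"

# A bare file name is resolved relative to the directory in the base URL.
relative = urljoin(base, "sample.tar.gz")
# An href that is already absolute is returned unchanged.
absolute = urljoin(base, "https://example.com/other.tar.gz")
# A root-relative href replaces the whole path of the base URL.
rooted = urljoin(base, "/ml/index.html")

print(relative)
print(absolute)
print(rooted)
```

Plain string concatenation (url + href) only gives the right result in the first of these cases, which is presumably the scenario the correction is about.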
BeautifulSoup's own parser can be slow. It might be more feasible to use lxml, which is capable of parsing directly from a URL (with some limitations mentioned below).
import lxml.html
doc = lxml.html.parse(url)
links = doc.xpath('//a[@href]')
for link in links:
    print link.attrib['href']
The code above will return the links as is, and in most cases they would be relative links or absolute from the site root. Since my use case was to only extract a certain type of links, below is a version that converts the links to full URLs and which optionally accepts a glob pattern like *.mp3. It won't handle single and double dots in the relative paths though, but so far I didn't have the need for it. If you need to parse URL fragments containing ../ or ./ then urlparse.urljoin might come in handy.
NOTE: Direct lxml url parsing doesn't handle loading from https and doesn't do redirects, so for this reason the version below is using urllib2 + lxml.
#!/usr/bin/env python
import sys
import urllib2
import urlparse
import lxml.html
import fnmatch
try:
    import urltools
except ImportError:
    sys.stderr.write('To normalize URLs run: `pip install urltools --user`')
    urltools = None
def get_host(url):
    p = urlparse.urlparse(url)
    return "{}://{}".format(p.scheme, p.netloc)
if __name__ == '__main__':
    url = sys.argv[1]
    host = get_host(url)
    glob_patt = len(sys.argv) > 2 and sys.argv[2] or '*'
    doc = lxml.html.parse(urllib2.urlopen(url))
    links = doc.xpath('//a[@href]')
    for link in links:
        href = link.attrib['href']
        if fnmatch.fnmatch(href, glob_patt):
            # note the comma after 'https://': without it the two strings
            # are concatenated into one prefix that never matches
            if not href.startswith(('http://', 'https://', 'ftp://')):
                if href.startswith('/'):
                    href = host + href
                else:
                    # keep the trailing slash so urljoin treats the parent as a directory
                    parent_url = url.rsplit('/', 1)[0] + '/'
                    href = urlparse.urljoin(parent_url, href)
            if urltools:
                href = urltools.normalize(href)
            print href
The usage is as follows:
getlinks.py http://stackoverflow.com/a/37758066/191246
getlinks.py http://stackoverflow.com/a/37758066/191246 "*users*"
getlinks.py http://fakedomain.mu/somepage.html "*.mp3"
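The filtering and resolving steps of the script can be sketched in Python 3 without any network access. This is an illustration under assumptions, not the author's script: select_links is a hypothetical helper, the hrefs are invented, and the page URL adds a made-up /music/ directory to the fakedomain.mu example. It leans on urllib.parse.urljoin for the whole resolution step, which also takes care of ./ and ../ segments.

```python
from fnmatch import fnmatch
from urllib.parse import urljoin


def select_links(page_url, hrefs, pattern="*"):
    """Yield absolute URLs for the hrefs that match a glob pattern (hypothetical helper)."""
    for href in hrefs:
        if fnmatch(href, pattern):
            # urljoin resolves relative paths, including './' and '../' segments.
            yield urljoin(page_url, href)


hrefs = ["track1.mp3", "../shared/track2.mp3", "/about.html", "./notes.txt"]
mp3_links = list(
    select_links("http://fakedomain.mu/music/somepage.html", hrefs, "*.mp3")
)
print(mp3_links)
```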
There can be many duplicate links together with both external and internal links. To differentiate between the two and just get unique links, use sets:
# Python 3.
# Import the submodules explicitly; a bare "import urllib" does not load them.
import urllib.parse
import urllib.request
from bs4 import BeautifulSoup
url = "http://www.espncricinfo.com/"
resp = urllib.request.urlopen(url)
# Get server encoding per recommendation of Martijn Pieters.
soup = BeautifulSoup(resp, from_encoding=resp.info().get_param('charset'))
external_links = set()
internal_links = set()
for line in soup.find_all('a'):
    link = line.get('href')
    if not link:
        continue
    if link.startswith('http'):
        external_links.add(link)
    else:
        internal_links.add(link)
# Depending on usage, full internal links may be preferred.
full_internal_links = {
    urllib.parse.urljoin(url, internal_link)
    for internal_link in internal_links
}
# Print all unique external and full internal links.
for link in external_links.union(full_internal_links):
    print(link)
import urllib2
from bs4 import BeautifulSoup
a=urllib2.urlopen('http://dir.yahoo.com')
code=a.read()
soup=BeautifulSoup(code)
links=soup.findAll("a")
#To get href part alone
print links[0].attrs['href']

Extract CSS media queries from websites with python 2.7

I'm trying to find a specific CSS media query (@media only screen) in CSS files of websites by using a crawler in python 2.7.
Right now I can crawl websites/URLs (from a CSV file) to find specific keywords in their HTML source code using the following code:
import urllib2
keyword = ['keyword to find']
with open('listofURLs.csv') as f:
    for line in f:
        strdomain = line.strip()
        if strdomain:
            req = urllib2.Request(strdomain.strip())
            response = urllib2.urlopen(req)
            html_content = response.read()
            for searchstring in keyword:
                if searchstring.lower() in str(html_content).lower():
                    print (strdomain, keyword, 'found')
f.close()
However, I now would like to crawl websites/URLs (from the CSV file) to find the @media only screen query in their CSS files/source code. What should my code look like?
So, you have to:
1° read the csv file and put each url in a Python list;
2° loop this list, go to the pages and extract the list of css links. You need an HTML parser, for example BeautifulSoup;
3° browse the list of links and extract the item you need. There are CSS parsers like tinycss or cssutils, but I've never used them. A regex can maybe do the trick, even if this is probably not recommended.
4° write the results
Since you know how to read the csv (PS : no need to close the file with f.close() when you use the with open method), here is a minimal suggestion for operations 2 and 3. Feel free to adapt it to your needs and to improve it. I used Python 3, but I think it works on Python 2.7.
import re
import requests
from bs4 import BeautifulSoup
url_list = ["https://76crimes.com/2014/06/25/zambia-to-west-dont-watch-when-we-jail-lgbt-people/"]
for url in url_list:
    try:
        response = requests.get(url)
        soup = BeautifulSoup(response.content, 'lxml')
        css_links = [link["href"] for link in soup.findAll("link") if "stylesheet" in link.get("rel", [])]
        print(css_links)
    except Exception as e:
        print(e, url)
        pass
css_links = ["https://cdn.sstatic.net/Sites/stackoverflow/all.css?v=d9243128ba1c"]
#your regular expression
pattern = re.compile(r'@media only screen.+?\}')
for url in css_links:
    try:
        response = requests.get(url).text
        media_only = pattern.findall(response)
        print(media_only)
    except Exception as e:
        print(e, url)
        pass
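One caveat with a pattern of the form "@media only screen.+?\}": the dot does not match newlines by default, and the match ends at the first closing brace, which normally closes a nested rule rather than the media block. If all you need is the query itself, a hedged alternative is to capture only the text up to the opening brace. The CSS below is an invented sample, and this sketch is no substitute for a real CSS parser.

```python
import re

css = """
body { margin: 0; }
@media only screen and (max-width: 600px) {
  .menu { display: none; }
}
@media print {
  .menu { display: none; }
}
@media only screen and (min-width: 601px) { .menu { display: block; } }
"""

# Capture everything from '@media only screen' up to, but not including,
# the '{' that opens the block.
pattern = re.compile(r"@media\s+only\s+screen[^{]*")

conditions = [match.strip() for match in pattern.findall(css)]
print(conditions)
```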

Searching Large String for file path. Return filepath + filename

I've got a little project where I’m trying to download a series of wallpapers from a web page. I'm new to python.
I'm using the urllib library, which is returning a long string of web page data which includes
<a href="http://website.com/wallpaper/filename.jpg">
I know that every filename I need to download has
'http://website.com/wallpaper/'
How can I search the page source for this portion of text, and return the rest of the image link, ending with the "*.jpg" extension?
r'http://website.com/wallpaper/ xxxxxx .jpg'
I'm thinking I could write a regular expression in which the xxxx portion is not evaluated: just check for the path and the .jpg extension, then return the whole string once a match is found.
Am I on the right track?
BeautifulSoup is pretty convenient for this sort of thing.
import re
import urllib3
from bs4 import BeautifulSoup
jpg_regex = re.compile(r'\.jpg$')
site_regex = re.compile(r'website\.com/wallpaper/')
pool = urllib3.PoolManager()
response = pool.request('GET', 'http://your_website.com/')
# response.data holds the raw bytes of the page
soup = BeautifulSoup(response.data, 'html.parser')
jpg_list = soup.find_all(name='a', attrs={'href': jpg_regex})
# keep only the .jpg links that also point into the wallpaper folder
result_list = [a.get('href') for a in jpg_list if site_regex.search(a.get('href'))]
I think a very basic regex will do.
Like:
(http:\/\/website\.com\/wallpaper\/[\w\d_-]*?\.jpg)
and if you use $1, this will return the whole string.
And if you use
(http:\/\/website\.com\/wallpaper\/([\w\d_-]*?)\.jpg)
then $1 will give the whole string and $2 will give the file name only.
Note: escaping (\/) is language dependent, so use what is supported by Python.
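In Python the same idea might look like the sketch below. The forward slash needs no escaping in Python's re module, and the whole match is available as group(0), so a single capturing group for the file name is enough. The page source and file names are invented for illustration.

```python
import re

page_source = (
    '<a href="http://website.com/wallpaper/sunset_01.jpg">'
    '<a href="http://website.com/about.html">'
    '<a href="http://website.com/wallpaper/city-night.jpg">'
)

# Group 1 is the file name without the extension.
pattern = re.compile(r"http://website\.com/wallpaper/([\w-]+)\.jpg")

urls = [m.group(0) for m in pattern.finditer(page_source)]   # whole match
names = [m.group(1) for m in pattern.finditer(page_source)]  # file name only

print(urls)
print(names)
```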
Don't use a regular expression against HTML.
Instead, use a HTML parsing library.
BeautifulSoup is a library for parsing HTML and urllib2 is a built-in module for fetching URLs
import urllib2
from bs4 import BeautifulSoup as bs
content = urllib2.urlopen('http://website.com/wallpaper/index.html').read()
html = bs(content)
links = [] # an empty list
for link in html.find_all('a'):
    href = link.get('href')
    # href is None for <a> tags without the attribute
    if href and '/wallpaper/' in href:
        links.append(href)
Search for the "http://website.com/wallpaper/" substring in url and then check for ".jpg" in url, as shown below:
domain = "http://website.com/wallpaper/"
url = str("your URL")
extension = ".jpg"
if url.startswith(domain) and url.endswith(extension):
    pass  # do something with url

Python 3, how to scrape webpages and filter for Href's, delete rubbish and sort

Only interested in Python 3, this is a lab for school and we aren't using Python2.
Tools
Python 3 and
Ubuntu
I want to first be able to download webpages of my choosing, e.g. www.example.com/index.html
and save the index.html or what ever page I want.
Then do the following bash
grep Href | cut -d"/" -f3 | sort -u
But do this in python not using grep, wget, cut etc... but instead only using python 3 commands.
Also, not using any Python scrapers such as scrapy etc... NO legacy Python commands, no urllib2
so I was thinking to start with,
import urllib.request
from urllib.error import HTTPError,URLError
o = urllib.request.urlopen(www.example.com/index.html) #should I use http:// ?
local_file = open(file_name, "w" + file_mode)
#Then write to my local file
local_file.write(o.read())
local_file.close()
except HTTPError as e:
print("HTTP Error:",e.code , url)
except URLError as e:
print("URL Error:",e.reason , url)
But I still need to filter out the href's from the file, and delete all the other stuff, how do i do that, and is the above code ok ?
I thought urllib.request would be better than urlretrieve because it was faster, but if you think that there is not much different maybe it's better to use urlretrieve ?
There is a python package called BeautifulSoup which does what you need.
import urllib2
from bs4 import BeautifulSoup
html_doc = urllib2.urlopen('http://www.google.com')
soup = BeautifulSoup(html_doc)
for link in soup.find_all('a'):
    print(link.get('href'))
Hope that helps.
My try. Kept it as simple as I could (thus, not very efficient for an actual crawler)
from urllib import request
# Fetch the page from the url we want
# read() returns bytes; decode so the str searches below work
page = request.urlopen('http://www.google.com').read().decode('utf-8', errors='replace')
# print(page)
def get_link(page):
    """ Get the first link from the page provided """
    tag = 'href="'
    tag_end = '"'
    tag_start = page.find(tag)
    if not tag_start == -1:
        tag_start += len(tag)
    else:
        return None, page
    tag_end = page.find(tag_end, tag_start)
    link = page[tag_start:tag_end]
    page = page[tag_end:]
    return link, page
def get_all_links(page):
    """ Get all the links from the page provided by calling get_link """
    all_links = []
    while True:
        link, page = get_link(page)
        if link:
            all_links.append(link)
        else:
            break
    return all_links
print(get_all_links(page))
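The grep/cut/sort pipeline from the question can be approximated in a few lines of standard-library Python 3. In the shell version, cut -d"/" -f3 picks the third slash-separated field, which for href="http://host/path" is the host; the sketch below gets the same piece from urllib.parse instead. unique_hosts is a hypothetical helper, the sample HTML is invented, and the regex is a rough stand-in for grep rather than a proper HTML parser.

```python
import re
from urllib.parse import urlparse


def unique_hosts(html):
    """Return the sorted, de-duplicated host names found in href attributes."""
    hrefs = re.findall(r'href\s*=\s*["\'](.*?)["\']', html, flags=re.IGNORECASE)
    hosts = {urlparse(href).netloc for href in hrefs}
    hosts.discard("")  # relative links have no host part
    return sorted(hosts)


sample = (
    '<a href="http://www.example.com/index.html">home</a>'
    '<a HREF="http://www.example.com/about.html">about</a>'
    '<a href="https://docs.python.org/3/">docs</a>'
    '<a href="/local/page.html">local</a>'
)
print(unique_hosts(sample))
```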
Here you can read a post where everything you need is explained:
- Download a web page
- Using BeautifulSoup
- Extract something you need
Extract HTML code from URL in Python and C# - http://www.manejandodatos.es/2014/1/extract-html-code-url-python-c

get all links from html even with show more link

I am using python and beautifulsoup for html parsing.
I am using the following code :
from BeautifulSoup import BeautifulSoup
import urllib2
import re
url = "http://www.wikipathways.org//index.php?query=signal+transduction+pathway&species=Homo+sapiens&title=Special%3ASearchPathways&doSearch=1&ids=&codes=&type=query"
main_url = urllib2.urlopen(url)
content = main_url.read()
soup = BeautifulSoup(content)
for a in soup.findAll('a',href=True):
    print a[href]
but I am not getting output links like :
http://www.wikipathways.org/index.php/Pathway:WP26
Also, importantly, there are 107 pathways, but I will not get all the links because the other links depend on "show links" at the bottom of the page.
So, how can I get all the links (107 links) from that URL?
Your problem is line 8, content = url.read(). You're not actually reading the webpage, you're actually just doing nothing (If anything, you should be getting an error).
main_url is what you want to read, so change line 8 to:
content = main_url.read()
You also have another error, print a[href]. href should be a string, so it should be:
print a['href']
I would suggest using lxml: it's faster and better for parsing HTML, and worth investing the time to learn.
from lxml.html import parse
dom = parse('http://www.wikipathways.org//index.php?query=signal+transduction+pathway&species=Homo+sapiens&title=Special%3ASearchPathways&doSearch=1&ids=&codes=&type=query').getroot()
links = dom.cssselect('a')
That should get you going.
