I'm trying to find a specific CSS media query (@media only screen) in the CSS files of websites by using a crawler in python 2.7.
Right now I can crawl websites/URLs (from a CSV file) to find specific keywords in their HTML source code using the following code:
import urllib2
keyword = ['keyword to find']
with open('listofURLs.csv') as f:
    for line in f:
        strdomain = line.strip()
        if strdomain:
            req = urllib2.Request(strdomain.strip())
            response = urllib2.urlopen(req)
            html_content = response.read()
            for searchstring in keyword:
                if searchstring.lower() in str(html_content).lower():
                    print (strdomain, keyword, 'found')
f.close()
However, I now would like to crawl websites/URLs (from the CSV file) to find the @media only screen query in their CSS files/source code. What should my code look like?
So, you have to:
1° read the csv file and put each url in a Python list;
2° loop this list, go to the pages and extract the list of css links. You need an HTML parser, for example BeautifulSoup;
3° browse the list of links and extract the item you need. There are CSS parsers like tinycss or cssutils, but I've never used them. A regex can maybe do the trick, even if this is probably not recommended.
4° write the results
Since you know how to read the csv (PS : no need to close the file with f.close() when you use the with open method), here is a minimal suggestion for operations 2 and 3. Feel free to adapt it to your needs and to improve it. I used Python 3, but I think it works on Python 2.7.
import re
import requests
from bs4 import BeautifulSoup
url_list = ["https://76crimes.com/2014/06/25/zambia-to-west-dont-watch-when-we-jail-lgbt-people/"]
for url in url_list:
    try:
        response = requests.get(url)
        soup = BeautifulSoup(response.content, 'lxml')
        css_links = [link["href"] for link in soup.findAll("link") if "stylesheet" in link.get("rel", [])]
        print(css_links)
    except Exception as e:
        print(e, url)
        pass
css_links = ["https://cdn.sstatic.net/Sites/stackoverflow/all.css?v=d9243128ba1c"]
#your regular expression
pattern = re.compile(r'@media only screen.+?\}')  # without re.DOTALL this only matches rules written on one line (e.g. minified CSS)
for url in css_links:
    try:
        response = requests.get(url).text
        media_only = pattern.findall(response)
        print(media_only)
    except Exception as e:
        print(e, url)
        pass
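If you prefer a real CSS parser over the regex (which, as said above, is probably not recommended), here is a small cssutils-based sketch. It assumes the CSS text is fetched with requests as above and reuses the css_links list; cssutils is only one of the libraries mentioned earlier, not something the code above requires:
import logging
import requests
import cssutils

cssutils.log.setLevel(logging.CRITICAL)      # silence the parser's validation warnings

css_text = requests.get(css_links[0]).text   # reuse the css_links list built above
sheet = cssutils.parseString(css_text)
for rule in sheet:
    # MEDIA_RULE marks @media blocks; check the media text for "only screen"
    if rule.type == rule.MEDIA_RULE and 'only screen' in rule.media.mediaText:
        print(rule.media.mediaText)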
Here is my code:
import os
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup
url = "https://mathsmadeeasy.co.uk/gcse-maths-revision/"
#If there is no such folder, the script will create one automatically
folder_location = r'E:\webscraping'
if not os.path.exists(folder_location):os.mkdir(folder_location)
response = requests.get(url)
soup= BeautifulSoup(response.text, "html.parser")
for link in soup.select("a[href$='.pdf']"):
    #Name the pdf files using the last portion of each link which are unique in this case
    filename = os.path.join(folder_location,link['href'].split('/')[-1])
    with open(filename, 'wb') as f:
        f.write(requests.get(urljoin(url,link['href'])).content)
Any help as to why the code does not download any of the PDF files from the maths revision site would be appreciated.
Thanks.
Looking at the page itself, while it may look like it is static, it isn't. The content you are trying to access is gated behind some fancy javascript loading. What I've done to assess that is simply logging the page that BS4 actually got and opening it in a text editor:
with open(folder_location+"\page.html", 'wb') as f:
    f.write(response.content)
By the look of it, the page is replacing placeholders with JS, as hinted by the comment on line 70 of the HTML file: // interpolate json by replacing placeholders with variables
As for a solution to your problem: BS4 is not able to execute JavaScript, so it never sees the content those placeholders get filled with. I suggest looking at this answer for someone who had a similar problem. I also suggest looking into Scrapy if you intend to do some more complex web scraping.
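If you would rather stay close to the requests/BeautifulSoup code above, one common workaround (an assumption on my part; the linked answer may propose a different tool) is to let a real browser render the JavaScript with Selenium and only then hand the HTML to BeautifulSoup. A minimal sketch:
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Firefox()   # needs geckodriver on PATH (or a recent Selenium that manages drivers itself)
driver.get("https://mathsmadeeasy.co.uk/gcse-maths-revision/")
html = driver.page_source      # the HTML after the page's JavaScript has run
driver.quit()

soup = BeautifulSoup(html, "html.parser")
pdf_links = [a["href"] for a in soup.select("a[href$='.pdf']")]
print(pdf_links)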
I am new to Python and cannot understand how to implement the following task.
There is a txt file with domains (about 10,000 domains) stored in upper case. I need to:
- convert the domain addresses to lower case
- add the string 'http://' at the beginning of each domain so that it can be passed into requests
- loop over the domains so that the parser collects the title from each domain (site)
- write everything to a file as a table with two fields | site url | site title |
Here is what I have so far:
import requests
from bs4 import BeautifulSoup as bs
f = open(r'file.txt','r+')
a=[]
for i in f:
    a.append(i.lower().replace('\n',''))
    a[-1]='http://'+a[-1]
f.close()
title_list=[]
for url in a:
    try:
        r=requests.get(url)
        page=bs(r.content,'html.parser')
        title=page.find('title')
        title_list.append(url)
        title_list.append(title.text.replace('\n',''))
    except Exception as e:
        print(e)
print(title_list)
I don’t know how to implement server error checking
You can do something similar to this.
import urllib2
from BeautifulSoup import BeautifulSoup

file = open('urllist.txt', 'r')
urlList = file.readlines()
file.close()

titles = []
for url in urlList:
    # strip the trailing newline that readlines() keeps, otherwise urlopen gets a malformed URL
    soup = BeautifulSoup(urllib2.urlopen('https://' + url.strip().lower()))
    titles.append(soup.title.string)
Note:
'urllist.txt' is the file containing URLs
titles will contain the list of website titles
Hope this helps :)
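As for the server error checking mentioned in the question, a small sketch based on the asker's own requests code (the file names are just examples) could check the HTTP status explicitly and also write the two-column table:
import requests
from bs4 import BeautifulSoup

with open('file.txt') as f:
    urls = ['http://' + line.strip().lower() for line in f if line.strip()]

with open('titles.txt', 'w') as out:            # output file name is just an example
    for url in urls:
        try:
            r = requests.get(url, timeout=10)
            r.raise_for_status()                # raises requests.HTTPError on 4xx/5xx responses
        except requests.RequestException as e:  # also covers timeouts and connection errors
            print(url, e)
            continue
        soup = BeautifulSoup(r.content, 'html.parser')
        title = soup.title.text.strip() if soup.title else ''
        out.write('| {} | {} |\n'.format(url, title))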
So I wrote a crawler for my friend that will go through a large list of web pages that are search results, pull all the links off the page, check if they're in the output file and add if they're not there. It took a lot of debugging but it works great! Unfortunately, the little bugger is really picky about which anchored tags it deems important enough to add.
Here's the code:
#!C:\Python27\Python.exe
from bs4 import BeautifulSoup
from urlparse import urljoin #urljoin is a function included in urlparse
import urllib2
import requests #not necessary but keeping here in case additions to code in future
urls_filename = "myurls.txt" #this is the input text file,list of urls or objects to scan
output_filename = "output.txt" #this is the output file that you will export to Excel
keyword = "skin" #optional keyword, not used for this script. Ignore
with open(urls_filename, "r") as f:
    url_list = f.read() #This command opens the input text file and reads the information inside it

with open(output_filename, "w") as f:
    for url in url_list.split("\n"): #This command splits the text file into separate lines so it's easier to scan
        hdr = {'User-Agent': 'Mozilla/5.0'} #This (attempts) to tell the webpage that the program is a Firefox browser
        try:
            response = urllib2.urlopen(url) #tells program to open the url from the text file
        except:
            print "Could not access", url
            continue
        page = response.read() #this assigns a variable to the open page. like algebra, X=page opened
        soup = BeautifulSoup(page) #we are feeding the variable to BeautifulSoup so it can analyze it
        urls_all = soup('a') #beautiful soup is analyzing all the 'anchored' links in the page
        for link in urls_all:
            if('href' in dict(link.attrs)):
                url = urljoin(url, link['href']) #this combines the relative link e.g. "/support/contactus.html" and adds to domain
                if url.find("'")!=-1: continue #explicit statement that the value is not void. if it's NOT void, continue
                url=url.split('#')[0]
                if (url[0:4] == 'http' and url not in output_filename): #this checks if the item is a webpage and if it's already in the list
                    f.write(url + "\n") #if it's not in the list, it writes it to the output_filename
It works great except for the following link:
https://research.bidmc.harvard.edu/TVO/tvotech.asp
This link has a number of links like "tvotech.asp?Submit=List&ID=796" and it's straight up ignoring them. The only anchor that goes into my output file is the main page itself. It's bizarre because, looking at the source code, their anchors are pretty standard.
They have 'a' and 'href'; I see no reason bs4 would just skip them and only include the main link. Please help. I've tried removing http from line 30 or changing it to https, and that just removes all the results; not even the main page comes into the output.
That's because one of the links there has a mailto: in its href; it then gets assigned to the url variable and breaks the rest of the links as well, because they don't pass the url[0:4] == 'http' condition. It looks like this:
mailto:research#bidmc.harvard.edu?subject=Question about TVO Available Technology Abstracts
You should either filter it out or not reuse the same variable url in the loop; note the change to url1:
for link in urls_all:
    if('href' in dict(link.attrs)):
        url1 = urljoin(url, link['href']) #this combines the relative link e.g. "/support/contactus.html" and adds to domain
        if url1.find("'")!=-1: continue #explicit statement that the value is not void. if it's NOT void, continue
        url1=url1.split('#')[0]
        if (url1[0:4] == 'http' and url1 not in output_filename): #this checks if the item is a webpage and if it's already in the list
            f.write(url1 + "\n") #if it's not in the list, it writes it to the output_filename
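The other option mentioned above, filtering the mailto: links out before joining, could look like this sketch (same variables as in the question's code):
for link in urls_all:
    href = dict(link.attrs).get('href')
    if not href or href.lower().startswith('mailto:'):   # drop e-mail links up front
        continue
    url1 = urljoin(url, href)
    if url1.find("'") != -1: continue
    url1 = url1.split('#')[0]
    if url1[0:4] == 'http' and url1 not in output_filename:
        f.write(url1 + "\n")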
I'm only interested in Python 3; this is a lab for school and we aren't using Python 2.
Tools: Python 3 and Ubuntu.
I want to first be able to download webpages of my choosing, e.g. www.example.com/index.html, and save the index.html or whatever page I want.
Then do the equivalent of the following bash:
grep href | cut -d"/" -f3 | sort -u
But do this in Python, not using grep, wget, cut, etc., only Python 3 code. Also, no Python scrapers such as Scrapy, and no legacy Python modules like urllib2.
So I was thinking of starting with:
import urllib.request
from urllib.error import HTTPError,URLError
try:
    o = urllib.request.urlopen('http://www.example.com/index.html') #should I use http:// ?
    local_file = open(file_name, "w" + file_mode)
    #Then write to my local file
    local_file.write(o.read())
    local_file.close()
except HTTPError as e:
    print("HTTP Error:",e.code , url)
except URLError as e:
    print("URL Error:",e.reason , url)
But I still need to filter out the hrefs from the file and delete all the other stuff. How do I do that, and is the above code OK?
I thought urllib.request would be better than urlretrieve because it was faster, but if you think there is not much difference, maybe it's better to use urlretrieve?
There is a python package called BeautifulSoup which does what you need.
import urllib.request
from bs4 import BeautifulSoup

html_doc = urllib.request.urlopen('http://www.google.com')
soup = BeautifulSoup(html_doc, 'html.parser')
for link in soup.find_all('a'):
    print(link.get('href'))
Hope that helps.
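For the downloading-and-saving half of the question, a minimal Python 3 sketch without urllib2 might be (URL and file name taken from the question's example):
import urllib.request

with urllib.request.urlopen('http://www.example.com/index.html') as response:
    with open('index.html', 'wb') as out:   # write the raw bytes; decode only if you need text
        out.write(response.read())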
My try. Kept it as simple as I could (thus, not very efficient for an actual crawler)
from urllib import request

# Fetch the page from the url we want
# .read() returns bytes in Python 3, so decode it before searching it with str methods below
page = request.urlopen('http://www.google.com').read().decode('utf-8', errors='replace')
# print(page)

def get_link(page):
    """ Get the first link from the page provided """
    tag = 'href="'
    tag_end = '"'
    tag_start = page.find(tag)
    if not tag_start == -1:
        tag_start += len(tag)
    else:
        return None, page
    tag_end = page.find(tag_end, tag_start)
    link = page[tag_start:tag_end]
    page = page[tag_end:]
    return link, page

def get_all_links(page):
    """ Get all the links from the page provided by calling get_link """
    all_links = []
    while True:
        link, page = get_link(page)
        if link:
            all_links.append(link)
        else:
            break
    return all_links

print(get_all_links(page))
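To mirror the cut -d"/" -f3 | sort -u part of the original pipeline, a rough follow-up (reusing get_all_links from above) is to keep only the host portion of each absolute link and de-duplicate:
from urllib.parse import urlparse

links = get_all_links(page)
# netloc is the "www.example.com" part, i.e. the third "/"-separated field of an absolute URL
hosts = sorted({urlparse(link).netloc for link in links if link.startswith('http')})
for host in hosts:
    print(host)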
Here you can read a post where everything you need is explained:
- Download a web page
- Using BeautifulSoup
- Extract what you need
Extract HTML code from URL in Python and C# - http://www.manejandodatos.es/2014/1/extract-html-code-url-python-c
I have a folder full of Windows .URL files. I'd like to translate them into a list of MLA citations for my paper.
Is this a good application of Python? How can I get the page titles? I'm on Windows XP with Python 3.1.1.
This is a fantastic use for Python! The .URL file format has a syntax like this:
[InternetShortcut]
URL=http://www.example.com/
OtherStuff=irrelevant
To parse your .URL files, start with ConfigParser, which will read this and make an InternetShortcut section that you can read the URL from. Once you have a list of URLs, you can then use urllib or urllib2 to load the URL, and use a dumb regex to get the page title (or BeautifulSoup as Alex suggests).
Once you have that, you have a list of URLs and page titles...not enough for a full MLA citation, but should be enough to get you started, no?
Something like this (very rough, coding in the SO window):
from glob import glob
from urllib2 import urlopen
from ConfigParser import ConfigParser
from re import search
# I use RE here, you might consider BeautifulSoup because RE can be stupid
TITLE = r"<title>([^<]+)</title>"
result = []
for file in glob("*.url"):
    config = ConfigParser()
    config.read(file)
    url = config.get("InternetShortcut", "URL")
    # Get the title
    page = urlopen(url).read()
    try: title = search(TITLE, page).groups()[0]
    except: title = "Couldn't find title"
    result.append((url, title))

for url, title in result:
    print "'%s' <%s>" % (title, url)
Given a file that contains an HTML page, you can parse it to extract its title, and BeautifulSoup is the recommended third-party library for the job. Get the BeautifulSoup version compatible with Python 3.1 here, install it, then:
parse each file's contents into a soup object e.g. with:
from BeautifulSoup import BeautifulSoup
html = open('thefile.html', 'r').read()
soup = BeautifulSoup(html)
get the title tag, if any, and print its string contents (if any):
title = soup.find('title')
if title is None: print('No title!')
else: print('Title: ' + title.string)
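To run this over several saved pages rather than one, a rough sketch (assuming the pages already sit on disk as .html files, which is not shown above) could be:
from glob import glob
from BeautifulSoup import BeautifulSoup

for name in glob('*.html'):
    # parse each saved page and report its title, if it has one
    soup = BeautifulSoup(open(name, 'r').read())
    title = soup.find('title')
    if title is None:
        print('No title in ' + name)
    else:
        print(name + ': ' + title.string)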