I have a long list of Wikipedia links in a plaintext file. Each link is separated by a newline and is percent-encoded. Unfortunately, a large number of these links are outdated; some are redirects and others have been removed. Is there any way to automatically sort through the links, resolving redirects and removing dead links?
A bash/python script would be nice, but any other working implementation is fine.
python mechanize is nice:
import mechanize
links = [
"http://en.wikipedia.org/wiki/Markov_chain",
"http://en.wikipedia.org/wiki/Dari",
"http://en.wikipedia.org/wiki/Frobnab"
]
br = mechanize.Browser()
br.addheaders = [('User-agent', 'Mozilla/5.0')] # A white lie
for link in links:
    print link
    try:
        br.open(link)
        page_name = br.title()[:-35].replace(" ", "_")
        if page_name != link.split("/")[-1]:
            print "redirected to:", page_name
        else:
            print "page OK"
    except mechanize.URLError:
        print "error: dead link"
It should be easy with Perl and LWP::UserAgent:
#!/usr/bin/perl
use LWP::UserAgent;
open my $fh, "<", "links.txt" or die $!;
chomp(my @links = <$fh>); # strip the trailing newline from each link
my $ua = LWP::UserAgent->new;
for my $link (@links) {
    my $resp = $ua->get($link); # automatically follows redirects
    if ($resp->is_success) {
        print $resp->request->uri, "\n";
    }
}
This will not tell you whether a link is a redirect, but it will check all the links. Redirects are treated as valid links (as long as the redirected page is found, obviously). Adjust the print statements to get the output you need.
#!/usr/bin/python
from urllib import urlopen
f = open('links.txt', 'r')
valid = []
broken = []
for line in f:
    url = line.strip()  # drop the trailing newline before opening
    try:
        urlopen(url)
        valid.append(url)
    except:
        broken.append(url)
for link in valid:
    print "VALID: " + link
for link in broken:
    print "BROKEN: " + link
If you want to know which valid links are redirects, you can probably do it with urllib.FancyURLopener(), but I have never used it so can't be certain.
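For anyone on Python 3, here is a minimal sketch of the same idea that also reports redirects by comparing the requested URL with the URL of the final response. The function names, the User-Agent string and the injectable fetch parameter are my own choices for illustration, not part of any answer above. It only detects HTTP-level redirects; a site that serves the target page under the original URL would need a different check, such as the title comparison in the mechanize answer.

```python
import urllib.error
import urllib.request
from urllib.parse import unquote


def fetch_final_url(url, timeout=10):
    """Open url (following HTTP redirects) and return the final URL."""
    # The User-Agent value is an arbitrary placeholder for this sketch.
    req = urllib.request.Request(url, headers={"User-Agent": "link-checker-sketch/0.1"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.geturl()


def check_link(url, fetch=fetch_final_url):
    """Classify one link as ("ok", url), ("redirect", new_url) or ("dead", None)."""
    url = url.strip()  # lines read from a file still carry their newline
    try:
        final = fetch(url)
    except (OSError, ValueError):
        # URLError and HTTPError are OSError subclasses; ValueError covers malformed URLs.
        return ("dead", None)
    # Compare decoded forms so differences in percent-encoding alone are not flagged.
    if unquote(final) != unquote(url):
        return ("redirect", final)
    return ("ok", final)
```

To process the file, something like `for line in open("links.txt"): print(line.strip(), check_link(line))` should do; the fetch argument exists only so the classification can be tried without network access.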
I am trying to pull the number of followers from a list of Instagram accounts. I have tried using the "find" method on the Requests response text; however, the string that I see when I inspect the actual Instagram page no longer appears when I print "r" from the code below.
I was able to run this code successfully in the past, but it no longer works.
Webscraping Instagram follower count BeautifulSoup
import requests
user = "espn"
url = 'https://www.instagram.com/' + user
r = requests.get(url).text
start = '"edge_followed_by":{"count":'
end = '},"followed_by_viewer"'
print(r[r.find(start)+len(start):r.rfind(end)])
I receive a "-1" error, which means the substring from the find method was not found within the variable "r".
I think it's because of the last ' in start and first ' in end...this will work:
import requests
import re
user = "espn"
url = 'https://www.instagram.com/' + user
r = requests.get(url).text
followers = re.search('"edge_followed_by":{"count":([0-9]+)}',r).group(1)
print(followers)
'14061730'
I want to suggest an updated solution to this question, as the answer of Derek Eden above from 2019 does not work anymore, as stated in its comments.
The solution was to add the r' before the regular expression in the re.search like so:
follower_count = re.search(r'"edge_followed_by\\":{\\"count\\":([0-9]+)}', response).group(1)
This r'' prefix is really important: without it, Python processes the backslash escapes in the string literal before the regex engine sees the pattern, so the search does not return any results.
Also, the Instagram page seems to have backslashes in the object we are looking for, at least in my tests, so the code example I use is the following, in Python 3.10 and working as of July 2022:
# get follower count of instagram profile
import os.path
import requests
import re
import urllib3
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
# get instagram follower count
def get_instagram_follower_count(instagram_username):
    url = "https://www.instagram.com/" + instagram_username
    filename = "instagram.html"
    try:
        if not os.path.isfile(filename):
            r = requests.get(url, verify=False)
            print(r.status_code)
            print(r.text)
            response = r.text
            if not r.status_code == 200:
                raise Exception("Error: " + str(r.status_code))
            with open(filename, "w") as f:
                f.write(response)
        else:
            with open(filename, "r") as f:
                response = f.read()
                # print(response)
        follower_count = re.search(r'"edge_followed_by\\":{\\"count\\":([0-9]+)}', response).group(1)
        return follower_count
    except Exception as e:
        print(e)
        return 0
print(get_instagram_follower_count('your.instagram.profile'))
The method returns the follower count as expected. Please note that I added a few lines that save the response in a file, so that I do not hammer Instagram's web server and get blocked while testing.
This is a slice of the original html content that contains the part we are looking for:
... mRL&s=1\",\"edge_followed_by\":{\"count\":110070},\"fbid\":\"1784 ...
I debugged the regex in regexr, it seems to work just fine at this point in time.
There are many posts about the regex r prefix like this one
Also the documentation of the re package shows clearly that this is the issue with the code above.
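To make the difference concrete, here is a small sketch that runs against the HTML slice quoted above rather than a live page. The function name and the two compiled patterns are mine; the escaped pattern targets the backslash-escaped form from 2022 and the plain one targets the unescaped form the 2019 answer matched. Whether Instagram still serves either form is not something this sketch can confirm.

```python
import re

# In a raw string, \\" is the regex for a literal backslash followed by a quote.
ESCAPED = re.compile(r'\\"edge_followed_by\\":\{\\"count\\":([0-9]+)\}')
# The unescaped variant, as matched by the older answer.
PLAIN = re.compile(r'"edge_followed_by":\{"count":([0-9]+)\}')


def extract_follower_count(html):
    """Return the follower count as an int, or None if neither form is present."""
    for pattern in (ESCAPED, PLAIN):
        match = pattern.search(html)
        if match:
            return int(match.group(1))
    return None


sample = r'mRL&s=1\",\"edge_followed_by\":{\"count\":110070},\"fbid\":\"1784'
print(extract_follower_count(sample))
```

Returning None instead of calling .group(1) directly avoids the AttributeError you get when the page layout changes and nothing matches.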
I have a webcrawler, but currently the 404 error occurs when calling requests.get(url) from the requests module. But the URL is reachable.
base_url = "https://www.blogger.com/profile/"
site = base_url + blogs_to_visit.pop().rsplit('/', 1)[-1]
r = requests.get(site)
soup = BeautifulSoup(r.content, "html.parser")
# Printing some values for debugging
>>> print site
https://www.blogger.com/profile/01785989747304686024
>>> print r
<Response [404]>
However, if I hardcode site as the exact same string before passing it to requests, the response is 202.
site = "https://www.blogger.com/profile/01785989747304686024"
# Printing some values for debugging
>>> print site
https://www.blogger.com/profile/01785989747304686024
>>> print r
<Response [202]>
What just struck me is that there appears to be a hidden newline after printing site the first time. Might that be what's causing the problem?
The URLs to visit are stored in a file earlier with:
for link in soup.select("h2 a[href]"):
    blogs.write(link.get("href") + "\n")
and fetched with
with open("foo") as p:
    return p.readlines()
The question is then, what would be a better way of writing them to the file? If I don't separate them with "\n", for example, all the URLs are glued together as one.
In reference to Getting rid of \n when using .readlines(), perhaps use:
with open("foo") as p:
    return p.read().splitlines()
you can use:
r = requests.get(site.strip('\n'))
instead of:
r = requests.get(site)
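Putting both suggestions together, a minimal sketch: keep writing one URL per line, and clean the lines when reading them back. The helper names are hypothetical, and the file name foo is taken from the question.

```python
def clean_urls(lines):
    """Strip surrounding whitespace (including the newline) and skip blank lines."""
    return [line.strip() for line in lines if line.strip()]


def read_urls(path):
    """Read one URL per line from path, without trailing newlines."""
    with open(path) as handle:
        return clean_urls(handle)


# readlines() keeps the "\n", which then ends up at the end of the requested URL:
print(clean_urls(["https://www.blogger.com/profile/01785989747304686024\n", "\n"]))
```

Using strip() rather than strip('\n') also removes a stray "\r" if the file was written on Windows.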
My code is to search a Link passed in the command prompt, get the HTML code for the webpage at the Link, search the HTML code for links on the webpage, and then repeat these steps for the links found. I hope that is clear.
It should print out any links that cause errors.
Some more needed info:
The max visits it can do is 100.
If a website has an error, a None value is returned.
Python3 is what I am using
eg:
s = readwebpage(url)... # This line of code gets the HTML code for the link(url) passed in its argument.... if the link has an error, s = None.
The HTML code for that website has links that end in p2.html, p3.html, p4.html, and p5.html on its webpage. My code reads all of these, but it does not visit these links individually to search for more links. If it did this, it should search through these links and find a link that ends in p10.html, and then it should report that the link ending with p10.html has errors. Obviously it doesn't do that at the moment, and it's giving me a hard time.
My code..
url = args.url[0]
url_list = [url]
checkedURLs = []
AmountVisited = 0
while (url_list and AmountVisited<maxhits):
    url = url_list.pop()
    s = readwebpage(url)
    print("testing url: http",url) #Print the url being tested, this code is here only for testing..
    AmountVisited = AmountVisited + 1
    if s == None:
        print("* bad reference to http", url)
    else:
        urls_list = re.findall(r'href="http([\s:]?[^\'" >]+)', s) #Creates a list of all links in HTML code starting with...
        while urls_list: #... http or https
            insert = urls_list.pop()
            while(insert in checkedURLs and urls_list):
                insert = urls_list.pop()
            url_list.append(insert)
            checkedURLs = insert
Please help :)
Here is the code you wanted. However, please, stop using regexes for parsing HTML. BeautifulSoup is the way to go for that.
import re
from urllib import urlopen
def readwebpage(url):
    print "testing ",current
    return urlopen(url).read()
url = 'http://xrisk.esy.es' #put starting url here
yet_to_visit= [url]
visited_urls = []
AmountVisited = 0
maxhits = 10
while (yet_to_visit and AmountVisited<maxhits):
    print yet_to_visit
    current = yet_to_visit.pop()
    AmountVisited = AmountVisited + 1
    html = readwebpage(current)
    if html == None:
        print "* bad reference to http", current
    else:
        r = re.compile('(?<=href=").*?(?=")')
        links = re.findall(r,html) #Creates a list of all links in HTML code starting with...
        for u in links:
            if u in visited_urls:
                continue
            elif u.find('http')!=-1:
                yet_to_visit.append(u)
        print links
    visited_urls.append(current)
Not Python but since you mentioned you aren't tied strictly to regex, I think you might find some use in using wget for this.
wget --spider -o C:\wget.log -e robots=off -w 1 -r -l 10 http://www.stackoverflow.com
Broken down:
--spider: When invoked with this option, Wget will behave as a Web spider, which means that it will not download the pages, just check that they are there.
-o C:\wget.log: Log all messages to C:\wget.log.
-e robots=off: Ignore robots.txt
-w 1: set a wait time of 1 second
-r: set recursive search on
-l 10: sets the recursive depth to 10, meaning wget will only go as deep as 10 levels in, this may need to change depending on your max requests
http://www.stackoverflow.com: the URL you want to start with
Once complete, you can review the wget.log entries to determine which links had errors by searching for something like HTTP status codes 404, etc.
I suspect your regex is part of your problem. Right now, you have http outside your capture group, and [\s:] matches "some sort of whitespace (ie \s) or :"
I'd change the regex to: urls_list = re.findall(r'href="(.*)"',s). Also known as "match anything in quotes, after href=". If you absolutely need to ensure the http[s]://, use r'href="(https?://.*)"' (s? => one or zero s)
EDIT: And with an actually working regex, using a non-greedy match: r'href=(?P<q>[\'"])(https?://.*?)(?P=q)'. Note that with two groups in the pattern, re.findall returns (quote, url) tuples.
(Also, uh, while it's not technically necessary in your case because re caches, I think it's good practice to get into the habit of using re.compile.)
I think it's awfully nice that all of your URLs are full URLs. Do you have to deal with relative URLs at all?
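If relative links such as p2.html do turn up, and you would rather not depend on regexes, here is a rough Python 3 sketch using only the standard library. It swaps in html.parser and urljoin for the regex approach, and it takes the page reader as a parameter so that it follows the question's contract (the reader returns None for a bad link). All class and function names here are hypothetical.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(html, base_url):
    """Return absolute http(s) links found in html, resolving relative ones."""
    parser = LinkCollector()
    parser.feed(html)
    absolute = (urljoin(base_url, href) for href in parser.hrefs)
    return [url for url in absolute if url.startswith(("http://", "https://"))]


def crawl(start_url, fetch, max_hits=100):
    """Breadth-first crawl; fetch(url) returns HTML or None. Returns the bad links."""
    queue = deque([start_url])
    seen = {start_url}  # everything ever queued, so no page is visited twice
    bad = []
    visited = 0
    while queue and visited < max_hits:
        url = queue.popleft()
        visited += 1
        html = fetch(url)
        if html is None:
            bad.append(url)
            continue
        for link in extract_links(html, url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return bad
```

Calling crawl(url, readwebpage) with the question's readwebpage should then visit the linked pages in turn and report any link whose fetch returned None.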
I am currently writing a Python crawler and want to move on to the next page. What is the best practice for this?
The end of the URL is .html?page=1, so I can simply increment the page number, but is there a cleaner way to do it?
I use urllib, urlparse and BeautifulSoup.
#!/usr/bin/env python2
import urllib
import urlparse
from bs4 import BeautifulSoup
def getURL():
    try:
        fo = open("WebsiteToCrawl", "rw")
        print ok() + "Data to crawl a store in : ", fo.name
    except:
        print fail() + "File doesn't exist, please create WebSiteTOCrawl file for store website listing"
    line = fo.readlines()
    print ok() + "Return website : %s" % (line)
    fo.close()
    i= 0
    while i<len(line):
        try:
            returnDATA = urllib.urlopen(line[i]).read()
            print ok() + "Handle :" + line[i]
            handleDATA(returnDATA)
        except:
            print fail() + "Can't open url"
        i += 1
def handleDATA(returnDATA):
    try:
        soup = BeautifulSoup(returnDATA)
        for link in soup.find_all('a'):
            urls = link.get('href')
            print urls
    except:
        print end() + "EOF: All site crawled"
def main():
    useDATA = getURL()
    handleDATA(useDATA)
if __name__ == "__main__":
    main()
NB: I've simplified the code compared to the original.
If it's as straightforward as changing the number in the url, then do that.
However, you should consider how you're going to know when to stop. If the page returns pagination detail at the bottom (e.g. Back 1 2 3 4 5 ... 18 Next) then you could grab the contents of that element and find the 18.
An alternative, albeit slower, would be to parse the pagination links on each page and follow them manually, either by opening the URL directly or by using a click method to click Next until Next no longer appears on the page. I don't use urllib directly, but it can be done super easily with Selenium's Python bindings (driven by PhantomJS if you need it to be headless). You could also do this whole routine with probably an even smaller amount of code using RoboBrowser if you don't have AJAX to deal with.
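For the simple increment-the-number approach, a minimal Python 3 sketch might look like the following. It rewrites the page query parameter with urllib.parse instead of string slicing and stops at the first empty response. The helper names, the max_pages cap and the "empty body means no more pages" stop rule are assumptions for illustration; a real site may need one of the stop conditions described above instead.

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse


def with_page(url, page):
    """Return url with its 'page' query parameter set to the given number."""
    parts = urlparse(url)
    query = dict(parse_qsl(parts.query))
    query["page"] = str(page)
    return urlunparse(parts._replace(query=urlencode(query)))


def iter_pages(url, fetch, max_pages=1000):
    """Yield (page_number, body) until fetch returns nothing or max_pages is hit."""
    for page in range(1, max_pages + 1):
        body = fetch(with_page(url, page))
        if not body:  # assumed stop rule: an empty or missing page ends the listing
            break
        yield page, body


print(with_page("http://example.com/list.html?page=1", 3))
```

Here fetch is whatever function downloads a URL and returns its body, so the loop itself stays independent of urllib, Selenium or RoboBrowser.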
I have the following sample code where I download a pdf from the European Parliament website on a given legislative proposal:
EDIT: I ended up just getting the link and feeding it to adobes online conversion tool (see the code below):
import mechanize
import urllib2
import re
from BeautifulSoup import *
adobe = "http://www.adobe.com/products/acrobat/access_onlinetools.html"
url = "http://www.europarl.europa.eu/oeil/search_reference_procedure.jsp"
def get_pdf(soup2):
    link = soup2.findAll("a", "com_acronym")
    new_link = []
    amendments = []
    for i in link:
        if "REPORT" in i["href"]:
            new_link.append(i["href"])
    if new_link == None:
        print "No A number"
    else:
        for i in new_link:
            page = br.open(str(i)).read()
            bs = BeautifulSoup(page)
            text = bs.findAll("a")
            for i in text:
                if re.search("PDF", str(i)) != None:
                    pdf_link = "http://www.europarl.europa.eu/" + i["href"]
            pdf = urllib2.urlopen(pdf_link)
            name_pdf = "%s_%s.pdf" % (y,p)
            localfile = open(name_pdf, "w")
            localfile.write(pdf.read())
            localfile.close()
            br.open(adobe)
            br.select_form(name = "convertFrm")
            br.form["srcPdfUrl"] = str(pdf_link)
            br["convertTo"] = ["html"]
            br["visuallyImpaired"] = ["notcompatible"]
            br.form["platform"] =["Macintosh"]
            pdf_html = br.submit()
            soup = BeautifulSoup(pdf_html)
page = range(1,2) #can be set to 400 to get every document for a given year
year = range(1999,2000) #can be set to 2011 to get documents from all years
for y in year:
    for p in page:
        br = mechanize.Browser()
        br.open(url)
        br.select_form(name = "byReferenceForm")
        br.form["year"] = str(y)
        br.form["sequence"] = str(p)
        response = br.submit()
        soup1 = BeautifulSoup(response)
        test = soup1.find(text="No search result")
        if test != None:
            print "%s %s No page skipping..." % (y,p)
        else:
            print "%s %s Writing dossier..." % (y,p)
            for i in br.links(url_regex="file.jsp"):
                link = i
            response2 = br.follow_link(link).read()
            soup2 = BeautifulSoup(response2)
            get_pdf(soup2)
In the get_pdf() function I would like to convert the PDF file to text in Python so I can parse the text for information about the legislative procedure. Can anyone explain to me how this can be done?
Thomas
Sounds like you found a solution, but if you ever want to do it without a web service, or you need to scrape data based on its precise location on the PDF page, can I suggest my library, pdfquery? It basically turns the PDF into an lxml tree that can be spit out as XML, or parsed with XPath, PyQuery, or whatever else you want to use.
To use it, once you had the file saved to disk you would return pdf = pdfquery.PDFQuery(name_pdf), or pass in a urllib file object directly if you didn't need to save it. To get XML out to parse with BeautifulSoup, you could do pdf.tree.tostring().
If you don't mind using JQuery-style selectors, there's a PyQuery interface with positional extensions, which can be pretty handy. For example:
balance = pdf.pq(':contains("Your balance is")').text()
strings_near_the_bottom_of_page_23 = [el.text for el in pdf.pq('LTPage[page_label=23] :in_bbox(0, 0, 600, 200)')]
It's not exactly magic. I suggest
downloading the PDF file to a temp directory,
calling out to an external program to extract the text into a (temp) text file,
reading the text file.
For text extraction command-line utilities you have a number of possibilities and there may be others not mentioned in the link (perhaps Java-based). Try them first to see if they fit your needs. That is, try each step separately (finding the links, downloading the files, extracting the text) and then piece them together. For calling out, use subprocess.Popen or subprocess.call().
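The three steps above can be sketched in Python 3 roughly as follows. I am assuming the pdftotext utility from Poppler is installed and on the PATH; any other extractor that takes an input path and an output path could be substituted through the command parameter. The function names are hypothetical.

```python
import os
import shutil
import subprocess
import tempfile
import urllib.request


def download_pdf(url, dest_path):
    """Step 1: save the PDF at url to dest_path."""
    with urllib.request.urlopen(url) as resp, open(dest_path, "wb") as out:
        shutil.copyfileobj(resp, out)


def extract_text(pdf_path, command=("pdftotext", "-layout")):
    """Steps 2 and 3: run an external extractor, then read the text it wrote.

    The command is invoked as: command... <pdf_path> <txt_path>.
    pdftotext is an assumption here; it must be installed separately.
    """
    with tempfile.TemporaryDirectory() as tmp:
        txt_path = os.path.join(tmp, "out.txt")
        # check=True raises CalledProcessError if the extractor fails.
        subprocess.run([*command, pdf_path, txt_path], check=True)
        with open(txt_path, encoding="utf-8", errors="replace") as handle:
            return handle.read()
```

Note the PDF is opened in binary mode ("wb"); writing it in text mode can corrupt the file on some platforms.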