Checking ALL links within links from a source HTML, Python

My code should take a link passed at the command prompt, get the HTML for the webpage at that link, search that HTML for links on the page, and then repeat these steps for the links found. I hope that is clear.
It should print out any links that cause errors.
Some more needed info:
The max visits it can do is 100.
If a website has an error, a None value is returned.
I am using Python 3.
e.g.:
s = readwebpage(url)  # Gets the HTML code for the webpage at the link (url) passed as its argument; if the link has an error, s = None.
The HTML for that website contains links ending in p2.html, p3.html, p4.html, and p5.html. My code finds all of these, but it does not visit them individually to search for more links. If it did, it would find a link ending in p10.html among them and report that the link ending with p10.html has errors. Obviously it doesn't do that at the moment, and that's giving me a hard time.
My code..
url = args.url[0]
url_list = [url]
checkedURLs = []
AmountVisited = 0

while (url_list and AmountVisited < maxhits):
    url = url_list.pop()
    s = readwebpage(url)
    print("testing url: http", url)  # Print the url being tested, this code is here only for testing..
    AmountVisited = AmountVisited + 1
    if s == None:
        print("* bad reference to http", url)
    else:
        urls_list = re.findall(r'href="http([\s:]?[^\'" >]+)', s)  # Creates a list of all links in HTML code starting with http or https
        while urls_list:
            insert = urls_list.pop()
            while (insert in checkedURLs and urls_list):
                insert = urls_list.pop()
            url_list.append(insert)
            checkedURLs = insert
Please help :)

Here is the code you wanted. However, please stop using regexes for parsing HTML; BeautifulSoup is the way to go for that.
import re
from urllib.request import urlopen
from urllib.error import URLError

def readwebpage(url):
    # Return the page HTML as text, or None if the request fails.
    print("testing", url)
    try:
        return urlopen(url).read().decode('utf-8', errors='replace')
    except (URLError, ValueError):
        return None

url = 'http://xrisk.esy.es'  # put starting url here
yet_to_visit = [url]
visited_urls = []
AmountVisited = 0
maxhits = 10

while yet_to_visit and AmountVisited < maxhits:
    print(yet_to_visit)
    current = yet_to_visit.pop()
    AmountVisited = AmountVisited + 1
    html = readwebpage(current)
    if html is None:
        print("* bad reference to", current)
    else:
        r = re.compile('(?<=href=").*?(?=")')
        links = re.findall(r, html)  # every quoted href value in the HTML
        for u in links:
            if u in visited_urls:
                continue
            elif u.find('http') != -1:  # keep only absolute http/https links
                yet_to_visit.append(u)
        print(links)
    visited_urls.append(current)
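
If you do go the BeautifulSoup route, the link-extraction step could look like this minimal sketch (it assumes the bs4 package is installed and, like the regex above, only keeps absolute http/https links):
from bs4 import BeautifulSoup

def extract_links(html):
    # Parse the page and collect every href value that is an absolute http(s) link.
    soup = BeautifulSoup(html, 'html.parser')
    return [a['href'] for a in soup.find_all('a', href=True)
            if a['href'].startswith('http')]

You would call extract_links(html) in place of the re.findall line in the loop above.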

Not Python, but since you mentioned you aren't tied strictly to regex, I think you might find wget useful for this.
wget --spider -o C:\wget.log -e robots=off -w 1 -r -l 10 http://www.stackoverflow.com
Broken down:
--spider: When invoked with this option, Wget will behave as a Web spider, which means that it will not download the pages, just check that they are there.
-o C:\wget.log: Log all messages to C:\wget.log.
-e robots=off: Ignore robots.txt
-w 1: set a wait time of 1 second
-r: turns recursive retrieval on
-l 10: sets the recursion depth to 10, meaning wget will only go 10 levels deep; this may need to change depending on your max requests
http://www.stackoverflow.com: the URL you want to start with
Once complete, you can review the wget.log entries to determine which links had errors by searching for something like HTTP status codes 404, etc.
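If you'd rather pull the failures out programmatically, here is a rough Python sketch; it assumes the default English log format, where each request line looks like "--<timestamp>--  <url>" and is later followed by "HTTP request sent, awaiting response... <status>":
import re

last_url = None
with open(r'C:\wget.log', encoding='utf-8', errors='replace') as log:
    for line in log:
        url_match = re.search(r'--\s+(\S+)\s*$', line)
        if line.startswith('--') and url_match:
            last_url = url_match.group(1)           # remember the URL being requested
        status_match = re.search(r'awaiting response\.\.\. (\d{3})', line)
        if status_match and status_match.group(1).startswith(('4', '5')):
            print(last_url, status_match.group(1))  # report 4xx/5xx responses

Anything reported there is a candidate broken link.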

I suspect your regex is part of your problem. Right now, you have http outside your capture group, and [\s:] matches "some sort of whitespace (i.e. \s) or :".
I'd change the regex to: urls_list = re.findall(r'href="(.*)"', s). Also known as "match anything in quotes, after href=". If you absolutely need to ensure the http[s]://, use r'href="(https?://.*)"' (s? => one or zero s).
EDIT: And with an actually working regex, using a non-greedy glom: r'href=(?P<q>[\'"])(https?://.*?)(?P=q)'
(Also, uh, while it's not technically necessary in your case because re caches, I think it's good practice to get into the habit of using re.compile.)
I think it's awfully nice that all of your URLs are full URLs. Do you have to deal with relative URLs at all?
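For the record, a quick check of that last pattern against some made-up HTML (just an illustration, not part of your crawler):
import re

link_re = re.compile(r'href=(?P<q>[\'"])(https?://.*?)(?P=q)')

html = '<a href="http://example.com/p2.html">two</a> <a href=\'https://example.com/p3.html\'>three</a>'
# group 2 is the URL; the named group "q" just makes the closing quote match the opening one
print([m.group(2) for m in link_re.finditer(html)])
# ['http://example.com/p2.html', 'https://example.com/p3.html']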

Related

Beautifulsoup unable to get data from mds-data-table from morningstar

I'm trying to get the dividend information from morningstar.
The following code works for scraping info from finviz, but the dividend information it returns is not the same as on my broker platform.
import urllib3
from bs4 import BeautifulSoup

symbol = 'bxs'
morningstar_url = 'https://www.morningstar.com/stocks/xnys/' + symbol + '/dividends'

http = urllib3.PoolManager()
response = http.request('GET', morningstar_url)
soup = BeautifulSoup(response.data, 'lxml')
html = list(soup.children)[1]
[type(item) for item in list(soup.children)]

def display_elements(L, show=0):
    test = list(L.children)
    if show:
        for i in range(len(test)):
            print(i)
            print(test[i])
            print()
    return test

test = display_elements(html, 1)
I have no issue printing out the elements but cannot find the element that houses the information such as "Total Yield %" of 2.8%. How do I get inside the mds-data-table to extract the information?
Great question! I've actually worked on this specifically, but years ago. Morningstar only loads the tables after running a script, precisely to prevent this type of scraping. If you view the source immediately on load, you won't be able to see that HTML at all.
What you're going to want to do is find the JavaScript code that loads the elements, and point bs4 at that instead. You'll have to poke around the files, but somewhere deep in those js files you'll find a dynamic URL. It'll be hidden, but it's in there somewhere. I'll go look at some of my old code and see if I can find something that helps.
So here's an edited sample of what used to work for me:
import time
import logging
from urllib.request import urlopen

exchange = 'NYSE'
ticker = 'V'

if exchange == 'NYSE':
    exchange_code = "XNYS"
elif exchange in ["NasdaqNM", "NASDAQ"]:
    exchange_code = "XNAS"
else:
    # this snippet is lifted from a larger function, hence the early exit
    logging.info("Unknown Exchange Code for {}".format(ticker))
    raise SystemExit

time_now = int(time.time())
time_delay = int(time.time() + 150)
morningstar_raw = urlopen(f'http://financials.morningstar.com/ajax/ReportProcess4HtmlAjax.html?&t={exchange_code}:{ticker}&region=usa&culture=en-US&cur=USD&reportType=is&period=12&dataType=A&order=asc&columnYear=5&rounding=3&view=raw&r=354589&callback=jsonp{time_now}&_={time_delay}')
print(morningstar_raw.read())
Granted, this solution is from a file last edited sometime in 2018, and they may have changed their scripting since, but you can find this and much more in my GitHub project wxStocks.

Access only links with given format from a python list

I have written code that fetches the HTML of any given site, then fetches all links from it and saves them in a list. My goal is to change all the relative links in the HTML file to absolute links.
Here are the links:
src="../styles/scripts/jquery-1.9.1.min.js"
href="/PhoneBook.ico"
href="../css_responsive/fontsss.css"
src="http://www.google.com/adsense/search/ads.js"
L.src = '//www.google.com/adsense/search/async-ads.js'
href="../../"
src='../../images/plus.png'
vrUrl ="search.aspx?searchtype=cat"
These are a few links that I have copied from the HTML file to keep the question simple and less error-prone.
The following are the different URL formats used in the HTML file:
http://yourdomain.com/images/example.png
//yourdomain.com/images/example.png
/images/example.png
images/example.png
../images/example.png
../../images/example.png
Python code:
linkList = re.findall(re.compile(u'(?<=href=").*?(?=")|(?<=href=\').*?(?=\')|(?<=src=").*?(?=")|(?<=src=\').*?(?=\')|(?<=action=").*?(?=")|(?<=vrUrl =").*?(?=")|(?<=\')//.*?(?=\')'), str(html))
newLinks = []
for link1 in linkList:
    if link1.startswith("//"):
        newLinks.append(link1)
    elif link1.startswith("../"):
        newLinks.append(link1)
    elif link1.startswith("../../"):
        newLinks.append(link1)
    elif link1.startswith("http"):
        newLinks.append(link1)
    elif link1.startswith("/"):
        newLinks.append(link1)
    else:
        newLinks.append(link1)
At this point, when it reaches the second condition, "../", it gives me all the URLs that start with "../" as well as those starting with "../../", which is not the behavior I need. The same goes for "/": it also catches URLs starting with "//". I also tried the start and end parameters of the startswith function, but that doesn't solve the issue.
How about using the str.count method:
>>> src="../styles/scripts/jquery-1.9.1.min.js"
>>> src2='../../images/plus.png'
>>> src.count('../')
1
>>> src2.count('../')
2
This works because "../" only appears at the beginning of these URLs.
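A minimal sketch of how the counts (and checking "//" before "/") could drive the branching; the labels are just placeholders for whatever you do with each kind of link:
def classify(link):
    # more specific prefixes are tested before the ones they would also match
    if link.startswith('//'):
        return 'protocol-relative'
    elif link.count('../') >= 2:
        return 'two or more levels up'
    elif link.count('../') == 1:
        return 'one level up'
    elif link.startswith('http'):
        return 'absolute'
    elif link.startswith('/'):
        return 'root-relative'
    return 'plain relative'

for link in ["../styles/scripts/jquery-1.9.1.min.js", "/PhoneBook.ico",
             "../../images/plus.png", "//www.google.com/adsense/search/async-ads.js"]:
    print(link, '->', classify(link))

For the larger goal of turning these into absolute links, urljoin from urllib.parse (urlparse in Python 2) already resolves every one of these forms against a base URL, which may be simpler than classifying them by hand.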

Problems crawling wordreference

I am trying to crawl wordreference, but I am not succeeding.
The first problem I have encountered is that a big part of the page is loaded via JavaScript, but that shouldn't be much of a problem because I can see what I need in the source code.
So, for example, I want to extract the first two meanings for a given word, so from this url: http://www.wordreference.com/es/translation.asp?tranword=crane I need to extract grulla and grúa.
This is my code:
import lxml.html as lh
import urllib2

url = 'http://www.wordreference.com/es/translation.asp?tranword=crane'
doc = lh.parse(urllib2.urlopen(url))
trans = doc.xpath('//td[@class="ToWrd"]/text()')
for i in trans:
    print i
The result is that I get an empty list.
I have tried to crawl it with Scrapy too, with no success. I am not sure what is going on; the only way I have been able to crawl it is with curl, but that is sloppy. I want to do it in an elegant way, with Python.
Thank you very much
It looks like you need a User-Agent header to be sent, see Changing user agent on urllib2.urlopen.
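If you want to stay with urllib2, a minimal sketch of sending a browser-like User-Agent looks like this (the header string is just an example):
import lxml.html as lh
import urllib2

url = 'http://www.wordreference.com/es/translation.asp?tranword=crane'
req = urllib2.Request(url, headers={'User-Agent': 'Mozilla/5.0'})  # pretend to be a browser
doc = lh.parse(urllib2.urlopen(req))
print(doc.xpath('//td[@class="ToWrd"]/text()'))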
Also, just switching to requests would do the trick (it automatically sends the python-requests/version User Agent by default):
import lxml.html as lh
import requests

url = 'http://www.wordreference.com/es/translation.asp?tranword=crane'
response = requests.get(url)
doc = lh.fromstring(response.content)
trans = doc.xpath('//td[@class="ToWrd"]/text()')
for i in trans:
    print(i)
Prints:
grulla
grúa
plataforma
...
grulla blanca
grulla trompetera

Scraping urbandictionary with Python

I'm currently working on an arcbot and I'm trying to make a command "!urbandictionary". It should scrape the meaning of a term, specifically the first one provided by Urban Dictionary. If there's another solution, e.g. another dictionary site with a better API, that's also fine. Here's my code:
if Command.lower() == '!urban':
    dictionary = Argument[1]  # this is the term which the user provides, e.g. "scrape"
    dictionaryscrape = urllib2.urlopen('http://www.urbandictionary.com/define.php?term=' + dictionary).read()  # plain html of the site
    scraped = getBetweenHTML(dictionaryscrape, '<div class="meaning">', '</div>')  # Here's my problem, I'm not sure if it scrapes the first meaning or not..
    messages.main(scraped, xSock, BotID)  # Sends the meaning of the provided word (Argument[0])
How do I correctly scrape a meaning of a word in urbandictionary?
Just get the text from the meaning class:
import requests
from bs4 import BeautifulSoup
word = "scrape"
r = requests.get("http://www.urbandictionary.com/define.php?term={}".format(word))
soup = BeautifulSoup(r.content)
print(soup.find("div",attrs={"class":"meaning"}).text)
Gassing and breaking your car repeatedly really fast so that the front and rear bumpers "scrape" the pavement; while going hyphy
There is an unofficial API here, apparently:
`http://api.urbandictionary.com/v0/define?term={word}`
From https://github.com/zdict/zdict/wiki/Urban-dictionary-API-documentation
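A minimal sketch of using that endpoint to grab the first meaning; the JSON layout (a "list" array of entries with a "definition" field) is taken from the linked documentation and could change, since the API is unofficial:
import requests

word = "scrape"
r = requests.get("http://api.urbandictionary.com/v0/define", params={"term": word})
entries = r.json().get("list", [])
if entries:
    print(entries[0]["definition"])  # first definition, same as the top one on the site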

Extracting parts of a webpage with python

So I have a data retrieval/entry project and I want to extract a certain part of a webpage and store it in a text file. I have a text file of urls and the program is supposed to extract the same part of the page for each url.
Specifically, the program copies the legal statute following "Legal Authority:" on pages such as this. As you can see, there is only one statute listed. However, some of the urls also look like this, meaning that there are multiple separated statutes.
My code works for pages of the first kind:
from sys import argv
from urllib2 import urlopen

script, urlfile, legalfile = argv
input = open(urlfile, "r")
output = open(legalfile, "w")

def get_legal(page):
    # this is where Legal Authority: starts in the code
    start_link = page.find('Legal Authority:')
    start_legal = page.find('">', start_link+1)
    end_link = page.find('<', start_legal+1)
    legal = page[start_legal+2: end_link]
    return legal

for line in input:
    pg = urlopen(line).read()
    statute = get_legal(pg)
    output.write(get_legal(pg))
This gives me the desired statute name in the "legalfile" output .txt. However, it cannot copy multiple statute names. I've tried something like this:
def get_legal(page):
    # this is where Legal Authority: starts in the code
    end_link = ""
    legal = ""
    start_link = page.find('Legal Authority:')
    while (end_link != '</a> '):
        start_legal = page.find('">', start_link+1)
        end_link = page.find('<', start_legal+1)
        end2 = page.find('</a> ', end_link+1)
        legal += page[start_legal+2: end_link]
        if
            break
    return legal
Since every list of statutes ends with '</a> ' (inspect the source of either of the two links) I thought I could use that fact (having it as the end of the index) to loop through and collect all the statutes in one string. Any ideas?
I would suggest using BeautifulSoup to parse and search your html. This will be much easier than doing basic string searches.
Here's a sample that pulls all the <a> tags found within the <td> tag that contains the <b>Legal Authority:</b> tag. (Note that I'm using the requests library to fetch the page content here; it's just a recommended and very easy-to-use alternative to urlopen.)
import requests
from BeautifulSoup import BeautifulSoup

# fetch the content of the page with requests library
url = "http://www.reginfo.gov/public/do/eAgendaViewRule?pubId=200210&RIN=1205-AB16"
response = requests.get(url)

# parse the html
html = BeautifulSoup(response.content)

# find all the <a> tags
a_tags = html.findAll('a', attrs={'class': 'pageSubNavTxt'})

def fetch_parent_tag(tags):
    # fetch the parent <td> tag of the first <a> tag
    # whose "previous sibling" is the <b>Legal Authority:</b> tag.
    for tag in tags:
        sibling = tag.findPreviousSibling()
        if not sibling:
            continue
        if sibling.getText() == 'Legal Authority:':
            return tag.findParent()

# now, just find all the child <a> tags of the parent.
# i.e. having found the parent via one child, find all the children
parent_tag = fetch_parent_tag(a_tags)
tags_you_want = parent_tag.findAll('a')

for tag in tags_you_want:
    print 'statute: ' + tag.getText()
If this isn't exactly what you needed to do, BeautifulSoup is still the tool you likely want to use for sifting through html.
They provide XML data over there, see my comment. If you think you can't download that many files (or the other end could dislike so many HTTP GET requests), I'd recommend asking their admins if they would kindly provide you with a different way of accessing the data.
I have done so twice in the past (with scientific databases). In one instance the sheer size of the dataset prohibited a download; they ran a SQL query of mine and e-mailed the results (but had previously offered to mail a DVD or hard disk). In another case, I could have made a few million HTTP requests to a web service (and they were OK with that), each fetching about 1 kB. This would have taken long, and would have been quite inconvenient (requiring some error handling, since some of these requests would always time out), and non-atomic due to paging. I was mailed a DVD.
I'd imagine that the Office of Management and Budget could be similarly accommodating.
