Python HTML parsing: getting site top level hosts - python

I have a program that takes a site's source code/HTML and outputs the a href tags; it is extremely helpful and makes use of BeautifulSoup4.
I want a variation of this code that still looks at <a href="..."> tags but returns only the top-level host names from a site's source code, for example
stackoverflow.com
google.com
etc., but NOT lower-level ones like stackoverflow.com/questions/. Right now it's outputting everything, including /, #t8, etc., and I need to filter those out.
Here is the current code I use to extract all the a href tags.
url = sys.argv[1]  #when program is invoked, takes it in like www.google.com etc.
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)
# get hosts
for a in soup.find_all('a', href=True):
    print a['href']
Thank you!

It sounds like you're looking for the .netloc attribute of the result of urlparse. It's part of the Python standard library: https://docs.python.org/2/library/urlparse.html
For example:
>>> from urlparse import urlparse
>>> url = "http://stackoverflow.com/questions/26351727/python-html-parsing-getting-site-top-level-hosts"
>>> urlparse(url).netloc
'stackoverflow.com'
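Combined with the loop from the question, a minimal sketch might look like the following (assuming url includes the scheme, e.g. http://www.google.com; urljoin is used to resolve relative hrefs such as / or #t8 before .netloc is taken, and the hosts set is just an illustration):
import sys
import urllib
from urlparse import urlparse, urljoin
from bs4 import BeautifulSoup

url = sys.argv[1]  # assumed to include the scheme, e.g. http://www.google.com
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)

hosts = set()
for a in soup.find_all('a', href=True):
    absolute = urljoin(url, a['href'])  # turns "/", "#t8", etc. into full URLs
    netloc = urlparse(absolute).netloc  # e.g. "stackoverflow.com"
    if netloc:
        hosts.add(netloc)

for host in sorted(hosts):
    print host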

Related

Python and Beautiful soup || Regex with Varible before writting to file

I would love some assistance with an issue I'm currently having.
I'm working on a little Python scanner as a project.
The libraries I'm currently importing are:
requests
BeautifulSoup
re
tld
The exact issue is regarding the 'scope' of the scanner.
I'd like to pass a URL to the code and have the scanner grab all the anchor tags from the page, but only the ones relevant to the base URL, ignoring out-of-scope links and also subdomains.
Here is my current code; I'm by no means a programmer, so please excuse the sloppy, inefficient code.
import requests
from bs4 import BeautifulSoup
import re
from tld import get_tld, get_fld

# This grabs the URL
print("Please type in a URL:")
URL = input()

# This strips out everything leaving only the TLD (Future scope function)
def strip_domain(URL):
    global domain_name
    domain_name = get_fld(URL)
strip_domain(URL)

# This makes the request, and cleans up the source code
def connection(URL):
    r = requests.get(URL)
    status = r.status_code
    sourcecode = r.text
    soup = BeautifulSoup(sourcecode, features="html.parser")
    cleanupcode = soup.prettify()

    # This strips the anchor tags and adds them to the links array
    links = []
    for link in soup.findAll('a', attrs={'href': re.compile("^http://")}):
        links.append(link.get('href'))

    # This writes our clean anchor tags to a file
    with open('source.txt', 'w') as f:
        for item in links:
            f.write("%s\n" % item)
connection(URL)
The exact code issue is around the "for link in soup.findAll" section.
I have been trying to filter the array so that it only contains anchor tags for the base domain, which is the global variable "domain_name", so that only the relevant links are written to the source.txt file. For example:
google.com accepted
google.com/file accepted
maps.google.com not written
If someone could assist me or point me in the right direction I'd appreciate it.
I was also thinking it would be possible to write every link to the source.txt file and then clean it up afterwards by removing the 'out of scope' links, but I thought it more beneficial to do it without having to create additional code.
Additionally, I'm not the strongest with regex, but here is something that may help.
This is some regex code to catch all variations of http, www, https
(^http:\/\/+|www.|https:\/\/)
To this I was going to append
.*{}'.format(domain_name)
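For illustration, I was planning to combine them roughly like this (an untested sketch reusing domain_name and the links list from my code above; scope_regex is just a placeholder name):
import re

# untested: scheme or www prefix, then anything, then the base domain
scope_regex = re.compile(r"(^http:\/\/+|www.|https:\/\/).*{}".format(domain_name))

in_scope = [link for link in links if scope_regex.search(link)]
# caveat: the ".*" before the domain would also let subdomains like maps.google.com through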
I provide two different situations, because I don't agree that the href value is always xxx.com. In practice you will get three, four, or more kinds of href values, such as /file, folder/file, etc., so you have to transform relative paths into absolute paths, otherwise you cannot gather all of the URLs.
Regex: (\/{2}([w]+.)?)([a-z.]+)(?=\/?)
(\/{2}([w]+.)?) matches the non-main part, starting from //
([a-z.]+)(?=\/?) matches the allowed characters until we reach /; we should not use .* (it would over-match)
My Code
import re

_input = "http://www.google.com/blabla"
all_part = re.findall(r"(\/{2}([w]+.)?)([a-z.]+)(?=\/?)", _input)[0]
_partA = all_part[2]            # google.com
_partB = "".join(all_part[1:])  # www.google.com
print(_partA, _partB)

site = [
    "google.com",
    "google.com/file",
    "maps.google.com"
]
href = [
    "https://www.google.com",
    "https://www.google.com/file",
    "http://maps.google.com"
]

for ele in site:
    if re.findall("^{}/?".format(_partA), ele):
        print(ele)

for ele in href:
    if re.findall("{}/?".format(_partB), ele):
        print(ele)
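As an alternative to the regex, you could also compare hostnames directly, reusing the domain_name the question already computes with get_fld. A rough, untested sketch (Python 3; the in_scope helper is just an illustration):
from urllib.parse import urlparse

def in_scope(href, domain_name):
    # keep links on the bare base domain (optionally prefixed with www.),
    # but reject subdomains such as maps.google.com
    host = urlparse(href).netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    return host == domain_name

hrefs = [
    "https://www.google.com",
    "https://www.google.com/file",
    "http://maps.google.com",
]
print([h for h in hrefs if in_scope(h, "google.com")])
# ['https://www.google.com', 'https://www.google.com/file']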

Retrieving a subset of href's from findall() in BeautifulSoup

My goal is to write a Python script that takes an artist's name as string input and appends it to the base URL of the Genius search query, then retrieves all the lyrics from the links on the returned page (the required subset for this problem, which will contain the artist's name in every link). I am in the initial phase right now and have only been able to retrieve all links from the web page, including the ones I don't want in my subset. I tried to find a simple solution but failed continuously.
import requests
# The Requests library.
from bs4 import BeautifulSoup
from lxml import html
user_input = input("Enter Artist Name = ").replace(" ","+")
base_url = "https://genius.com/search?q="+user_input
header = {'User-Agent':''}
response = requests.get(base_url, headers=header)
soup = BeautifulSoup(response.content, "lxml")
for link in soup.find_all('a', href=True):
    print (link['href'])
This returns the complete list below, while I only need the ones that end with "lyrics" and contain the artist's name (here, for instance, Drake). These will be the links from which I should be able to retrieve the lyrics.
https://genius.com/
/signup
/login
https://www.facebook.com/geniusdotcom/
https://twitter.com/Genius
https://www.instagram.com/genius/
https://www.youtube.com/user/RapGeniusVideo
https://genius.com/new
https://genius.com/Drake-hotline-bling-lyrics
https://genius.com/Drake-one-dance-lyrics
https://genius.com/Drake-hold-on-were-going-home-lyrics
https://genius.com/Drake-know-yourself-lyrics
https://genius.com/Drake-back-to-back-lyrics
https://genius.com/Drake-all-me-lyrics
https://genius.com/Drake-0-to-100-the-catch-up-lyrics
https://genius.com/Drake-started-from-the-bottom-lyrics
https://genius.com/Drake-from-time-lyrics
https://genius.com/Drake-the-motto-lyrics
/search?page=2&q=drake
/search?page=3&q=drake
/search?page=4&q=drake
/search?page=5&q=drake
/search?page=6&q=drake
/search?page=7&q=drake
/search?page=8&q=drake
/search?page=9&q=drake
/search?page=672&q=drake
/search?page=673&q=drake
/search?page=2&q=drake
/embed_guide
/verified-artists
/contributor_guidelines
/about
/static/press
mailto:brands#genius.com
https://eventspace.genius.com/
/static/privacy_policy
/jobs
/developers
/static/terms
/static/copyright
/feedback/new
https://genius.com/Genius-how-genius-works-annotated
https://genius.com/Genius-how-genius-works-annotated
My next step would be to use Selenium to emulate scrolling, which in the case of genius.com yields the entire set of search results. Any suggestions or resources would be appreciated. I would also like a few comments about the way I wish to proceed with this solution. Can we make it more generic?
P.S. I may not have explained my problem very lucidly, but I have tried my best. Questions about any ambiguities are welcome too. I am new to scraping, Python, and programming in general, so I just want to make sure that I am following the right path.
Use the re module to match only the links you want.
import requests
# The Requests library.
from bs4 import BeautifulSoup
from lxml import html
import re

user_input = input("Enter Artist Name = ").replace(" ", "+")
base_url = "https://genius.com/search?q=" + user_input
header = {'User-Agent': ''}
response = requests.get(base_url, headers=header)
soup = BeautifulSoup(response.content, "lxml")

pattern = re.compile(r"[\S]+-lyrics$")
for link in soup.find_all('a', href=True):
    if pattern.match(link['href']):
        print(link['href'])
Output:
https://genius.com/Drake-hotline-bling-lyrics
https://genius.com/Drake-one-dance-lyrics
https://genius.com/Drake-hold-on-were-going-home-lyrics
https://genius.com/Drake-know-yourself-lyrics
https://genius.com/Drake-back-to-back-lyrics
https://genius.com/Drake-all-me-lyrics
https://genius.com/Drake-0-to-100-the-catch-up-lyrics
https://genius.com/Drake-started-from-the-bottom-lyrics
https://genius.com/Drake-from-time-lyrics
https://genius.com/Drake-the-motto-lyrics
This just checks whether your link matches the pattern ending in -lyrics. You can use similar logic to filter on the user_input variable as well.
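For example, to also require the artist's name, one option is to build the pattern from the user_input variable (an untested sketch reusing the variables above; re.escape guards against special characters in the name):
artist = user_input.replace("+", "-")  # e.g. "Drake", or "Kendrick-Lamar" for multi-word names
artist_pattern = re.compile(r"https://genius\.com/{}\S*-lyrics$".format(re.escape(artist)), re.IGNORECASE)

for link in soup.find_all('a', href=True):
    if artist_pattern.match(link['href']):
        print(link['href'])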
Hope this helps.

Checking webpage for results with python and beautifulsoup

I need to check a webpage search results and compare them to user input.
ui = raw_input() #for example "Niels Bohr"
link = "http://www.enciklopedija.hr/Trazi.aspx?t=profesor,%20gdje&s=90&k=10"
stranica=urllib.urlopen(link)
soup = BeautifulSoup(stranica, from_encoding="utf-8")
beauty = soup.prettify()
print beauty
since there are 1502 results, my idea was to change k=10 to k=1502. Now I need some kind of function to check whether the search results contain my user input. I know that the names appear as the link text,
so how do I do it? Maybe using regex?
The second part is, if there are matching results, getting the link for each of them. Again, I know the link is inside the href="" attribute, but how do I get it out and make it usable?
Finding if Niels Bohr is listed is as easy as using a large batch number and loading the resulting page:
import sys
import urllib2
from bs4 import BeautifulSoup
url = "http://www.enciklopedija.hr/Trazi.aspx?t=profesor,%20gdje&s=0&k={}".format(sys.maxint)
name = u'Bohr, Niels'
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())
for link in soup.find_all(class_='AllWordsTextHit', text=name):
    print link
This produces any links that contain the text 'Bohr, Niels' as the link text. You can use a regular expression if you need a partial match.
The link object has a (relative) href attribute you can then use to load the next page:
professor_page = 'http://www.enciklopedija.hr/' + link['href']
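Putting the two steps together, a rough continuation of the loop above might look like this (a sketch; it assumes the markup and the relative href stay as shown):
for link in soup.find_all(class_='AllWordsTextHit', text=name):
    professor_page = 'http://www.enciklopedija.hr/' + link['href']
    professor_html = urllib2.urlopen(professor_page).read()
    professor_soup = BeautifulSoup(professor_html)
    # extract whatever you need from professor_soup here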

regex pattern in python for parsing HTML title tags

I am learning to use both the re module and the urllib module in python and attempting to write a simple web scraper. Here's the code I've written to scrape just the title of websites:
#!/usr/bin/python
import urllib
import re
urls=["http://google.com","https://facebook.com","http://reddit.com"]
i=0
these_regex="<title>(.+?)</title>"
pattern=re.compile(these_regex)
while i < len(urls):
    htmlfile = urllib.urlopen(urls[i])
    htmltext = htmlfile.read()
    titles = re.findall(pattern, htmltext)
    print titles
    i += 1
This gives the correct output for Google and Reddit but not for Facebook - like so:
['Google']
[]
['reddit: the front page of the internet']
This is because, as I found, on Facebook's page the title tag is as follows: <title id="pageTitle">. To accommodate the additional id=, I modified the these_regex variable as follows: these_regex="<title.+?>(.+?)</title>". But this gives the following output:
[]
['Welcome to Facebook \xe2\x80\x94 Log in, sign up or learn more']
[]
How would I combine both so that I can take into account any additional parameters passed within the title tag?
It is recommended that you use Beautiful Soup or any other parser to parse HTML, but if you badly want regex the following piece of code would do the job.
The regex code:
<title.*?>(.+?)</title>
How it works: the non-greedy .*? allows any extra attributes inside the opening <title ...> tag, so the pattern matches both <title> and <title id="pageTitle">.
Running it against the three URLs produces:
['Google']
['Welcome to Facebook - Log In, Sign Up or Learn More']
['reddit: the front page of the internet']
You are using a regular expression, and matching HTML with such expressions gets too complicated, too fast.
Use an HTML parser instead; Python has several to choose from. I recommend you use BeautifulSoup, a popular third-party library.
BeautifulSoup example:
import urllib2
from bs4 import BeautifulSoup

response = urllib2.urlopen(url)
soup = BeautifulSoup(response.read(), from_encoding=response.info().getparam('charset'))
title = soup.find('title').text
Since a title tag itself doesn't contain other tags, you can get away with a regular expression here, but as soon as you try to parse nested tags, you will run into hugely complex issues.
Your specific problem can be solved by matching additional characters within the title tag, optionally:
r'<title[^>]*>([^<]+)</title>'
This matches 0 or more characters that are not the closing > bracket. The '0 or more' here lets you match both extra attributes and the plain <title> tag.
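Dropped into the question's loop, that pattern might be used like this (a sketch based on the original Python 2 code):
import re
import urllib

urls = ["http://google.com", "https://facebook.com", "http://reddit.com"]
pattern = re.compile(r'<title[^>]*>([^<]+)</title>')

for url in urls:
    htmltext = urllib.urlopen(url).read()
    print pattern.findall(htmltext)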
If you wish to identify all the HTML tags, you can use this:
batRegex = re.compile(r'(<[a-z]*>)')
m1 = batRegex.search(htmltext)      # first match, if you only need one
print batRegex.findall(htmltext)    # every simple opening tag in the page source
You could scrape a bunch of titles with a couple lines of gazpacho:
from gazpacho import Soup

urls = ["http://google.com", "https://facebook.com", "http://reddit.com"]

titles = []
for url in urls:
    soup = Soup.get(url)
    title = soup.find("title", mode="first").text
    titles.append(title)
This will output:
titles
['Google',
'Facebook - Log In or Sign Up',
'reddit: the front page of the internet']

Can not find element with requests/BeautifulSoup

I'm writing a web scraper with requests and BeautifulSoup, and there's an element in the DOM I can't find.
Here's what I do:
import requests
from bs4 import BeautifulSoup
r = requests.get('http://www.decitre.fr/rechercher/result/?q=victor+hugo&search-scope=3')
soup = BeautifulSoup(r.text)
The element I can't find is the "old-price" (the one that is struck through), which I can see when I inspect the DOM with the browser dev tools.
soup.find_all(class_='old-price') # returns [], no matter if I specify "span"
Moreover I can't see the 'old-price' string in the soup or the result of the request:
'old-price' in soup.text # False
'old-price' in r.text # False
I can't see it when I get the source with wget too.
I can get its div parent, but can't find price children inside it:
commands = soup.find_all(class_='product_commande')
commands[0].find_all('old-price') # []
So I have no idea what's going on. What am I missing?
Am I using requests/BeautifulSoup incorrectly? (I'm not sure whether r.text returns the full HTML.)
Is that part of the HTML generated by JavaScript? If so, how can I tell, and is there a way to get the complete HTML?
Many thanks.
In my case I was passing invalid HTML into Beautiful Soup which was causing it to ignore everything after the invalid tag at the start of the document:
<!--?xml version="1.0" encoding="iso-8859-1"?-->
Note that I am also using Ghost.py. Here is how I removed the tag.
#remove invalid xml tag
ghostContent = ghost.content
invalidCode = '<!--?xml version="1.0" encoding="iso-8859-1"?-->'
if ghostContent.startswith(invalidCode):
    ghostContent = ghostContent[len(invalidCode):]
doc = BeautifulSoup(ghostContent)

#test to see if we can find text
if 'Application Search Results' in doc.text:
    print 'YES!'
