Web Scraping Emails using Python

I'm new to web scraping (using Python) and ran into a problem trying to get an email address from a university's athletic department site.
I've managed to navigate to the email I want to extract, but I don't know where to go from here. When I print what I have, all I get is '' and not the actual text of the email.
I'm attaching what I have so far; let me know if it needs a better explanation.
The website I'm trying to scrape: https://goheels.com/staff-directory
Thanks!
Here's my code:
from bs4 import BeautifulSoup
import requests
urls = ''
with open('websites.txt', 'r') as f:
    for line in f.read():
        urls += line
urls = list(urls.split())
print(urls)
for url in urls:
    res = requests.get(url)
    soup = BeautifulSoup(res.text, 'html.parser')
    try:
        body = soup.find(headers="col-staff_email category-0")
        links = body.a
        print(links)
    except Exception as e:
        print(f"This url didn't work: {url}")

The emails are hidden inside a <script> element. With a little pushing, shoving, css selecting and string splitting you can get there:
for em in soup.select('td[headers*="col-staff_email"] script'):
    target = em.text.split('var firstHalf = "')[1]
    fh = target.split('";')[0]
    lh = target.split('var secondHalf = "')[1].split('";')[0]
    print(fh + '@' + lh)
Output:
bubba.cunningham@unc.edu
molly.dalton@unc.edu
athgallo@unc.edu
dhollier@unc.edu
etc.
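If the chained string splits feel brittle, the same script text can be searched with regular expressions instead. Here is a minimal sketch of that variant; it rests on the same assumption as the answer above, namely that each script contains var firstHalf = "..."; and var secondHalf = "..."; declarations:
import re
import requests
from bs4 import BeautifulSoup
# A sketch of the same technique using re.search instead of str.split.
# Assumes the obfuscation script still defines firstHalf and secondHalf.
res = requests.get('https://goheels.com/staff-directory')
soup = BeautifulSoup(res.text, 'html.parser')
for em in soup.select('td[headers*="col-staff_email"] script'):
    first = re.search(r'var firstHalf = "([^"]+)";', em.text)
    second = re.search(r'var secondHalf = "([^"]+)";', em.text)
    if first and second:  # skip cells whose script doesn't match the expected pattern
        print(first.group(1) + '@' + second.group(1))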

Related

Beautifulsoup "findAll()" does not return the tags

I am trying to build a scraper to get some abstracts of academic papers and their corresponding titles on this page.
The problem is that my for link in bsObj.findAll('a',{'class':'search-track'}) does not return the links I need to further build my scraper. In my code, the check is like this:
for link in bsObj.findAll('a',{'class':'search-track'}):
print(link)
The for loop above does not print out anything, even though the href links should be inside the <a class="search-track" ...></a> elements.
I have referred to this post, but changing the BeautifulSoup parser does not solve the problem in my code. I am using "html.parser" in my BeautifulSoup constructor: bsObj = bs(html.content, features="html.parser").
And the print(len(bsObj)) prints out "3" while it prints out "2" for both "lxml" and "html5lib".
Also, I started off using urllib.request.urlopen to get the page and then tried requests.get() instead. Unfortunately the two approaches give me the same bsObj.
Here is the code I've written:
#from urllib.request import urlopen
import requests
from bs4 import BeautifulSoup as bs
import ssl
'''
The Elsevier search is kind of a tree structure:
keyword --> a list of journals (a journal contains many articles) --> lists of articles
'''
address = input("Please type in your keyword: ")  # My keyword is "catalyst for water splitting"
# https://www.elsevier.com/en-xs/search-results?
# query=catalyst%20for%20water%20splitting&labels=journals&page=1
address = address.replace(" ", "%20")
address = "https://www.elsevier.com/en-xs/search-results?query=" + address + "&labels=journals&page=1"
journals = []
articles = []

def getJournals(url):
    global journals
    #html = urlopen(url)
    html = requests.get(url)
    bsObj = bs(html.content, features="html.parser")
    #print(len(bsObj))
    #testFile = open('testFile.txt', 'wb')
    #testFile.write(bsObj.text.encode(encoding='utf-8', errors='strict') + '\n'.encode(encoding='utf-8', errors='strict'))
    #testFile.close()
    for link in bsObj.findAll('a', {'class': 'search-track'}):
        print(link)
        ######## does not print anything ########
        '''
        if 'href' in link.attrs and link.attrs['href'] not in journals:
            newJournal = link.attrs['href']
            journals.append(newJournal)
        '''
    return None

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
getJournals(address)
print(journals)
Can anyone tell me what the problem in my code is, such that the for loop does not print out any links? I need to store the journal links in a list and then visit each link to scrape the abstracts of the papers. The abstracts should be freely accessible, so the website shouldn't have blocked me for requesting them.
This page is dynamically loaded with JavaScript, so BeautifulSoup can't handle it directly. You may be able to do it using Selenium, but in this case you can do it by tracking the API calls made by the page (for more, see, as one of many examples, here).
In your particular case it can be done this way:
from bs4 import BeautifulSoup as bs
import requests
import json
# this is where the data is hiding:
url = "https://site-search-api.prod.ecommerce.elsevier.com/search?query=catalyst%20for%20water%20splitting&labels=journals&start=0&limit=10&lang=en-xs"
html = requests.get(url)
soup = bs(html.content, features="html.parser")
data = json.loads(str(soup))  # the response is in JSON format, so we load it into a dictionary
Note: in this case, it's also possible to dispense with BeautifulSoup altogether and load the response directly, as in data = json.loads(html.content). From this point:
hits = data['hits']['hits']  # the target urls are hidden deep inside nested dictionaries and lists
for hit in hits:
    print(hit['_source']['url'])
Output:
https://www.journals.elsevier.com/water-research
https://www.journals.elsevier.com/water-research-x
etc.
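Following the note above, the whole thing can also be done with requests alone, since requests can decode the JSON response itself. A minimal sketch, assuming the search API keeps accepting the same query parameters shown in the url above:
import requests
# Build the same API request from its parts and decode the JSON directly.
# The keyword is just the example from the question.
keyword = "catalyst for water splitting"
api_url = "https://site-search-api.prod.ecommerce.elsevier.com/search"
params = {
    "query": keyword,
    "labels": "journals",
    "start": 0,
    "limit": 10,
    "lang": "en-xs",
}
data = requests.get(api_url, params=params).json()  # same dictionary as json.loads(html.content)
for hit in data['hits']['hits']:
    print(hit['_source']['url'])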

How to have beautiful soup fetch emails?

I'm working with Beautiful Soup and would like to grab emails to a depth of my choosing in my web scraper. Currently, however, I am unsure why my web scraping tool is not working. Every time I run it, it does not populate the email list.
#!/usr/bin/python
from bs4 import BeautifulSoup, SoupStrainer
import re
import urllib
import threading

def step2():
    file = open('output.html', 'w+')
    file.close()
    # links already added
    visited = set()
    visited_emails = set()
    scrape_page(visited, visited_emails, 'https://www.google.com', 2)
    print('Webpages \n')
    for w in visited:
        print(w)
    print('Emails \n')
    for e in visited_emails:
        print(e)

# Run recursively
def scrape_page(visited, visited_emails, url, depth):
    if depth == 0:
        return
    website = urllib.urlopen(url)
    soup = BeautifulSoup(website, parseOnlyThese=SoupStrainer('a', email=False))
    emails = re.findall(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}", str(website))
    first = str(website).split('mailto:')
    for i in range(1, len(first)):
        print(first[i].split('>')[0])
    for email in emails:
        if email not in visited_emails:
            print('- got email ' + email)
            visited_emails.add(email)
    for link in soup:
        if link.has_attr('href'):
            if link['href'] not in visited:
                if link['href'].startswith('https://www.google.com'):
                    visited.add(link['href'])
                    scrape_page(visited, visited_emails, link['href'], depth - 1)

def main():
    step2()

main()
For some reason I'm unsure how to fix my code so that it adds emails to the list. If you could give me some advice, it would be greatly appreciated. Thanks!
You just need to look for the hrefs that start with mailto:
emails = [a["href"] for a in soup.select('a[href^="mailto:"]')]
I presume https://www.google.com is a placeholder for the actual site you are scraping as there are no mailto's to scrape on the google page. If there are mailto's in the source you are scraping then this will find them.
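For example, slotted into the question's scrape_page function, that could look something like the following sketch (variable names follow the question's code; the ?subject handling is just a precaution):
# Collect mailto: links from the parsed page and add the bare addresses
# to visited_emails (a sketch; names follow the question's code).
for a in soup.select('a[href^="mailto:"]'):
    email = a["href"][len("mailto:"):].split("?")[0]  # drop the scheme and any ?subject=... suffix
    if email and email not in visited_emails:
        print('- got email ' + email)
        visited_emails.add(email)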

Searching unique web links

I wrote a program to extract the web links from http://www.stevens.edu/.
Now I am facing the following problems with the program:
1 - I want to get only links starting with http and https.
2 - I am getting a parser warning from bs4 about the lack of a specified parser - solved.
How can I fix these problems? I am not finding the proper direction to solve them.
My code is:
import urllib2
from bs4 import BeautifulSoup as bs
url = raw_input('Please enter the url for which you want to see unique web links -')
print "\n"
req = urllib2.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
html = urllib2.urlopen(req).read()
soup = bs(html)
tags = soup('a')
count = 0
web_link = []
for tag in tags:
    count = count + 1
    store = tag.get('href', None)
    web_link.append(store)
print "Total no. of extracted web links are",count,"\n"
print web_link
print "\n"
Unique_list = set(web_link)
Unique_list = list(Unique_list)
print "No. of the Unique web links after using set method", len(Unique_list),"\n"
For the second problem, you need to specify the parser when creating the BeautifulSoup object for the page:
soup = bs(html,"html.parser")
This should remove your warning.
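For the first problem (keeping only links that start with http or https), one option is to filter the collected hrefs before deduplicating them. A small sketch that follows on from the question's own variables:
# Keep only absolute http/https links, skipping the None entries left by
# <a> tags that have no href (a sketch following the question's code).
http_links = [link for link in web_link
              if link is not None and link.startswith(('http://', 'https://'))]
unique_http_links = list(set(http_links))
print("No. of unique http/https links: " + str(len(unique_http_links)))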

Get all links available in a website using Python?

Is there any way, using Python, to get all the links in a web site, not only the links on a single web page? I tried this code, but it only gives me the links on one web page:
import urllib2
import re
#connect to a URL
website = urllib2.urlopen('http://www.example.com/')
#read html code
html = website.read()
#use re.findall to get all the links
links = re.findall('"((http|ftp)s?://.*?)"', html)
print links
Recursively visit the links you have gathered and scrape those pages too:
import urllib2
import re

stack = ['http://www.example.com/']
results = []
while len(stack) > 0:
    url = stack.pop()
    # connect to a URL
    website = urllib2.urlopen(url)
    # read the html code
    html = website.read()
    # use re.findall to get all the links;
    # you should not only gather links with http/ftp(s) but also relative links,
    # and you could use Beautiful Soup for that (if you want <a> links)
    # re.findall with two groups returns (link, scheme) tuples, so keep the full link
    links = [link for link, _ in re.findall('"((http|ftp)s?://.*?)"', html)]
    results.extend([link for link in links if is_not_relative_link(link)])  # this function has to be written
    for link in links:
        if link_is_valid(link):  # this function has to be written
            stack.append(link)
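As the comments hint, Beautiful Soup plus urljoin can gather the <a> links and resolve relative ones against the page URL. A sketch of that variant (Python 2, to match the answer above; example.com is still just a placeholder, and the cap on visited pages is only there to keep the sketch from running forever):
import urllib2
from urlparse import urljoin
from bs4 import BeautifulSoup

# A sketch that follows both absolute and relative <a> links, keeping a
# 'seen' set so the same page is not fetched twice.
stack = ['http://www.example.com/']
seen = set(stack)
results = []
while stack and len(seen) < 100:  # cap for the sketch
    url = stack.pop()
    html = urllib2.urlopen(url).read()
    soup = BeautifulSoup(html, 'html.parser')
    for a in soup.find_all('a', href=True):
        link = urljoin(url, a['href'])  # resolves relative hrefs such as "/about"
        if link.startswith(('http://', 'https://')) and link not in seen:
            seen.add(link)
            results.append(link)
            stack.append(link)
print(results)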

Python Web Crawler from thenewboston

I recently watched a thenewboston video on writing a web crawler using Python. For some reason, I'm getting an SSLError. I tried fixing it with line 6 of the code, but no luck. Any idea why it's throwing errors? The code is verbatim from thenewboston.
import requests
from bs4 import BeautifulSoup

def creepy_crawly(max_pages):
    page = 1
    #requests.get('https://www.thenewboston.com/', verify = True)
    while page <= max_pages:
        url = "https://www.thenewboston.com/trade/search.php?pages=" + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text)
        for link in soup.findAll('a', {'class' : 'item-name'}):
            href = "https://www.thenewboston.com" + link.get('href')
            print(href)
        page += 1

creepy_crawly(1)
I've written a web crawler using urllib; it can be faster and has no problem accessing HTTPS pages. One thing, though, is that it doesn't validate the server certificate, which makes it faster but more dangerous (vulnerable to MITM attacks).
Below is a usage example of that library:
import urllib
link = 'https://www.stackoverflow.com'
html = urllib.urlopen(link).read()
print(html)
A few lines are all you need to grab the HTML from a page, simple isn't it?
More about urllib: https://docs.python.org/2/library/urllib.html
I also recommend using a regex on the HTML to grab other links; an example of that (using the re library) would be:
import re
import urlparse  # Python 2; in Python 3 this lives in urllib.parse

# origLink is the URL of the page the HTML came from (e.g. the link fetched above)
for url in re.findall(r'<a[^>]+href=["\'](.[^"\']+)["\']', html, re.I):  # searches the HTML for other URLs
    link = url.split("#", 1)[0] \
        if url.startswith("http") \
        else '{uri.scheme}://{uri.netloc}'.format(uri=urlparse.urlparse(origLink)) + url.split("#", 1)[0]  # keeps absolute URLs, prefixes scheme://host onto root-relative ones, and strips fragments
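Putting the two snippets together, a minimal sketch of a crawl helper might look like this (Python 2, matching the answer; origLink, max_depth, and the crawl function itself are illustrative assumptions, not part of the original code):
import re
import urllib
import urlparse

def crawl(origLink, max_depth=1):
    # A sketch combining the urllib fetch and the regex link extraction above.
    # Certificates are not validated, as noted in the answer, and only
    # root-relative hrefs are resolved, mirroring the snippet above.
    seen = set([origLink])
    frontier = [origLink]
    for _ in range(max_depth):
        next_frontier = []
        for page in frontier:
            html = urllib.urlopen(page).read()
            for url in re.findall(r'<a[^>]+href=["\'](.[^"\']+)["\']', html, re.I):
                link = url.split("#", 1)[0] \
                    if url.startswith("http") \
                    else '{uri.scheme}://{uri.netloc}'.format(uri=urlparse.urlparse(page)) + url.split("#", 1)[0]
                if link not in seen:
                    seen.add(link)
                    next_frontier.append(link)
        frontier = next_frontier
    return sorted(seen)

print(crawl('https://www.stackoverflow.com'))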
