I am using BeautifulSoup for the first time and trying to collect several pieces of data, such as the email address, phone number, and mailing address, from a soup object.
Using regular expressions, I can identify the email address. My code to find the email is:
def get_email(link):
    mail_list = []
    for i in link:
        a = str(i)
        email_pattern = re.compile("<a\s+href=\"mailto:([a-zA-Z0-9._@]*)\">", re.IGNORECASE)
        ik = re.findall(email_pattern, a)
        if (len(ik) == 1):
            mail_list.append(i)
        else:
            pass
    s_email = str(mail_list[0]).split('<a href="')
    t_email = str(s_email[1]).split('">')
    print t_email[0]
Now, I also need to collect the phone number, mailing address, and web URL. I think there must be an easy way in BeautifulSoup to find those specific pieces of data.
A sample html page is as below:
<ul>
<li>
<span>Email:</span>
Message Us
</li>
<li>
<span>Website:</span>
<a target="_blank" href="http://www.abcl.com">Visit Our Website</a>
</li>
<li>
<span>Phone:</span>
(123)456-789
</li>
</ul>
And using BeautifulSoup, I am trying to collect the values that follow the Email, Website, and Phone spans.
Thanks in advance.
The most obvious problem with your code is that you're turning the object representing the link back into HTML and then parsing it with a regular expression again - that ignores much of the point of using BeautifulSoup in the first place. You might need to use a regular expression to deal with the contents of the href attribute, but that's it. Also, the else: pass is unnecessary - you can just leave it out entirely.
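For example, here is a minimal sketch (using the same BeautifulSoup 3 API as the code below, and assuming html holds your page as a string) that lets BeautifulSoup locate the mailto links and keeps the regular expression only for the href value:
from BeautifulSoup import BeautifulSoup
import re

soup = BeautifulSoup(html)
# Let BeautifulSoup match the <a> tags; the regex only filters the href value.
for a in soup.findAll('a', href=re.compile(r'^mailto:')):
    print a['href'][len('mailto:'):]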
Here's some code that does something like what you want, and might be a useful starting point:
from BeautifulSoup import BeautifulSoup
import re

# Assuming that html is your input as a string:
soup = BeautifulSoup(html)

all_contacts = []

def mailto_link(e):
    '''Return the email address if the element is a mailto link,
    otherwise return None'''
    if e.name != 'a':
        return None
    for key, value in e.attrs:
        if key == 'href':
            m = re.search('mailto:(.*)', value)
            if m:
                return m.group(1)
    return None

for ul in soup.findAll('ul'):
    contact = {}
    for li in ul.findAll('li'):
        s = li.find('span')
        if not (s and s.string):
            continue
        if s.string == 'Email:':
            a = li.find(mailto_link)
            if a:
                contact['email'] = mailto_link(a)
        elif s.string == 'Website:':
            a = li.find('a')
            if a:
                contact['website'] = a['href']
        elif s.string == 'Phone:':
            contact['phone'] = unicode(s.nextSibling).strip()
    all_contacts.append(contact)

print all_contacts
That will produce a list with one dictionary per contact found; in this case that will just be:
[{'website': u'http://www.abcl.com', 'phone': u'(123)456-789', 'email': u'abc@gmail.com'}]
Related
I want to scrape the emails from this link:
https://threebestrated.ca/children-dentists-in-airdrie-ab
but the output comes up empty because the addresses are not present in the page source (view-source).
This is the code:
import scrapy

class BooksSpider(scrapy.Spider):
    name = "3bestrated"
    allowed_domains = ['threebestrated.ca']
    start_urls = ["https://threebestrated.ca/children-dentists-in-airdrie-ab"]

    def parse(self, response):
        emails = response.xpath("//a[contains(@href, 'mailto:')]/text()").getall()
        yield {
            "a": emails,
        }
The e-mail addresses are encoded in a certain way to prevent naive scraping. Here is one such encoded e-mail address:
<p>
  <a href="/cdn-cgi/l/email-protection#3851565e57784b515d4a4a595c5d564c5954165b59074b4d5a525d5b4c056a5d494d5d4b4c1d0a084c504a574d5f501d0a086c504a5d5d7a5d4b4c6a594c5d5c165b59">
    <i class="fa fa-envelope-o"></i>
    <span class="__cf_email__" data-cfemail="70191e161f3003191502021114151e04111c5e1311">[email protected]</span>
  </a>
</p>
Which is then decoded using this JavaScript script.
So, your options are:
- Reverse-engineer the decoding script
- Use some kind of JavaScript runtime to execute the decoding script
- If you're going to use a JavaScript runtime, you might as well use Selenium to begin with (there seems to exist a scrapy-selenium middleware that you could use if you want to stick with scrapy); a rough sketch of that follows below
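For reference, here is a rough, untested sketch of the Selenium route with scrapy-selenium (the settings and SeleniumRequest usage follow that project's README, so treat the details as assumptions; the idea is that once the page's JavaScript has run, the decoded addresses should appear in the rendered DOM):
# settings.py (assumed scrapy-selenium configuration)
# SELENIUM_DRIVER_NAME = 'firefox'
# SELENIUM_DRIVER_EXECUTABLE_PATH = which('geckodriver')
# SELENIUM_DRIVER_ARGUMENTS = ['-headless']
# DOWNLOADER_MIDDLEWARES = {'scrapy_selenium.SeleniumMiddleware': 800}

import scrapy
from scrapy_selenium import SeleniumRequest

class RenderedSpider(scrapy.Spider):
    name = "3bestrated_rendered"
    allowed_domains = ["threebestrated.ca"]

    def start_requests(self):
        yield SeleniumRequest(
            url="https://threebestrated.ca/children-dentists-in-airdrie-ab",
            callback=self.parse,
        )

    def parse(self, response):
        # In the rendered page the protection placeholders should already have
        # been replaced, so ordinary mailto links are expected to be present.
        yield {"emails": response.xpath("//a[starts-with(@href, 'mailto:')]/@href").getall()}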
EDIT - I've reverse-engineered it for fun:
def deobfuscate(string, start_index):
    def extract_hex(string, index):
        substring = string[index: index + 2]
        return int(substring, 16)

    key = extract_hex(string, start_index)
    for index in range(start_index + 2, len(string), 2):
        yield chr(extract_hex(string, index) ^ key)

def process_tag(tag):
    url_fragment = "/cdn-cgi/l/email-protection#"
    href = tag["href"]
    start_index = href.find(url_fragment)
    if start_index > -1:
        return "".join(deobfuscate(href, start_index + len(url_fragment)))
    return None

def main():
    import requests
    from bs4 import BeautifulSoup as Soup
    from urllib.parse import unquote

    url = "https://threebestrated.ca/children-dentists-in-airdrie-ab"

    response = requests.get(url)
    response.raise_for_status()

    soup = Soup(response.content, "html.parser")

    print("E-Mail Addresses from <a> tags:")
    for email in map(unquote, filter(None, map(process_tag, soup.find_all("a", href=True)))):
        print(email)

    cf_elem_attr = "data-cfemail"
    print("\nE-Mail Addresses from tags where \"{}\" attribute is present:".format(cf_elem_attr))
    for tag in soup.find_all(attrs={cf_elem_attr: True}):
        email = unquote("".join(deobfuscate(tag[cf_elem_attr], 0)))
        print(email)

if __name__ == "__main__":
    import sys
    sys.exit(main())
Output:
E-Mail Addresses from <a> tags:
info@sierradental.ca?subject=Request through ThreeBestRated.ca
reviews@threebestrated.ca?subject=My Review for Dr. Amin Salmasi in Airdrie
info@mainstreetdentalairdrie.ca?subject=Request through ThreeBestRated.ca
reviews@threebestrated.ca?subject=My Review for Dr. James Yue in Airdrie
friends@toothpals.ca?subject=Request through ThreeBestRated.ca
reviews@threebestrated.ca?subject=My Review for Dr. Christine Bell in Airdrie
support@threebestrated.ca
E-Mail Addresses from tags where "data-cfemail" attribute is present:
info@sierradental.ca
friends@toothpals.ca
support@threebestrated.ca
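If you prefer to stay inside the original scrapy spider instead of switching to requests and BeautifulSoup, a rough, untested sketch (it assumes the deobfuscate helper above is defined in, or imported into, the spider module) could look like this:
import scrapy
from urllib.parse import unquote

class BooksSpider(scrapy.Spider):
    name = "3bestrated"
    allowed_domains = ["threebestrated.ca"]
    start_urls = ["https://threebestrated.ca/children-dentists-in-airdrie-ab"]

    def parse(self, response):
        fragment = "/cdn-cgi/l/email-protection#"
        for href in response.xpath("//a[contains(@href, 'email-protection')]/@href").getall():
            # Strip the protection prefix and decode it with the helper above.
            encoded = href[href.find(fragment) + len(fragment):]
            yield {"email": unquote("".join(deobfuscate(encoded, 0)))}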
I have an assignment where one of the things I can do is find the first 3 sentences of a webpage and display them. Finding the webpage text is easy enough, but I'm having problems figuring out how to find the first 3 sentences.
import requests
from bs4 import BeautifulSoup

url = 'https://www.troyhunt.com/the-773-million-record-collection-1-data-reach/'
res = requests.get(url)
html_page = res.content
soup = BeautifulSoup(html_page, 'html.parser')
text = soup.find_all(text=True)

output = ''
blacklist = [
    '[document]',
    'noscript',
    'header',
    'html',
    'meta',
    'head',
    'input',
    'script'
]

for t in text:
    if (t.parent.name not in blacklist):
        output += '{} '.format(t)

tempout = output.split('.')
for i in range(tempout):
    if (i >= 3):
        tempout.remove(i)

output = '.'.join(tempout)
print(output)
Finding sentences out of text is difficult. Normally you would look for characters that might complete a sentence, such as '.' and '!'. But a period ('.') could appear in the middle of a sentence as in an abbreviation of a person's name, for example. I use a regular expression to look for a period followed by either a single space or the end of the string, which works for the first three sentences, but not for any arbitrary sentence.
import requests
from bs4 import BeautifulSoup
import re

url = 'https://www.troyhunt.com/the-773-million-record-collection-1-data-reach/'
res = requests.get(url)
html_page = res.content
soup = BeautifulSoup(html_page, 'html.parser')

paragraphs = soup.select('section.article_text p')
sentences = []
for paragraph in paragraphs:
    matches = re.findall(r'(.+?[.!])(?: |$)', paragraph.text)
    needed = 3 - len(sentences)
    found = len(matches)
    n = min(found, needed)
    for i in range(n):
        sentences.append(matches[i])
    if len(sentences) == 3:
        break

print(sentences)
Prints:
['Many people will land on this page after learning that their email address has appeared in a data breach I\'ve called "Collection #1".', "Most of them won't have a tech background or be familiar with the concept of credential stuffing so I'm going to write this post for the masses and link out to more detailed material for those who want to go deeper.", "Let's start with the raw numbers because that's the headline, then I'll drill down into where it's from and what it's composed of."]
To scrape the first three sentences, just add these lines to your code:
section = soup.find('section',class_ = "article_text post") #Finds the section tag with class "article_text post"
txt = section.p.text #Gets the text within the first p tag within the variable section (the section tag)
print(txt)
Output:
Many people will land on this page after learning that their email address has appeared in a data breach I've called "Collection #1". Most of them won't have a tech background or be familiar with the concept of credential stuffing so I'm going to write this post for the masses and link out to more detailed material for those who want to go deeper.
Hope that this helps!
Actually, using BeautifulSoup you can filter by the class "article_text post" (see the page source):
myData=soup.find('section',class_ = "article_text post")
print(myData.p.text)
And get the inner text of the p element.
Use this instead of soup = BeautifulSoup(html_page, 'html.parser')
This is the HTML:
<div><div id="NhsjLK">
<li class="EditableListItem NavListItem FollowersNavItem NavItem not_removable">
Followers <span class="list_count">92</span></li></div></div>
I want to extract the text 92, convert it to an integer, and print it in Python 2. How can I do that?
Code:
i = soup.find('div', id='NhsjLK')
print "Followers :", i.find('span', id='list_count').text
I wouldn't go with getting it by the class directly, since I think "list_count" is too broad a class value and might be used for other things on the page.
There are definitely several different options judging by this HTML snippet alone, but one of the nicest, from my point of view, is to use the "Followers" text/label and get its next sibling:
from bs4 import BeautifulSoup
data = """
<div><div id="NhsjLK">
<li class="EditableListItem NavListItem FollowersNavItem NavItem not_removable">
Followers <span class="list_count">92</span></li></div></div>"""
soup = BeautifulSoup(data, "html.parser")
count = soup.find(text=lambda text: text and text.startswith('Followers')).next_sibling.get_text()
count = int(count)
print(count)
Or, another very concise and reliable approach would be to use a partial match (the *= part below) on the href value of the parent a element:
count = int(soup.select_one("a[href*=followers] .list_count").get_text())
Or, you might check the class value of the parent li element:
count = int(soup.select_one("li.FollowersNavItem .list_count").get_text())
Hi I need to pass a variable to the soup.find() function, but it doesn't work :(
Does anyone know a solution for this?
from bs4 import BeautifulSoup
html = '''<div> blabla
<p class='findme'> p-tag content</p>
</div>'''
sources = {'source1': '\'p\', class_=\'findme\'',
           'source2': '\'span\', class_=\'findme2\'',
           'source3': '\'div\', class_=\'findme3\'',}
test = BeautifulSoup(html)
# this works
#print(test.find('p', class_='findme'))
# >>> <p class="findme"> p-tag content</p>
# this doesn't work
tag = '\'p\' class_=\'findme\''
# a source gets passed
print(test.find(sources[source]))
# >>> None
I am trying to split it up as suggested like this:
pattern = '"p", {"class": "findme"}'
tag = pattern.split(', ')
tag1 = tag[0]
filter = tag[1]
date = test.find(tag1, filter)
I don't get errors, just None for date. The problem is probably the content of tag1 and filter. The PyCharm debugger gives me:
tag1 = '"p"'
filter = '{"class": "findme"}'
Printing them doesn't show these apostrophes. Is it possible to remove them?
The first argument is a tag name, and your string doesn't contain that. BeautifulSoup (or Python, generally) won't parse out a string like that; it cannot guess that you put some arbitrary Python syntax in that value.
Separate out the components:
tag = 'p'
filter = {'class_': 'findme'}
test.find(tag, **filter)
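If you want to keep a lookup table like your sources dict, one option (a sketch with made-up key names) is to store the tag name and the keyword filters as separate items instead of packing everything into one string:
from bs4 import BeautifulSoup

html = '''<div> blabla
<p class='findme'> p-tag content</p>
</div>'''

# Each entry keeps the tag name and the keyword filters separate.
sources = {
    'source1': ('p', {'class_': 'findme'}),
    'source2': ('span', {'class_': 'findme2'}),
    'source3': ('div', {'class_': 'findme3'}),
}

soup = BeautifulSoup(html, 'html.parser')
tag, kwargs = sources['source1']
print(soup.find(tag, **kwargs))
# <p class="findme"> p-tag content</p>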
Okay I got it, thanks again.
dic_date = {'source1': 'p, class:findme', other sources ...}
pattern = dic_date[source]
tag = pattern.split(', ')
if len(tag) == 2:
    att = tag[1].split(':')   # getting the attribute
    att = {att[0]: att[1]}    # building a dictionary for the attributes
    date = soup.find(tag[0], att)
else:
    date = soup.find(tag[0])  # if there is only a tag without an attribute
Well it doesn't look very nice but it's working :)
I'm continuing to write my Twitter crawler and am running into more problems. Take a look at the code below:
from BeautifulSoup import BeautifulSoup
import re
import urllib2

url = 'http://mobile.twitter.com/NYTimesKrugman'

def gettweets(soup):
    tags = soup.findAll('div', {'class': "list-tweet"})  # to obtain tweet of a follower
    for tag in tags:
        print tag.renderContents()
        print ('\n\n')

def are_more_tweets(soup):  # to check whether there is more than one page on mobile
    links = soup.findAll('a', {'href': True}, {id: 'more_link'})
    for link in links:
        b = link.renderContents()
        test_b = str(b)
        if test_b.find('more'):
            return True
        else:
            return False

def getnewlink(soup):  # to get the link to go to the next page of tweets on twitter
    links = soup.findAll('a', {'href': True}, {id: 'more_link'})
    for link in links:
        b = link.renderContents()
        if str(b) == 'more':
            c = link['href']
            d = 'http://mobile.twitter.com' + c
            return d
def checkforstamp(soup):  # check whether any of the tweets are a month old
    times = soup.findAll('a', {'href': True}, {'class': 'status_link'})
    for time in times:
        stamp = time.renderContents()
        test_stamp = str(stamp)
        if test_stamp.find('month'):
            return True
        else:
            return False

response = urllib2.urlopen(url)
html = response.read()
soup = BeautifulSoup(html)
gettweets(soup)
stamp = checkforstamp(soup)
tweets = are_more_tweets(soup)
print 'stamp' + str(stamp)
print 'tweets' + str(tweets)
while (not stamp) and tweets:
    b = getnewlink(soup)
    print b
    red = urllib2.urlopen(b)
    html = red.read()
    soup = BeautifulSoup(html)
    gettweets(soup)
    stamp = checkforstamp(soup)
    tweets = are_more_tweets(soup)
print 'done'
The code works in the following way, for a single user (NYTimesKrugman):
- I obtain all tweets on a single page (gettweets).
- Provided more tweets exist (are_more_tweets) and I haven't obtained a month of tweets yet (checkforstamp), I get the link for the next page of tweets.
- I go to the next page of tweets (entering the while loop) and continue the process until one of the above conditions is violated.
However, I have done extensive testing and determined that I am not actually able to enter the while loop. This is strange, because my code is written such that tweets should be True and stamp should be False. However, I'm getting the results below. I am truly baffled!
<div>
<span>
<strong>NYTimeskrugman</strong>
<span class="status">What Would I Have Done? <a rel="nofollow" href="http://nyti.ms/nHxb8L" target="_blank" class="twitter_external_link">http://nyti.ms/nHxb8L</a></span>
</span>
<div class="list-tweet-status">
3 days ago
</div>
<div class="list-tweet-actions">
</div>
</div>
stampTrue
tweetsTrue
done
If someone could help, that'd be great. Why can I not get more than one page of tweets? Is my parsing in checkforstamp being done incorrectly? Thanks.
if test_stamp.find('month'):
will evaluate to True whenever 'month' is not found, because str.find returns -1 when the substring is absent, and -1 is truthy. It would only evaluate to False here if 'month' were at the very beginning of the string, so that its position was 0.
You need
if test_stamp.find('month') != -1:
or just
return test_stamp.find('month') != -1
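To see why the original test misfires, here is a small illustration (not from the original post):
test_stamp = '3 days ago'
print test_stamp.find('month')        # -1: the substring is absent
print bool(test_stamp.find('month'))  # True, because -1 is truthy
print test_stamp.find('month') != -1  # False: the check you actually want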
Your checkforstamp function returns non-empty, defined strings:
return 'True'
So (not stamp) will always be false.
Change it to return booleans like are_more_tweets does:
return True
and it should be fine.
For reference, see the boolean operations documentation:
In the context of Boolean operations, and also when expressions are used by control flow statements, the following values are interpreted as false: False, None, numeric zero of all types, and empty strings and containers (including strings, tuples, lists, dictionaries, sets and frozensets). All other values are interpreted as true.
...
The operator not yields True if its argument is false, False otherwise.
Edit:
Same problem with the if test in checkforstamp. Since find('substr') returns -1 when the substring is not found, str.find('substr') in boolean context will be True if there is no match according to the rules above.
That is not the only place in your code where this problem appears. Please review all your tests.
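Putting both fixes together, here is a minimal sketch of checkforstamp (not the poster's exact code; it uses the in operator instead of find and merges the attribute filters into a single dict) that returns a real boolean:
def checkforstamp(soup):
    # Return True as soon as any status link mentions 'month', otherwise False.
    times = soup.findAll('a', {'href': True, 'class': 'status_link'})
    for time in times:
        if 'month' in str(time.renderContents()):
            return True
    return False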