While Loops Problem with Python Twitter Crawler

I'm continuing to write my Twitter crawler and am running into more problems. Take a look at the code below:
from BeautifulSoup import BeautifulSoup
import re
import urllib2
url = 'http://mobile.twitter.com/NYTimesKrugman'
def gettweets(soup):
    tags = soup.findAll('div', {'class' : "list-tweet"})  # to obtain tweet of a follower
    for tag in tags:
        print tag.renderContents()
        print ('\n\n')

def are_more_tweets(soup):  # to check whether there is more than one page on mobile
    links = soup.findAll('a', {'href': True}, {id: 'more_link'})
    for link in links:
        b = link.renderContents()
        test_b = str(b)
        if test_b.find('more'):
            return True
        else:
            return False

def getnewlink(soup):  # to get the link to go to the next page of tweets on twitter
    links = soup.findAll('a', {'href': True}, {id : 'more_link'})
    for link in links:
        b = link.renderContents()
        if str(b) == 'more':
            c = link['href']
            d = 'http://mobile.twitter.com' + c
            return d
def checkforstamp(soup):  # the parser scans a webpage to check if any of the tweets are a month old
    times = soup.findAll('a', {'href': True}, {'class': 'status_link'})
    for time in times:
        stamp = time.renderContents()
        test_stamp = str(stamp)
        if test_stamp.find('month'):
            return True
        else:
            return False

response = urllib2.urlopen(url)
html = response.read()
soup = BeautifulSoup(html)
gettweets(soup)
stamp = checkforstamp(soup)
tweets = are_more_tweets(soup)
print 'stamp' + str(stamp)
print 'tweets' + str(tweets)

while (not stamp) and tweets:
    b = getnewlink(soup)
    print b
    red = urllib2.urlopen(b)
    html = red.read()
    soup = BeautifulSoup(html)
    gettweets(soup)
    stamp = checkforstamp(soup)
    tweets = are_more_tweets(soup)

print 'done'
The code works in the following way:
For a single user NYTimesKrugman
- I obtain all tweets on a single page (gettweets)
- provided more tweets exist (are_more_tweets) and I haven't yet collected a month of tweets (checkforstamp), I get the link for the next page of tweets
- I go to the next page of tweets (entering the while loop) and continue the process until one of the above conditions is violated
However, I have done extensive testing and determined that the program never actually enters the while loop. This is strange, because my code is written such that tweets should be True and stamp should be False. Instead, I'm getting the results below. I am truly baffled!
<div>
<span>
<strong>NYTimeskrugman</strong>
<span class="status">What Would I Have Done? <a rel="nofollow" href="http://nyti.ms/nHxb8L" target="_blank" class="twitter_external_link">http://nyti.ms/nHxb8L</a></span>
</span>
<div class="list-tweet-status">
3 days ago
</div>
<div class="list-tweet-actions">
</div>
</div>
stampTrue
tweetsTrue
done
>>>
If someone could help that'd be great. Why can I not get more than 1 page of tweets? Is my parsing in checkforstamp being done incorrectly? Thanx.

if test_stamp.find('month'):
will evaluate to True even when 'month' is not found, because str.find returns -1 (a truthy value) when the substring is missing. It would only evaluate to False if 'month' were at the very beginning of the string, i.e. at position 0.
You need
if test_stamp.find('month') != -1:
or just
return test_stamp.find('month') != -1
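For example, in an interactive session (illustrative values only):
>>> '3 days ago'.find('month')
-1
>>> bool(-1)    # -1 is truthy, so the bare `if` test passes
True
>>> '3 days ago'.find('month') != -1
False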

Your checkforstamp function returns non-empty, defined strings:
return 'True'
So (not stamp) will always be false.
Change it to return booleans like are_more_tweets does:
return True
and it should be fine.
For reference, see the boolean operations documentation:
In the context of Boolean operations, and also when expressions are used by control flow statements, the following values are interpreted as false: False, None, numeric zero of all types, and empty strings and containers (including strings, tuples, lists, dictionaries, sets and frozensets). All other values are interpreted as true.
...
The operator not yields True if its argument is false, False otherwise.
Edit:
Same problem with the if test in checkforstamp. Since find('substr') returns -1 when the substring is not found, str.find('substr') in boolean context will be True if there is no match according to the rules above.
That is not the only place in your code where this problem appears. Please review all your tests.
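Putting both fixes together, a corrected checkforstamp might look like this (a sketch based on the question's element matching, merging the attribute filters into one dict and only returning False after every timestamp link has been checked):
def checkforstamp(soup):
    # True if any timestamp link on the page mentions 'month', else False
    times = soup.findAll('a', {'href': True, 'class': 'status_link'})
    for time in times:
        if str(time.renderContents()).find('month') != -1:
            return True
    return False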

Related

How to get twitter profile name using python BeautifulSoup module?

I'm trying to get the Twitter profile name from a profile URL with BeautifulSoup in Python, but whatever HTML tags I use, I'm not able to get the name. Which HTML tags can I use to get the profile name from a Twitter user page?
import requests
from bs4 import BeautifulSoup

url = 'https://twitter.com/twitterID'
html = requests.get(url).text
soup = BeautifulSoup(html, 'html.parser')

# Find the display name
name_element = soup.find('span', {'class': 'css-901oao css-16my406 r-poiln3 r-bcqeeo r-qvutc0'})
if name_element is not None:
    display_name = name_element.text
else:
    display_name = "error"
html = requests.get(url).text
Twitter profile pages cannot be scraped simply through requests like this, since the contents of the profile pages are loaded with JavaScript [via the API], as you might notice if you preview the source HTML in your browser's network logs or check the fetched HTML.
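A quick, illustrative way to confirm what requests actually receives is to check for the data-testid marker that the selectors below rely on:
import requests
html = requests.get('https://twitter.com/twitterID').text
print('data-testid="UserName"' in html)   # typically False - the profile markup is injected client-side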
name_element = soup.find('span', {'class':'css-901oao css-16my406 r-poiln3 r-bcqeeo r-qvutc0'})
display_name = name_element.text
Even after fetching the right HTML, calling .find like that will result in display_name containing 'To view keyboard shortcuts, press question mark' or 'Don’t miss what’s happening' because there are 67 span tags with that class. Calling .find_all(....)[6] might work but it's definitely not a reliable approach. You should instead consider using .select with CSS selectors to target the name.
name_element = soup.select_one('div[data-testid="UserName"] span>span')
The .find equivalent would be
# name_element = soup.find('div', {'data-testid': 'UserName'}).span.span ## too many weak points
name_element = soup.find(lambda t: t.name == t.parent.name == 'span' and t.find_parent('div', {'data-testid': 'UserName'}))
but I find .select much more convenient.
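Either way, guard against a missing element before reading .text (assuming soup here was built from the fully rendered HTML, e.g. via Selenium as in the next section):
display_name = name_element.text if name_element is not None else 'error'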
Selenium Example
This example uses two functions I often use for scraping: linkToSoup_selenium (which takes a URL and returns a BeautifulSoup object after using selenium and bs4 to load and parse the HTML) and selectForList (which extracts details from bs4 Tags based on selectors [like those in the selectors dictionary below]).
Setup:
# imports ## PASTE FROM https://pastebin.com/kEC9gPC8
# def linkToSoup_selenium... ## PASTE FROM https://pastebin.com/kEC9gPC8
# def selectForList... ## PASTE FROM https://pastebin.com/ZnZ7xM6u
## JUST FOR REDUCING WHITESPACE - not important for extracting information ##
def miniStr(o): return ' '.join(w for w in str(o).split() if w)
profileUrls = ['https://twitter.com/twitterID', 'https://twitter.com/jokowi', 'https://twitter.com/sep_colin']
# ptSel = 'article[data-testid="tweet"]:has(div[data-testid="socialContext"])'
# ptuaSel = 'div[data-testid="User-Names"]>div>div>div>a'
selectors = {
    'og_url': ('meta[property="og\:url"][content]', 'content'),
    'name_span': 'div[data-testid="UserName"] span>span',
    'name_div': 'div[data-testid="UserName"]',
    # 'handle': 'div[data-testid="UserName"]>div>div>div+div',
    'description': 'div[data-testid="UserDescription"]',
    # 'location': 'span[data-testid="UserLocation"]>span',
    # 'url_href': ('a[data-testid="UserUrl"][href]', 'href'),
    # 'url_text': 'a[data-testid="UserUrl"]>span',
    # 'birthday': 'span[data-testid="UserBirthdate"]',
    # 'joined': 'span[data-testid="UserJoinDate"]>span',
    # 'following': 'div[data-testid="UserName"]~div>div>a[href$="\/following"]',
    # 'followers': 'div[data-testid="UserName"]~div>div>a[href$="\/followers"]',
    # 'pinnedTweet_uname': f'{ptSel} div[data-testid="User-Names"] span>span',
    # 'pinnedTweet_handl': f'{ptSel} {ptuaSel}:not([aria-label])',
    # 'pinnedTweet_pDate': (f'{ptSel} {ptuaSel}[aria-label]', 'aria-label'),
    # 'pinnedTweet_text': f'{ptSel} div[data-testid="tweetText"]',
}

def scrapeTwitterProfile(profileUrl, selRef=selectors):
    soup = linkToSoup_selenium(profileUrl, ecx=[
        'div[data-testid="UserDescription"]'  # wait for user description to load
        # 'article[data-testid="tweet"]'  # wait for tweets to load
    ], tmout=3, by_method='css', returnErr=True)
    if not isinstance(soup, str): return selectForList(soup, selRef)
    return {'Error': f'failed to scrape {profileUrl} - {soup}'}
Setting returnErr=True returns the error message (a string instead of the BeautifulSoup object) if anything goes wrong. ecx should be set based on which part/s you want to load (it's a list, so it can have multiple selectors). tmout doesn't have to be passed (the default is 25sec), but if it is, it should be adjusted for the other arguments and your own device and browser speeds - on my browser, tmout=0.01 is enough to load user details, but loading the first tweets takes at least tmout=2.
I wrote scrapeTwitterProfile mostly so that I could get tuDets [below] in one line. The for-loop after that is just for printing the results.
tuDets = [scrapeTwitterProfile(url) for url in profileUrls]
for url, d in zip(profileUrls, tuDets):
    print('\nFrom', url)
    for k, v in d.items(): print(f'\t{k}: {miniStr(v)}')
snscrape Example
snscrape has a module for Twitter that can be used to access Twitter data without having signed up for the API yourself. The example below prints output similar to the previous example, but is faster.
# import snscrape.modules.twitter as sns_twitter
# def miniStr(o): return ' '.join(w for w in str(o).split() if w)
# profileIDs = [url.split('twitter.com/', 1)[-1].split('/')[0] for url in profileUrls]
profileIDs = ['twitterID', 'jokowi', 'sep_colin']
keysList = ['username', 'id', 'displayname', 'description', 'url']
for pid in profileIDs:
    tusRes, defVal = sns_twitter.TwitterUserScraper(pid).entity, 'no such attribute'
    print('\nfor ID', pid)
    for k in keysList: print('\t', k, ':', miniStr(getattr(tusRes, k, defVal)))
You can get most of the attributes in .entity with .__dict__ or print them all with something like
print('\n'.join(f'{a}: {miniStr(v)}' for a, v in [
    (n, getattr(tusRes, n)) for n in dir(tusRes)
] if not (a[:1] == '_' or callable(v))))
See this example from this tutorial if you are interested in scraping tweets as well.
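A minimal tweet-scraping sketch with snscrape might look like the following (attribute names such as content vs rawContent vary between snscrape versions, so adjust for yours):
import itertools
import snscrape.modules.twitter as sns_twitter

# print the five most recent tweets from one of the profiles above
for tweet in itertools.islice(sns_twitter.TwitterUserScraper('jokowi').get_items(), 5):
    print(tweet.date, getattr(tweet, 'rawContent', getattr(tweet, 'content', '')))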

Comparing results with Beautiful Soup in Python

I've got the following code that filters a particular search on an auction site.
I can display the title of each result and also the len of all returned values:
from bs4 import BeautifulSoup
import requests
url = requests.get("https://www.trademe.co.nz/a/marketplace/music-instruments/instruments/guitar-bass/electric-guitars/search?search_string=prs&condition=used")
soup = BeautifulSoup(url.text, "html.parser")
listings = soup.findAll("div", attrs={"class":"tm-marketplace-search-card__title"})
print(len(listings))
for listing in listings:
    print(listing.text)
This prints out the following:
#print(len(listings))
3
#for listing in listings:
# print(listing.text)
PRS. Ten Top Custom 24, faded Denim, Piezo.
PRS SE CUSTOM 22
PRS Tremonti SE *With Seymour Duncan Pickups*
I know what I want to do next, but I don't know how to code it. Basically, I only want to display new results. I was thinking of storing the len of the listings (3 at the moment) as a variable and then comparing it with a second GET request (a second variable) that maybe runs first thing in the morning; alternatively, compare the text values instead of the len. If they don't match, then show the new listings. Is there a better or different way to do this? Any help appreciated, thank you.
With length-comparison, there is the issue of some results being removed between checks, so it might look like there are no new results even if there are; and text-comparison does not account for results with similar titles.
I can suggest 3 other methods. (The 3rd uses my preferred approach.)
Closing time
A comment suggested using the closing time, which can be found in the tag before the title; you can define a function to get the days until closing
from datetime import date
import dateutil.parser
def get_days_til_closing(lSoup):
    cTxt = lSoup.previous_sibling.find('div', {'tmid': 'closingtime'}).text
    cTime = dateutil.parser.parse(cTxt.replace('Closes:', '').strip())
    return (cTime.date() - date.today()).days
and then filter by the returned value
min_dtc = 3 # or as preferred
# your current code upto listings = soup.findAll....
new_listings = [l for l in listings if get_days_til_closing(l) > min_dtc]
print(len(new_listings), f'new listings [of {len(listings)}]')
for listing in new_listings: print(listing.text)
However, I don't know if sellers are allowed to set their own closing times or if they're set at a fixed offset; also, I don't see the closing time text when inspecting with the browser dev tools [even though I could extract it with the code above], and that makes me a bit unsure of whether it's always available.
JSON list of Listing IDs
Each result is in a "card" with a link to the relevant listing, and that link contains a number that I'm calling the "listing ID". You can save that in a list as a JSON file and keep checking against it every new scrape
from bs4 import BeautifulSoup
import requests
import json
lFilename = 'listing_ids.json' # or as preferred
url = requests.get("https://www.trademe.co.nz/a/marketplace/music-instruments/instruments/guitar-bass/electric-guitars/search?search_string=prs&condition=used")
try:
    prev_listings = json.load(open(lFilename, 'r'))
except Exception as e:
    prev_listings = []
print(len(prev_listings), 'saved listings found')

soup = BeautifulSoup(url.text, "html.parser")
listings = soup.select("div.o-card > a[href*='/listing/']")
new_listings = [
    l for l in listings if
    l.get('href').split('/listing/')[1].split('?')[0]
    not in prev_listings
]
print(len(new_listings), f'new listings [of {len(listings)}]')

for listing in new_listings:
    print(listing.select_one('div.tm-marketplace-search-card__title').text)

with open(lFilename, 'w') as f:
    json.dump(prev_listings + [
        l.get('href').split('/listing/')[1].split('?')[0]
        for l in new_listings
    ], f)
As long as they don't tend to recycle the listing ids, this should be fairly reliable. (Even then, every once in a while, after checking the new listings for that day, you can just delete the JSON file and re-run the program once; that also keeps the file from getting too big...)
CSV Logging [including Listing IDs]
Instead of just saving the IDs, you can save pretty much all the details from each result
from bs4 import BeautifulSoup
import requests
from datetime import date
import pandas
lFilename = 'listings.csv' # or as preferred
max_days = 60 # or as preferred
date_today = date.today()
url = requests.get("https://www.trademe.co.nz/a/marketplace/music-instruments/instruments/guitar-bass/electric-guitars/search?search_string=prs&condition=used")
try:
    prev_listings = pandas.read_csv(lFilename).to_dict(orient='records')
    prevIds = [str(l['listing_id']) for l in prev_listings]
except Exception as e:
    prev_listings, prevIds = [], []
print(len(prev_listings), 'saved listings found')

def get_listing_details(lSoup, prevList, lDate=date_today):
    selectorsRef = {
        'title': 'div.tm-marketplace-search-card__title',
        'location_time': 'div.tm-marketplace-search-card__location-and-time',
        'footer': 'div.tm-marketplace-search-card__footer',
    }
    lId = lSoup.get('href').split('/listing/')[1].split('?')[0]
    lDets = {'listing_id': lId}
    for k, sel in selectorsRef.items():
        s = lSoup.select_one(sel)
        lDets[k] = None if s is None else s.text
    lDets['listing_link'] = 'https://www.trademe.co.nz/a/' + lSoup.get('href')
    lDets['new_listing'] = lId not in prevList
    lDets['last_scraped'] = lDate.isoformat()
    return lDets

soup = BeautifulSoup(url.text, "html.parser")
listings = [
    get_listing_details(s, prevIds) for s in
    soup.select("div.o-card > a[href*='/listing/']")
]
todaysIds = [l['listing_id'] for l in listings]
new_listings = [l for l in listings if l['new_listing']]
print(len(new_listings), f'new listings [of {len(listings)}]')
for listing in new_listings: print(listing['title'])

prev_listings = [
    p for p in prev_listings if str(p['listing_id']) not in todaysIds
    and (date_today - date.fromisoformat(p['last_scraped'])).days < max_days
]
pandas.DataFrame(prev_listings + listings).to_csv(lFilename, index=False)
You'll end up with a spreadsheet of scraping history/log that you can check anytime, and depending on what you set max_days to, the oldest data will be automatically cleared.
Fixed it with the following:
allGuitars = ["",]
latestGuitar = soup.select("#-title")[0].text.strip()
if latestGuitar in allGuitars[0]:
    print("No change. The latest listing is still: " + allGuitars[0])
elif not latestGuitar in allGuitars[0]:
    print("New listing detected! - " + latestGuitar)
    allGuitars.clear()
    allGuitars.insert(0, latestGuitar)

Extracting text section from (Edgar 10-K filings) HTML

I am trying to extract a certain section from HTML files. To be specific, I am looking for the "ITEM 1" section of 10-K filings (the annual business reports that US companies file). E.g.:
https://www.sec.gov/Archives/edgar/data/1591890/000149315218003887/form10-k.htm#a_002
Problem: I am not able to find the "ITEM 1" section, nor do I have an idea how to tell my algorithm to search from that point ("ITEM 1") to another point (e.g. "ITEM 1A") and extract the text in between.
I am super thankful for any help.
Among others, I have tried this (and similar), but my bd is always empty:
try:
    # bd = soup.body.findAll(text=re.compile('^ITEM 1$'))
    # bd = soup.find_all(name="ITEM 1")
    # bd = soup.find_all(["ITEM 1", "ITEM1", "Item 1", "Item1", "item 1", "item1"])
    print(" Business Section (Item 1): ", bd.content)
except:
    print("\n Section not found!")
Using Python 3.7 and Beautifulsoup4
Regards Heka
As I mentioned in a comment, because of the nature of EDGAR, this may work on one filing but fail on another. The principles, though, should generally work (after some adjustments...)
import requests
import lxml.html
url = 'https://www.sec.gov/Archives/edgar/data/1591890/000149315218003887/form10-k.htm#a_002'
source = requests.get(url)
doc = lxml.html.fromstring(source.text)
tabs = doc.xpath('//table[./tr/td/font/a[@name="a_002"]]/following-sibling::p/font')
# in this filing, Item 1 is hiding in a series of <p> tags following a table with an <a> tag
# whose "name" attribute has a value of "a_002"
flag = ''
for i in tabs:
    if flag == 'stop':
        break
    if i.text is not None:  # we now start extracting the text from each <p> tag and move to the next
        print(i.text_content().strip().replace('\n', ''))
    nxt = i.getparent().getnext()
    # the following detects when the <p> tags of Item 1 end and the next Item begins, and then stops
    if nxt is not None and nxt.tag == 'table':
        for j in nxt.iterdescendants():
            if j.tag == 'a' and j.values()[0] == 'a_003':
                # we have encountered the <a> tag with a "name" attribute of "a_003",
                # indicating the beginning of the next Item, so we stop
                flag = 'stop'
The output is the text of Item 1 in this filing.
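A cruder but sometimes sufficient alternative is to flatten the filing to plain text and slice between the headings with a regex; a rough sketch (heading strings and whitespace vary between filings, and SEC may require a User-Agent header, so both usually need tuning):
import re
import requests
from bs4 import BeautifulSoup

url = 'https://www.sec.gov/Archives/edgar/data/1591890/000149315218003887/form10-k.htm'
html = requests.get(url, headers={'User-Agent': 'research script (contact@example.com)'}).text
text = BeautifulSoup(html, 'html.parser').get_text(' ', strip=True)

# both the table of contents and the body contain the headings,
# so take the longest captured span (the body section)
matches = re.findall(r'ITEM\s+1\.?\s*BUSINESS(.*?)ITEM\s+1A\.?\s*RISK\s+FACTORS', text, re.I | re.S)
item1 = max(matches, key=len) if matches else ''
print(item1[:500])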
There are special characters (extra whitespace inside the "ITEM 1" heading), so normalize them first:
import requests
from simplified_scrapy.simplified_doc import SimplifiedDoc
html = requests.get('https://www.sec.gov/Archives/edgar/data/1591890/000149315218003887/form10-k.htm#a_002').text
doc = SimplifiedDoc(html)
doc.loadHtml(doc.replaceReg(doc.html, 'ITEM[\s]+','ITEM '))
item1 = doc.getElementByText('ITEM 1')
print(item1) # {'tag': 'B', 'html': 'ITEM 1. BUSINESS'}
# Here's what you might use
table = item1.getParent('TABLE')
trs = table.TRs
for tr in trs:
    print(tr.TDs)
If you use the latest version, you can use the following method instead:
import requests
from simplified_scrapy.simplified_doc import SimplifiedDoc
html = requests.get('https://www.sec.gov/Archives/edgar/data/1591890/000149315218003887/form10-k.htm#a_002').text
doc = SimplifiedDoc(html)
item1 = doc.getElementByReg('ITEM[\s]+1') # Incoming regex
print(item1,item1.text) # {'tag': 'B', 'html': 'ITEM\n 1. BUSINESS'} ITEM 1. BUSINESS
# Here's what you might use
table = item1.getParent('TABLE')
trs = table.TRs
for tr in trs:
    print(tr.TDs)

Python/BeautifulSoup Web-scraping Issue--Irregular Returns

I've scoured the questions/answers and have attempted to implement changes to the following, but to no avail.
I'm trying to scrape pages of course listings from Coursera's "Data Analysis" results, https://www.coursera.org/browse/data-science/data-analysis?languages=en&page=1.
There are 9 pages, each with 25 courses, and each course is under its own <h2> tag. I've found some success with the following code, but it has not been consistent:
courses_data_sci = []
for i in range(10):
    page = "https://www.coursera.org/browse/data-science/data-analysis?languages=en&page=" + str(i)
    html = urlopen(page)
    soup = BeautifulSoup(html.read(), "html.parser")
    for meta in soup.find_all('div', {'id' : 'rendered-content'}):
        for x in range(26):
            try:
                course = meta.find_all('h2')[x].text.strip()
                courses_data_sci.append(course)
            except IndexError:
                pass
This code seems to return the first 2-3 pages of results and the last page of results; sometimes, if I run it again after clearing courses_data_sci, it will return the 4th page of results a few times. (I'm working in Jupyter, and I've restarted the kernel to account for any issues there.)
I'm not sure why the code isn't working correctly, let alone why it is returning inconsistent results.
Any help is appreciated. Thank you.
UPDATE
Thanks for the ideas...I am trying to utilize both to make the code work.
Just out of curiosity, I pared down the code to see what it was picking up, with both comments in mind.
courses_data_sci = []
session = requests.Session()
for i in range(10):
    page = "https://www.coursera.org/browse/data-science/data-analysis?languages=en&page=" + str(i)
    html = urlopen(page)
    soup = BeautifulSoup(html.read(), "html.parser")
    for meta in soup.find_all('div', {'id' : 'rendered-content'}):
        course = meta.find_all('h2')
        courses_data_sci.append(course)
    # This is to check length of courses_data_sci across pages
    print('Page: %s -- total length %s' % (i, len(courses_data_sci)))
This actually results in a list of lists, which does contain all the courses throughout the 9 pages (and, of course, the href info since it isn't being stripped yet). Each loop creates one list per page: a list of all the courses on the respective page. So it appears that I should be able to strip the href while the lists are being pushed to the list, courses_data_sci.
There are 2 <h2> tags per course, so I'm also thinking there could be an issue with the second range() call: for x in range(26). I've tried multiple different ranges; none of them work, or they return an "index out of range" error.
I get the same behaviour using your code.
I changed it in order to use requests:
from bs4 import BeautifulSoup
import requests
courses_data_sci = []
session = requests.Session()
for i in range(10):
    page = "https://www.coursera.org/browse/data-science/data-analysis?languages=en&page=" + str(i)
    html = session.get(page)
    soup = BeautifulSoup(html.text, "html.parser")
    for meta in soup.find_all('div', {'id' : 'rendered-content'}):
        for x in range(26):
            try:
                course = meta.find_all('h2')[x].text.strip()
                courses_data_sci.append(course)
            except IndexError:
                pass
    # This is to check length of courses_data_sci across pages
    print('Page: %s -- total length %s' % (i, len(courses_data_sci)))
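If the goal is just the course titles, a small variation on the loop above that walks every <h2> directly (instead of indexing with a fixed range(26)) avoids the IndexError handling altogether; a sketch along the same lines:
from bs4 import BeautifulSoup
import requests

courses_data_sci = []
session = requests.Session()
for i in range(10):
    page = "https://www.coursera.org/browse/data-science/data-analysis?languages=en&page=" + str(i)
    soup = BeautifulSoup(session.get(page).text, "html.parser")
    for meta in soup.find_all('div', {'id': 'rendered-content'}):
        for h2 in meta.find_all('h2'):
            courses_data_sci.append(h2.text.strip())
    print('Page: %s -- total length %s' % (i, len(courses_data_sci)))
# if each course really emits two <h2> tags, de-duplicate while preserving order
courses_data_sci = list(dict.fromkeys(courses_data_sci))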

How to get span value using python,BeautifulSoup

I am using BeautifulSoup for the first time and am trying to collect several pieces of data, such as the email, phone number, and mailing address, from a soup object.
Using regular expressions, I can identify the email address. My code to find the email is:
def get_email(link):
    mail_list = []
    for i in link:
        a = str(i)
        email_pattern = re.compile("<a\s+href=\"mailto:([a-zA-Z0-9._@]*)\">", re.IGNORECASE)
        ik = re.findall(email_pattern, a)
        if (len(ik) == 1):
            mail_list.append(i)
        else:
            pass
    s_email = str(mail_list[0]).split('<a href="')
    t_email = str(s_email[1]).split('">')
    print t_email[0]
Now, I also need to collect the phone number, mailing address and web url. I think in BeautifulSoup there must be an easy way to find those specific data.
A sample html page is as below:
<ul>
  <li>
    <span>Email:</span>
    <a href="mailto:abc@gmail.com">Message Us</a>
  </li>
  <li>
    <span>Website:</span>
    <a target="_blank" href="http://www.abcl.com">Visit Our Website</a>
  </li>
  <li>
    <span>Phone:</span>
    (123)456-789
  </li>
</ul>
And using BeautifulSoup, I am trying to collect the values for the Email, Website and Phone spans.
Thanks in advance.
The most obvious problem with your code is that you're turning the object representing the link back into HTML and then parsing it with a regular expression again - that ignores much of the point of using BeautifulSoup in the first place. You might need to use a regular expression to deal with the contents of the href attribute, but that's it. Also, the else: pass is unnecessary - you can just leave it out entirely.
Here's some code that does something like what you want, and might be a useful starting point:
from BeautifulSoup import BeautifulSoup
import re

# Assuming that html is your input as a string:
soup = BeautifulSoup(html)

all_contacts = []

def mailto_link(e):
    '''Return the email address if the element is a mailto link,
    otherwise return None'''
    if e.name != 'a':
        return None
    for key, value in e.attrs:
        if key == 'href':
            m = re.search('mailto:(.*)', value)
            if m:
                return m.group(1)
    return None

for ul in soup.findAll('ul'):
    contact = {}
    for li in soup.findAll('li'):
        s = li.find('span')
        if not (s and s.string):
            continue
        if s.string == 'Email:':
            a = li.find(mailto_link)
            if a:
                contact['email'] = mailto_link(a)
        elif s.string == 'Website:':
            a = li.find('a')
            if a:
                contact['website'] = a['href']
        elif s.string == 'Phone:':
            contact['phone'] = unicode(s.nextSibling).strip()
    all_contacts.append(contact)

print all_contacts
That will produce a list with one dictionary per contact found; in this case it will just be:
[{'website': u'http://www.abcl.com', 'phone': u'(123)456-789', 'email': u'abc@gmail.com'}]
