BS HTML Parsing - & is ignored when printing URL strings - python

Consider the following example.
htmlist = ['<div class="portal" role="navigation" id="p-coll-print_export">',
           '<h3>Print/export</h3>',
           '<div class="body">',
           '<ul>',
           '<li id="coll-create_a_book"><a href="/w/index.php?title=Special:Book&amp;bookcmd=book_creator&amp;referer=Main+Page">Create a book</a></li>',
           '<li id="coll-download-as-rl"><a href="/w/index.php?title=Special:Book&amp;bookcmd=render_article&amp;arttitle=Main+Page&amp;oldid=560327612&amp;writer=rl">Download as PDF</a></li>',
           '<li id="t-print"><a accesskey="p" href="/w/index.php?title=Main_Page&amp;printable=yes" title="Printable version of this page [p]">Printable version</a></li>',
           '</ul>',
           '</div>',
           '</div>',
           ]
soup = __import__("bs4").BeautifulSoup("".join(htmlist), "html.parser")
for x in soup("a"):
    print(x)
    print(x.attrs)
    print(soup.a.get_text())
I was expecting this short script to print each a tag x, followed by a dictionary of x's attributes (attribute names as keys, their contents as values), ending with the text of the link.
Instead the output is
<a href="/w/index.php?title=Special:Book&bookcmd=book_creator&referer=Main+Page">Create a book</a>
{'href': '/w/index.php?title=Special:Book&bookcmd=book_creator&referer=Main+Page'}
Create a book
<a href="/w/index.php?title=Special:Book&bookcmd=render_article&arttitle=Main+Page&oldid=560327612&writer=rl">Download as PDF</a>
{'href': '/w/index.php?title=Special:Book&bookcmd=render_article&arttitle=Main+Page&oldid=560327612&writer=rl'}
Create a book
<a accesskey="p" href="/w/index.php?title=Main_Page&printable=yes" title="Printable version of this page [p]">Printable version</a>
{'href': '/w/index.php?title=Main_Page&printable=yes', 'title': 'Printable version of this page [p]', 'accesskey': ['p']}
Create a book
The issues I find with this output are:
The print(soup.a.get_text()) bit always prints the text of the first a tag, no matter which x the loop is on.
In the dictionaries output by print(x.attrs), the href values contain a bare & where the source HTML has &amp;.
What am I missing here and how do I get the desired output?

BeautifulSoup decodes HTML entities when it parses, so the &amp; in the source becomes a plain & in the tree. Likewise, soup.a is shorthand for soup.find('a'), which always returns the first match - that is why the first link's text prints on every iteration; use the loop variable x instead. To get entity-escaped output back, you can use html.escape (cgi.escape is deprecated and was removed in Python 3.8) to re-encode the & character.
import html

for x in soup("a"):
    print(x)
    print({k: html.escape(v, False) if k == 'href' else v for k, v in x.attrs.items()})
    print(x.get_text())
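As a quick standalone check (a minimal sketch, not from the original answer), you can see that the parser decodes entities on input, so escaping only matters when you serialize values back out:
import html
from bs4 import BeautifulSoup

snippet = '<a href="/w?a=1&amp;b=2">link</a>'
tag = BeautifulSoup(snippet, "html.parser").a

print(tag["href"])                      # /w?a=1&b=2   (entities decoded in the tree)
print(html.escape(tag["href"], False))  # /w?a=1&amp;b=2   (re-encoded for output)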

Related

How to get twitter profile name using python BeautifulSoup module?

I'm trying to get a Twitter profile name from the profile URL with BeautifulSoup in Python,
but whatever HTML tags I use, I'm not able to get the name. What HTML tags can I use to
get the profile name from a Twitter user page?
import requests
from bs4 import BeautifulSoup

url = 'https://twitter.com/twitterID'
html = requests.get(url).text
soup = BeautifulSoup(html, 'html.parser')
# Find the display name
name_element = soup.find('span', {'class': 'css-901oao css-16my406 r-poiln3 r-bcqeeo r-qvutc0'})
if name_element is not None:
    display_name = name_element.text
else:
    display_name = "error"
html = requests.get(url).text
Twitter profile pages cannot be scraped simply through requests like this, since the contents of the profile pages are loaded with JavaScript [via the API], as you might notice if you previewed the fetched HTML or checked your browser's network logs.
name_element = soup.find('span', {'class':'css-901oao css-16my406 r-poiln3 r-bcqeeo r-qvutc0'})
display_name = name_element.text
Even after fetching the right HTML, calling .find like that will result in display_name containing 'To view keyboard shortcuts, press question mark' or 'Don’t miss what’s happening', because there are 67 span tags with that class. Calling .find_all(....)[6] might work, but it's definitely not a reliable approach. You should instead consider using .select with CSS selectors to target the name.
name_element = soup.select_one('div[data-testid="UserName"] span>span')
The .find equivalent would be
# name_element = soup.find('div', {'data-testid': 'UserName'}).span.span ## too many weak points
name_element = soup.find(lambda t: t.name == t.parent.name == 'span' and t.find_parent('div', {'data-testid': 'UserName'}))
but I find .select much more convenient.
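As a quick gloss on that selector (a sketch only - the markup is Twitter's and changes over time), combined with the defensive check from the question:
# 'div[data-testid="UserName"] span>span' selects a <span> that is a direct
# child of another <span>, anywhere inside a <div> with data-testid="UserName"
name_element = soup.select_one('div[data-testid="UserName"] span>span')
display_name = name_element.get_text(strip=True) if name_element else "error"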
Selenium Example
Using two functions I often use for scraping - linkToSoup_selenium (which takes a URL and returns a BeautifulSoup object after using selenium and bs4 to load and parse the HTML), and selectForList (which extracts details from bs4 Tags based on selectors [like in the selectors dictionary below])
Setup:
# imports ## PASTE FROM https://pastebin.com/kEC9gPC8
# def linkToSoup_selenium... ## PASTE FROM https://pastebin.com/kEC9gPC8
# def selectForList... ## PASTE FROM https://pastebin.com/ZnZ7xM6u
## JUST FOR REDUCING WHITESPACE - not important for extracting information ##
def miniStr(o): return ' '.join(w for w in str(o).split() if w)
profileUrls = ['https://twitter.com/twitterID', 'https://twitter.com/jokowi', 'https://twitter.com/sep_colin']
# ptSel = 'article[data-testid="tweet"]:has(div[data-testid="socialContext"])'
# ptuaSel = 'div[data-testid="User-Names"]>div>div>div>a'
selectors = {
    'og_url': ('meta[property="og\:url"][content]', 'content'),
    'name_span': 'div[data-testid="UserName"] span>span',
    'name_div': 'div[data-testid="UserName"]',
    # 'handle': 'div[data-testid="UserName"]>div>div>div+div',
    'description': 'div[data-testid="UserDescription"]',
    # 'location': 'span[data-testid="UserLocation"]>span',
    # 'url_href': ('a[data-testid="UserUrl"][href]', 'href'),
    # 'url_text': 'a[data-testid="UserUrl"]>span',
    # 'birthday': 'span[data-testid="UserBirthdate"]',
    # 'joined': 'span[data-testid="UserJoinDate"]>span',
    # 'following': 'div[data-testid="UserName"]~div>div>a[href$="\/following"]',
    # 'followers': 'div[data-testid="UserName"]~div>div>a[href$="\/followers"]',
    # 'pinnedTweet_uname': f'{ptSel} div[data-testid="User-Names"] span>span',
    # 'pinnedTweet_handl': f'{ptSel} {ptuaSel}:not([aria-label])',
    # 'pinnedTweet_pDate': (f'{ptSel} {ptuaSel}[aria-label]', 'aria-label'),
    # 'pinnedTweet_text': f'{ptSel} div[data-testid="tweetText"]',
}
def scrapeTwitterProfile(profileUrl, selRef=selectors):
    soup = linkToSoup_selenium(profileUrl, ecx=[
        'div[data-testid="UserDescription"]'  # wait for user description to load
        # 'article[data-testid="tweet"]'      # wait for tweets to load
    ], tmout=3, by_method='css', returnErr=True)
    if not isinstance(soup, str):
        return selectForList(soup, selRef)
    return {'Error': f'failed to scrape {profileUrl} - {soup}'}
Setting returnErr=True returns the error message (a string instead of the BeautifulSoup object) if anything goes wrong. ecx should be set based on which parts you want to wait for (it's a list, so it can hold multiple selectors). tmout doesn't have to be passed (the default is 25 sec), but if it is, it should be adjusted for the other arguments and your own device and browser speeds - on my browser, tmout=0.01 is enough to load user details, but loading the first tweets takes at least tmout=2.
I wrote scrapeTwitterProfile mostly so that I could get tuDets [below] in one line. The for-loop after that is just for printing the results.
tuDets = [scrapeTwitterProfile(url) for url in profileUrls]
for url, d in zip(profileUrls, tuDets):
    print('\nFrom', url)
    for k, v in d.items():
        print(f'\t{k}: {miniStr(v)}')
snscrape Example
snscrape has a module for Twitter that can be used to access Twitter data without having registered for the API yourself. The example below prints output similar to the previous example, but runs faster.
# import snscrape.modules.twitter as sns_twitter
# def miniStr(o): return ' '.join(w for w in str(o).split() if w)
# profileIDs = [url.split('twitter.com/', 1)[-1].split('/')[0] for url in profileUrls]
profileIDs = ['twitterID', 'jokowi', 'sep_colin']
keysList = ['username', 'id', 'displayname', 'description', 'url']
for pid in profileIDs:
    tusRes, defVal = sns_twitter.TwitterUserScraper(pid).entity, 'no such attribute'
    print('\nfor ID', pid)
    for k in keysList:
        print('\t', k, ':', miniStr(getattr(tusRes, k, defVal)))
You can get most of the attributes of .entity with .__dict__, or print them all with something like
print('\n'.join(f'{a}: {miniStr(v)}' for a, v in [
    (n, getattr(tusRes, n)) for n in dir(tusRes)
] if not (a[:1] == '_' or callable(v))))
See this example from this tutorial if you are interested in scraping tweets as well.
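If you want a head start on that, here's a rough sketch with snscrape (hedged: .date and .url are tweet attributes in the releases I've used; older releases expose the tweet text as .content, newer ones as .rawContent):
import itertools
import snscrape.modules.twitter as sns_twitter

# print the date and URL of a user's 5 most recent tweets
for tweet in itertools.islice(sns_twitter.TwitterUserScraper('jokowi').get_items(), 5):
    print(tweet.date, tweet.url)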

How can I scrape this page?

I'm scraping a page, but I'm getting errors trying to scrape WANTED-DATA
<td class="class-1" data-reactid="41"><a class="class-2" data-reactid="42" data-symbol="MORE-DATA" href="/quote/HKxlkPH4-x" title="WANTED-DATA">text</a></td>
The closest thing I can extract is text, by doing:
getText.find('a', attrs={'class':'class-2'}).text
# output: 'text'
How can I scrape 'WANTED-DATA'?
try this one:
links = soup.find_all('a', attrs={'class': 'class-2'})
for link in links:
    title = link.get('title')
From the docs: you can write tag[attr_name] to get a single attribute, and tag.attrs to get a dictionary of all attributes with their values.
soup.find('a', attrs={'class':'class-2'})['title']
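For illustration, a small self-contained sketch using the tag from the question, showing both access styles:
from bs4 import BeautifulSoup

html = '<a class="class-2" data-symbol="MORE-DATA" href="/quote/HKxlkPH4-x" title="WANTED-DATA">text</a>'
tag = BeautifulSoup(html, 'html.parser').a

print(tag['title'])  # WANTED-DATA -- single attribute
print(tag.attrs)     # all attributes as a dict (note: class is parsed as a list)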
You could also do it like this:
html = """<td class="class-1" data-reactid="41"><a class="class-2" data-reactid="42" data-symbol="MORE-DATA" href="/quote/HKxlkPH4-x" title="WANTED-DATA">text</a></td>"""
soup = BeautifulSoup(html, 'html.parser')
## adding title=True below prevents errors in case any links lack the 'title' attribute
titles = [x.get('title') for x in soup.find_all('a',title=True)]
print(titles)
Output:
['WANTED-DATA']

BeautifulSoup get links between strings

So I am using BS4 to get the following out of a website:
<div>Some TEXT with some <a href="https// - actual Link">LINK</a>
and some continuing TEXT with following some <a href="https//- next Link">LINK</a> inside.</div>
What I need to get is:
"Some TEXT with some LINK ("https// - actual Link") and some continuing TEXT with following some LINK ("https//- next Link") inside."
I have been struggling with this for some time now and don't know how to get there ... I tried before, after, between, [:], and all sorts of in-array-passing methods to get everything together.
I hope someone can help me with this because I am new to Python. Thanks in advance.
You can use str.join with an iteration over soup.contents:
import bs4
html = '''<div>Some TEXT with some <a href="https// - actual Link">LINK</a> and some continuing TEXT with following some <a href="https//- next Link">LINK</a> inside.</div>'''
result = ''.join(
    i if isinstance(i, bs4.element.NavigableString) else f'{i.text} ({i["href"]})'
    for i in bs4.BeautifulSoup(html, 'html.parser').div.contents
)
Output:
'Some TEXT with some LINK (https// - actual Link) and some continuing TEXT with following some LINK (https//- next Link) inside.'
Edit: ignoring br tags:
html = '''<div>Some TEXT <br> with some <a href="https// - actual Link">LINK</a> and some continuing TEXT with <br> following some <a href="https//- next Link">LINK</a> inside.</div>'''
result = ''.join(
    i if isinstance(i, bs4.element.NavigableString) else f'{i.text} ({i["href"]})'
    for i in bs4.BeautifulSoup(html, 'html.parser').div.contents
    if getattr(i, 'name', None) != 'br'
)
Edit 2: recursive solution:
def form_text(s):
    if isinstance(s, (str, bs4.element.NavigableString)):
        yield s
    elif s.name == 'a':
        yield f'{s.get_text(strip=True)} ({s["href"]})'
    else:
        for i in getattr(s, 'contents', []):
            yield from form_text(i)

html = '''<div>Some TEXT <i>other text in </i> <br> with some <a href="https// - actual Link">LINK</a> and some continuing TEXT with <br> following some <a href="https//- next Link">LINK</a> inside.</div>'''
print(' '.join(form_text(bs4.BeautifulSoup(html, 'html.parser'))))
Output:
Some TEXT other text in with some LINK (https// - actual Link) and some continuing TEXT with following some LINK (https//- next Link) inside.
Also, whitespace may become an issue due to the presence of br tags and the like. To work around this, you can collapse it with re.sub:
import re
result = re.sub(r'\s+', ' ', ' '.join(form_text(bs4.BeautifulSoup(html, 'html.parser'))))
Output:
'Some TEXT other text in with some LINK (https// - actual Link) and some continuing TEXT with following some LINK (https//- next Link) inside.'

BeautifulSoup - Python - Find the key from HTML

I have been practicing with bs4 and Python and now I am stuck.
My plan is to do an if/else, where I want something similar to:
if (I find a value inside this html):
    do this method
else:
    do something else
and I have some HTML I found randomly, which looks like -
<div class="Talkinghand" data-backing="ShowingHide" data-key="123456" data-theme="$MemeTheme" style=""></div>
and this is what I have done so far:
s = requests.Session()
Url = 'www.myhtml.com' #Just took a random page which I don't feel to insert
r = s.get(Url)
soup = soup(r, "lxml")
findKey = soup.find(('div', {'class': 'Talkinghand'})['data-key'])
print(findKey)
but no luck. It gives me an error:
TypeError: object of type 'Response' has no len()
Once I find or print out the key, I want to do an if/else statement that also says:
if (there is a value inside that data-key):
    ...
To display the data-key attribute from inside the <div> tag, you can do the following:
from bs4 import BeautifulSoup
html = '<div class="Talkinghand" data-backing="ShowingHide" data-key="123456" data-theme="$MemeTheme" style=""></div>'
soup = BeautifulSoup(html, "html.parser")
print(soup.div['data-key'])
This would print:
123456
You would need to pass r.content to your soup call.
Your script had an extra ( and ), so the following would also work:
findKey = soup.find('div', {'class': 'Talkinghand'})['data-key']
print(findKey)
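For the if/else part of the question, a minimal sketch (assuming soup was built from r.content as noted above; .get avoids a KeyError when the attribute is absent):
tag = soup.find('div', {'class': 'Talkinghand'})
if tag is not None and tag.get('data-key'):
    print('key found:', tag['data-key'])
else:
    print('no key found')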

How to find text of <div><span>text</span></div> in beautifulsoup?

This is the HTML:
<div><div id="NhsjLK">
<li class="EditableListItem NavListItem FollowersNavItem NavItem not_removable">
Followers <span class="list_count">92</span></li></div></div>
I want to extract the text 92, convert it to an integer, and print it in Python 2. How can I?
Code:
i = soup.find('div', id='NhsjLK')
print "Followers :", i.find('span', id='list_count').text
Note first that in your attempt, list_count is a class, not an id, so find('span', id='list_count') returns None. Beyond that, I'd not go with getting it by the class directly, since "list_count" is too broad a class value and might be used for other things on the page.
There are definitely several different options judging by this HTML snippet alone, but one of the nicest, from my point of view, is to use that "Followers" text/label and get the next sibling of it:
from bs4 import BeautifulSoup
data = """
<div><div id="NhsjLK">
<li class="EditableListItem NavListItem FollowersNavItem NavItem not_removable">
Followers <span class="list_count">92</span></li></div></div>"""
soup = BeautifulSoup(data, "html.parser")
count = soup.find(text=lambda text: text and text.startswith('Followers')).next_sibling.get_text()
count = int(count)
print(count)
Or, another very concise and reliable approach is to use a partial match (the *= part below) on the href value of the parent a element (assuming the li sits inside a link to the followers page):
count = int(soup.select_one("a[href*=followers] .list_count").get_text())
Or, you might check the class value of the parent li element:
count = int(soup.select_one("li.FollowersNavItem .list_count").get_text())
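Since select_one returns None when nothing matches, a guarded version (same selectors, just defensive) might look like:
el = soup.select_one("li.FollowersNavItem .list_count")
count = int(el.get_text(strip=True)) if el else 0
print(count)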
