How to extract a dictionary/list from BeautifulSoup - python

With the following line of code, I can extract a list-like line with beautifulsoup
Code:
section = soup.find("div", {"class": "listing-col col-sm-16 col-md-12 col-lg-13 col"})
for span in section.select('div.carListing--textCol2'):
print(span.select_one('shortlist-directive[ng-init]')['ng-init'])
Where the output yields a list-esque dictionary-esque line
Output:
setCurrentListingIdSrp('10856566'); setGAEventDataSrp({"ss_cg_listing_id":10856566,"listingid":10856566,"make":"Audi","model":"A4","transmission":"Manual","body_type":"Wagon","location":"the moon, SolarSystem","Kms":"12,469 km","featured":"No","seller_type":"USED Dealer ad","ss_cg_products":"V"});
Question:
How can I extract setGAEventDataSrp as a Python dictionary?
What I have tried but didn't work:
Not the most Pythonic way.
for span in section.select('div.carListing--textCol2'):
data_string = dict(str(span.select_one('shortlist-directive[ng-init]')['ng-init'].split('setGAEventDataSrp(')[-1][:-2]))

Use regular expression.
import re
import json
html='''setCurrentListingIdSrp('10856566'); setGAEventDataSrp({"ss_cg_listing_id":10856566,"listingid":10856566,"make":"Audi","model":"A4","transmission":"Manual","body_type":"Wagon","location":"the moon, SolarSystem","Kms":"12,469 km","featured":"No","seller_type":"USED Dealer ad","ss_cg_products":"V"});'''
output=re.findall('\{.*?}',html)[0]
json=json.loads(output)
print(json)
Just replace the html to span.select_one('shortlist-directive[ng-init]')['ng-init']

You can use json.loads
>>> import json
>>> type(json.loads('{"a":1, "b":"w"}'))
<class 'dict'>
And
data_string = json.loads(str(span.select_one('shortlist-directive[ng-init]')['ng-init']).split('setGAEventDataSrp(')[-1][:-2])

Related

Python - How to extract a number from a bs4 output

I am trying to get a price from a website using BeautifulSoup and so far I have managed to get:
<h2>£<!-- -->199.99</h2>
I just want to receive '£199.99'
Is there a way to filter out the letters?
Thanks in advance
You will use get_text function with strip=True to clean if necessary
from bs4 import BeautifulSoup
html = '<h2>£<!-- -->199.99</h2>'
soup = BeautifulSoup(html,'html5lib')
result = soup.find('h2').get_text(strip=True)
print(result)
#£199.99
Use re?
import re
s = "<h2>£<!-- -->199.99</h2>"
rx_price = re.compile(r'([0-9.]+)')
content = re.sub(r'<.+?>', '', s)
print (f"£{rx_price.findall(content)[0]}")
Output:
£199.99

Get element's text with CDATA

Say, I have an element:
>>> el = etree.XML('<tag><![CDATA[content]]></tag>')
>>> el.text
'content'
What I'd like to get is <![CDATA[content]]>. How can I go about it?
When you do el.text, that's always going to give you the plain text content.
To see the serialized element try tostring() instead:
el = etree.XML('<tag><![CDATA[content]]></tag>')
print(etree.tostring(el).decode())
this will print:
<tag>content</tag>
To preserve the CDATA, you need to use an XMLParser() with strip_cdata=False:
parser = etree.XMLParser(strip_cdata=False)
el = etree.XML('<tag><![CDATA[content]]></tag>', parser=parser)
print(etree.tostring(el).decode())
This will print:
<tag><![CDATA[content]]></tag>
This should be sufficient to fulfill your "I want to make sure in a test that content is wrapped in CDATA" requirement.
You might consider using BeautifulSoup and look for CDATA instances:
import bs4
from bs4 import BeautifulSoup
data='''<tag><![CDATA[content]]></tag>'''
soup = BeautifulSoup(data, 'html.parser')
"<![CDATA[{}]]>".format(soup.find(text=lambda x: isinstance(x, bs4.CData)))
Output
<![CDATA[content]]>

is there any convenient way to get a index of sub section in a page?

it is convenient to use "index-x" to quick locate a sub section in a page.
for instance
https://docs.python.org/3/library/re.html#index-2
gives 3rd sub-section in this page.
when i want to share the location of a sub-section to others, how to get the index in a convenient way?
for instance, how to get the index of {m,n} sub-section without counting from index-0?
With bs4 4.7.1 you can use :has and :contains to target a specific text string and return the index (note that using select_one will return first match. Use a list comprehension and select if want to return all matches
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://docs.python.org/3/library/re.html')
soup = bs(r.content, 'lxml')
index = soup.select_one('dl:has(.pre:contains("{m,n}"))')['id']
print(index)
Any version: if you want a dictionary that maps special characters to indices. Thanks to #zoe for spotting the error in my dictionary comprehension.
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://docs.python.org/3/library/re.html')
soup = bs(r.content, 'lxml')
mappings = dict([(item['id'], [i.text for i in item.select('dt .pre')]) for item in soup.select('[id^="index-"]')])
indices = {i: k for (k, v) in mappings.items() for i in v}
You're looking for index-7.
You can download the HTML of the page and get all the possible values of index-something with the following code:
import re
import requests
from bs4 import BeautifulSoup
r = requests.get('https://docs.python.org/3/library/re.html')
soup = BeautifulSoup(r.content.decode())
result = [t['id'] for t in soup.find_all(id=re.compile('index-\d+'))]
print(result)
Output:
['index-0', 'index-1', 'index-2', 'index-3', 'index-4', 'index-5', 'index-6', 'index-7', 'index-8', 'index-9', 'index-10', 'index-11', 'index-12', 'index-13', 'index-14', 'index-15', 'index-16', 'index-17', 'index-18', 'index-19', 'index-20', 'index-21', 'index-22', 'index-23', 'index-24', 'index-25', 'index-26', 'index-27', 'index-28', 'index-29', 'index-30', 'index-31', 'index-32', 'index-33', 'index-34', 'index-35', 'index-36', 'index-37', 'index-38']
The t objects in the list comprehension contain the HTML of the tags whose id matches the regex.

Extract number from a website using beautifulsoup?

The following python code:
from bs4 import BeautifulSoup
div = '<div class="hm"><span class="xg1">查看:</span> 15660<span class="pipe">|</span><span class="xg1">回复:</span> 435</div>'
soup = BeautifulSoup(div, "lxml")
hm = soup.find("div", {"class": "hm"})
print(hm)
The output that i want two number in this case:
15660
435
I want to try to extract the numbers from the website using beautifulsoup. But I do not know how to do it?
Call soup.find_all, with a regex -
>>> list(map(str.strip, soup.find_all(text=re.compile(r'\b\d+\b'))))
Or,
>>> [x.strip() for x in soup.find_all(text=re.compile(r'\b\d+\b'))]
['15660', '435']
If you need integers instead of strings, call int inside the list comprehension -
>>> [int(x.strip()) for x in soup.find_all(text=re.compile(r'\b\d+\b'))]
[15660, 435]

Hashtags python html

I want to extract all the hashtags from a given website:
For example, "I love #stack overflow because #people are very #helpful!"
This should pull the 3 hashtags into a table.
In the website I am targeting there is a table with a #tag description
So we can find #love this hashtag speaks about love
This is my work:
#import the library used to query a website
import urllib2
#specify the url
wiki = "https://www.symplur.com/healthcare-hashtags/tweet-chats/all"
#Query the website and return the html to the variable 'page'
page = urllib2.urlopen(wiki)
#import the Beautiful soup functions to parse the data returned from the
website
from bs4 import BeautifulSoup
#Parse the html in the 'page' variable, and store it in Beautiful Soup
format
soup = BeautifulSoup(page, "lxml")
print soup.prettify()
s = soup.get_text()
import re
re.findall("#(\w+)", s)
I have an issues in the output :
The first one is that the output look like this :
[u'eeeeee',
u'333333',
u'222222',
u'222222',
u'222222',
u'222222',
u'222222',
u'222222',
u'222222',
u'AASTGrandRoundsacute'
The output concatenate the Hashtag with the first word in the description. If I compare to the example I evoked before the output is 'lovethis'.
How can I do to extract only the one word after the hashtag.
Thank you
I think there's no need to use regex to parse the text you get from the page, you can use BeautifulSoup itself for that. I'm using Python3.6 in the code below, just to show the entire code, but the important line is hashtags = soup.findAll('td', {'id':'tweetchatlist_hashtag'}). Notice all hashtags in the table have td tag and id attribute = tweetchatlist_hashtag, so calling .findAll is the way to go here:
import requests
import re
from bs4 import BeautifulSoup
wiki = "https://www.symplur.com/healthcare-hashtags/tweet-chats/all"
page = requests.get(wiki).text
soup = BeautifulSoup(page, "lxml")
hashtags = soup.findAll('td', {'id':'tweetchatlist_hashtag'})
Now let's have a look at the first item of our list:
>>> hashtags[0]
<td id="tweetchatlist_hashtag" itemprop="location">#AASTGrandRounds</td>
So we see that what we really want is the value of title attribute of a:
>>> hashtags[0].a['title']
'#AASTGrandRounds'
To proceed to get a list of all hashtags using list comprehension:
>>> lst = [hashtag.a['title'] for hashtag in hashtags]
If you are not used with list comprehension syntax, the line above is similar to this:
>>> lst = []
>>> for hashtag in hashtags:
lst.append(hashtag.a['title'])
lst then is the desired output, see the first 20 items of the list:
>>> lst[:20]
['#AASTGrandRounds', '#abcDrBchat', '#addictionchat', '#advocacychat', '#AetnaMyHealthy', '#AlzChat', '#AnatQ', '#anzOTalk', '#AskAvaility', '#ASPChat', '#ATtalk', '#autchat', '#AXSChat', '#ayacsm', '#bcceu', '#bccww', '#BCSM', '#benurse', '#BeTheDifference', '#bioethx']

Categories

Resources