Get element's text with CDATA - python

Say, I have an element:
>>> el = etree.XML('<tag><![CDATA[content]]></tag>')
>>> el.text
'content'
What I'd like to get is <![CDATA[content]]>. How can I go about it?

When you do el.text, that's always going to give you the plain text content.
To see the serialized element try tostring() instead:
el = etree.XML('<tag><![CDATA[content]]></tag>')
print(etree.tostring(el).decode())
this will print:
<tag>content</tag>
To preserve the CDATA, you need to use an XMLParser() with strip_cdata=False:
parser = etree.XMLParser(strip_cdata=False)
el = etree.XML('<tag><![CDATA[content]]></tag>', parser=parser)
print(etree.tostring(el).decode())
This will print:
<tag><![CDATA[content]]></tag>
This should be sufficient to fulfill your "I want to make sure in a test that content is wrapped in CDATA" requirement.
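In a test, that round trip can be asserted directly; a minimal sketch, assuming lxml is installed:

```python
from lxml import etree

# strip_cdata=False keeps the CDATA section through parsing and serialization
parser = etree.XMLParser(strip_cdata=False)
el = etree.XML('<tag><![CDATA[content]]></tag>', parser=parser)

serialized = etree.tostring(el).decode()
assert serialized == '<tag><![CDATA[content]]></tag>'  # CDATA wrapper preserved
assert el.text == 'content'                            # .text is still the plain content
```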

You might consider using BeautifulSoup and looking for CDATA instances:
import bs4
from bs4 import BeautifulSoup
data='''<tag><![CDATA[content]]></tag>'''
soup = BeautifulSoup(data, 'html.parser')
"<![CDATA[{}]]>".format(soup.find(text=lambda x: isinstance(x, bs4.CData)))
Output
<![CDATA[content]]>

Related

Change scraped output

I have a loop putting URLs into my browser and scraping its content, generating this output:
2PRACE,0.0014
Hispanic,0.1556
API,0.0688
Black,0.0510
AIAN,0.0031
White,0.7200
The code looks like this:
f1 = open('urlz.txt','r',encoding="utf8")
ethnicity_urls = f1.readlines()
f1.close()
from urllib import request
from bs4 import BeautifulSoup
import time
import openpyxl
import pprint
for each in ethnicity_urls:
    time.sleep(1)
    scraped = request.urlopen(each)
    soup = BeautifulSoup(scraped)
    soup1 = soup.select('p')
    print(soup1)
    resultFile = open('results.csv','a')
    resultFile.write(pprint.pformat(soup1))
    resultFile.close()
My problem is quite simple yet I do not find any tool that helps me achieve it. I would like to change the output from a list with "\n" in it to this:
2PRACE,0.0014 Hispanic,0.1556 API,0.0688 Black,0.0510 AIAN,0.0031 White,0.7200
I did not succeed by using replace as it told me I am treating a number of elements the same as a single element.
My approach here was:
for each in ethnicity_urls:
    time.sleep(1)
    scraped = request.urlopen(each)
    soup = BeautifulSoup(scraped)
    soup1 = soup.select('p')
    soup2 = soup1.replace('\n',' ')
    print(soup2)
    resultFile = open('results.csv','a')
    resultFile.write(pprint.pformat(soup2))
    resultFile.close()
Can you help me find the correct approach to mutate the output before writing it to a csv?
The error message I get:
AttributeError: ResultSet object has no attribute 'replace'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?
See the solution to the problem in my answer below. Thanks for all the responses!
soup1 is a ResultSet (an iterable of elements), so you cannot just call replace on it.
Instead you could loop through the items in soup1, call replace on the string form of each one, and append the changed string to a soup2 list. Something like this:
soup2 = []
for e in soup1:
    soup2.append(str(e).replace('\n',' '))
You need to iterate over the soup; soup1 is a list of elements.
The BS4 documentation is excellent and has many, many examples:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
Use strip() to remove the \n
for x in soup1:
    for r in x.children:
        try:
            print(r.strip())
        except TypeError:
            pass
Thank you both for the ideas and resources. I think I could implement what you suggested. The current build is
for each in ethnicity_urls:
    time.sleep(1)
    scraped = request.urlopen(each)
    soup = BeautifulSoup(scraped)
    soup1 = soup.select('p')
    soup2 = str(soup1)
    soup2 = soup2.replace('\n','')
    print(soup2)
    resultFile = open('results.csv','a')
    resultFile.write(pprint.pformat(soup2))
    resultFile.close()
And it works just fine. I can do the final adjustments now in Excel.
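The newline handling itself can be checked without any scraping; assuming the stringified <p> texts look like the lines in the output above, the strip-and-join step alone is:

```python
# Hypothetical stand-in for the text of the scraped <p> elements
lines = ['2PRACE,0.0014\n', 'Hispanic,0.1556\n', 'API,0.0688\n',
         'Black,0.0510\n', 'AIAN,0.0031\n', 'White,0.7200\n']

# Strip each trailing newline and join everything onto one line
joined = ' '.join(line.strip() for line in lines)
print(joined)  # 2PRACE,0.0014 Hispanic,0.1556 API,0.0688 Black,0.0510 AIAN,0.0031 White,0.7200
```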

How to extract a dictionary/list from BeautifulSoup

With the following line of code, I can extract a list-like line with beautifulsoup
Code:
section = soup.find("div", {"class": "listing-col col-sm-16 col-md-12 col-lg-13 col"})
for span in section.select('div.carListing--textCol2'):
    print(span.select_one('shortlist-directive[ng-init]')['ng-init'])
Where the output yields a list-esque dictionary-esque line
Output:
setCurrentListingIdSrp('10856566'); setGAEventDataSrp({"ss_cg_listing_id":10856566,"listingid":10856566,"make":"Audi","model":"A4","transmission":"Manual","body_type":"Wagon","location":"the moon, SolarSystem","Kms":"12,469 km","featured":"No","seller_type":"USED Dealer ad","ss_cg_products":"V"});
Question:
How can I extract setGAEventDataSrp as a Python dictionary?
What I have tried but didn't work:
Not the most Pythonic way.
for span in section.select('div.carListing--textCol2'):
    data_string = dict(str(span.select_one('shortlist-directive[ng-init]')['ng-init'].split('setGAEventDataSrp(')[-1][:-2]))
Use a regular expression:
import re
import json
html = '''setCurrentListingIdSrp('10856566'); setGAEventDataSrp({"ss_cg_listing_id":10856566,"listingid":10856566,"make":"Audi","model":"A4","transmission":"Manual","body_type":"Wagon","location":"the moon, SolarSystem","Kms":"12,469 km","featured":"No","seller_type":"USED Dealer ad","ss_cg_products":"V"});'''
output = re.findall(r'\{.*?\}', html)[0]
data = json.loads(output)
print(data)
Just replace html with span.select_one('shortlist-directive[ng-init]')['ng-init'].
You can use json.loads
>>> import json
>>> type(json.loads('{"a":1, "b":"w"}'))
<class 'dict'>
And
data_string = json.loads(str(span.select_one('shortlist-directive[ng-init]')['ng-init']).split('setGAEventDataSrp(')[-1][:-2])
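The split-and-parse step can be tried on the sample string alone; the payload below is a shortened, hypothetical subset of the real one:

```python
import json

# Shortened ng-init value modeled on the question's sample
ng_init = ('setCurrentListingIdSrp(\'10856566\'); '
           'setGAEventDataSrp({"make":"Audi","model":"A4","listingid":10856566});')

# Everything after the setGAEventDataSrp( call, minus the trailing ");"
payload = ng_init.split('setGAEventDataSrp(')[-1][:-2]
data = json.loads(payload)
print(data['make'])  # Audi
```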

Hashtags python html

I want to extract all the hashtags from a given website:
For example, "I love #stack overflow because #people are very #helpful!"
This should pull the 3 hashtags into a table.
In the website I am targeting there is a table with a #tag description
So we can find #love this hashtag speaks about love
This is my work:
#import the library used to query a website
import urllib2
#specify the url
wiki = "https://www.symplur.com/healthcare-hashtags/tweet-chats/all"
#Query the website and return the html to the variable 'page'
page = urllib2.urlopen(wiki)
#import the Beautiful Soup functions to parse the data returned from the website
from bs4 import BeautifulSoup
#Parse the html in the 'page' variable, and store it in Beautiful Soup format
soup = BeautifulSoup(page, "lxml")
print soup.prettify()
s = soup.get_text()
import re
re.findall("#(\w+)", s)
I have some issues with the output.
The first one is that the output looks like this:
[u'eeeeee',
u'333333',
u'222222',
u'222222',
u'222222',
u'222222',
u'222222',
u'222222',
u'222222',
u'AASTGrandRoundsacute'
The second is that the output concatenates the hashtag with the first word of the description; compared with the example I gave before, the output would be 'lovethis'.
How can I extract only the word after the hashtag?
Thank you
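For what it's worth, the regex itself does stop at the word boundary; run against the example sentence from the question, it returns only the tag words (the concatenation in the real output likely comes from get_text() joining adjacent table cells without whitespace):

```python
import re

text = "I love #stack overflow because #people are very #helpful!"
print(re.findall(r"#(\w+)", text))  # ['stack', 'people', 'helpful']
```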
I think there's no need to use regex to parse the text you get from the page, you can use BeautifulSoup itself for that. I'm using Python3.6 in the code below, just to show the entire code, but the important line is hashtags = soup.findAll('td', {'id':'tweetchatlist_hashtag'}). Notice all hashtags in the table have td tag and id attribute = tweetchatlist_hashtag, so calling .findAll is the way to go here:
import requests
import re
from bs4 import BeautifulSoup
wiki = "https://www.symplur.com/healthcare-hashtags/tweet-chats/all"
page = requests.get(wiki).text
soup = BeautifulSoup(page, "lxml")
hashtags = soup.findAll('td', {'id':'tweetchatlist_hashtag'})
Now let's have a look at the first item of our list:
>>> hashtags[0]
<td id="tweetchatlist_hashtag" itemprop="location">#AASTGrandRounds</td>
So we see that what we really want is the value of the title attribute of the a tag:
>>> hashtags[0].a['title']
'#AASTGrandRounds'
To proceed to get a list of all hashtags using list comprehension:
>>> lst = [hashtag.a['title'] for hashtag in hashtags]
If you are not used to list comprehension syntax, the line above is similar to this:
>>> lst = []
>>> for hashtag in hashtags:
...     lst.append(hashtag.a['title'])
lst then is the desired output, see the first 20 items of the list:
>>> lst[:20]
['#AASTGrandRounds', '#abcDrBchat', '#addictionchat', '#advocacychat', '#AetnaMyHealthy', '#AlzChat', '#AnatQ', '#anzOTalk', '#AskAvaility', '#ASPChat', '#ATtalk', '#autchat', '#AXSChat', '#ayacsm', '#bcceu', '#bccww', '#BCSM', '#benurse', '#BeTheDifference', '#bioethx']

Python BeautifulSoup extracting text from result

I am trying to get the text from the contents, but when I try Beautiful Soup functions on the result variable it results in errors.
from bs4 import BeautifulSoup as bs
import requests
webpage = 'http://www.dictionary.com/browse/coypu'
r = requests.get(webpage)
page_text = r.text
soup = bs(page_text, 'html.parser')
result = soup.find_all('meta', attrs={'name':'description'})
print (result.get['contents'])
I am trying to get the result to read;
"Coypu definition, a large, South American, aquatic rodent, Myocastor (or Myopotamus) coypus, yielding the fur nutria. See more."
soup.find_all() returns a list. Since in your case it returns only one element in the list, you can do:
>>> type(result)
<class 'bs4.element.ResultSet'>
>>> type(result[0])
<class 'bs4.element.Tag'>
>>> result[0].get('content')
Coypu definition, a large, South American, aquatic rodent, Myocastor (or Myopotamus) coypus, yielding the fur nutria. See more.
When you only want the first or a single tag, use find; find_all returns a list/ResultSet. Note the attribute is named content, not contents:
result = soup.find('meta', attrs={'name':'description'})["content"]
You can also use a css selector with select_one:
result = soup.select_one('meta[name=description]')["content"]
You need not use find_all; by using find alone you can get the desired output:
from bs4 import BeautifulSoup as bs
import requests
webpage = 'http://www.dictionary.com/browse/coypu'
r = requests.get(webpage)
page_text = r.text
soup = bs(page_text, 'html.parser')
result = soup.find('meta', {'name':'description'})
print(result.get('content'))
it will print:
Coypu definition, a large, South American, aquatic rodent, Myocastor (or Myopotamus) coypus, yielding the fur nutria. See more.
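For reference, the same meta-description lookup can be sketched with only the standard library; the HTML snippet below is a hypothetical stand-in for the real page:

```python
from html.parser import HTMLParser

class MetaDescription(HTMLParser):
    """Collect the content attribute of <meta name="description">."""
    def __init__(self):
        super().__init__()
        self.description = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'meta' and attrs.get('name') == 'description':
            self.description = attrs.get('content')

p = MetaDescription()
p.feed('<head><meta name="description" content="Coypu definition, a large rodent."></head>')
print(p.description)  # Coypu definition, a large rodent.
```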

How would one remove the CDATA tags but preserve the actual data in Python using LXML or BeautifulSoup

I have some XML I am parsing in which I am using BeautifulSoup as the parser. I pull the CDATA out with the following code, but I only want the data and not the CDATA tags.
myXML = open("c:\myfile.xml", "r")
soup = BeautifulSoup(myXML)
data = soup.find(text=re.compile("CDATA"))
print data
<![CDATA[TEST DATA]]>
What I would like to see is the following output:
TEST DATA
I don't care if the solution is in LXML or BeautifulSoup. Just want the best or easiest way to get the job done. Thanks!
Here is a solution:
parser = etree.XMLParser(strip_cdata=False)
root = etree.parse(self.param1, parser)
data = root.findall('./config/script')
for item in data:  # iterate through list to find text contained in elements containing CDATA
    print item.text
Based on the lxml docs:
>>> from lxml import etree
>>> parser = etree.XMLParser(strip_cdata=False)
>>> root = etree.XML('<root><data><![CDATA[test]]></data></root>', parser)
>>> data = root.findall('data')
>>> for item in data:  # iterate through list to find text contained in elements containing CDATA
...     print item.text
test  # just the text of <![CDATA[test]]>
This might be the best way to get the job done, depending on how amenable your xml structure is to this approach.
Based on BeautifulSoup:
>>> data = '<xml> <MsgType><![CDATA[text]]></MsgType> </xml>'
>>> soup = BeautifulSoup(data, "xml")
>>> soup.MsgType.get_text()
u'text'
>>> soup.MsgType.string
u'text'
>>> soup.MsgType.text
u'text'
As a result, it just prints the text from MsgType.
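If the goal is only the inner text, note that the standard-library parser discards the CDATA wrapper by default, so no special flag is needed; a minimal sketch:

```python
import xml.etree.ElementTree as ET

# ElementTree merges CDATA content into the element's plain .text
root = ET.fromstring('<xml><MsgType><![CDATA[text]]></MsgType></xml>')
print(root.find('MsgType').text)  # text
```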
