I have only been learning Python for two weeks.
I'm scraping an XML file, and one of the elements in the loop [item->description] has HTML inside. How can I get the text inside the p tags?
url="https://www.milenio.com/rss"
source=requests.get(url)
soup=BeautifulSoup(source.content, features="xml")
items=soup.findAll('item')
for item in items:
    html_text=item.description
    # This returns HTML code: <p>Paragraph 1</p> <p>Paragraph 2</p>
This next line could work, BUT it also returns the text of some internal and external links and images, which I don't need.
desc=item.description.get_text()
So I tried a loop to get all the p tags, but it doesn't work.
for p in html_text.find_all('p'):
    print(p)
AttributeError: 'NoneType' object has no attribute 'find_all'
Thank you so much!
The issue is how bs4 processes CData (the behavior is well documented, but there is no neat built-in solution).
You'll need to import CData from bs4, which helps extract the CData as a string, and parse the feed with html.parser. From there, create a new bs4 object from that string so you can call findAll on it and iterate over its contents.
from bs4 import BeautifulSoup, CData
import requests
url="https://www.milenio.com/rss"
source=requests.get(url)
soup = BeautifulSoup(source.content, 'html.parser')
items=soup.findAll('item')
for item in items:
    html_text = item.description
    findCdata = html_text.find(text=lambda tag: isinstance(tag, CData))
    newSoup = BeautifulSoup(findCdata, 'html.parser')
    paragraphs = newSoup.findAll('p')
    for p in paragraphs:
        print(p.get_text())
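If detecting CData nodes turns out to be brittle across parser versions, one alternative (a swapped-in technique, not the approach above) is to let the standard library's xml.etree.ElementTree read the feed, since it returns CDATA content as ordinary text, and hand only the description HTML to BeautifulSoup. A minimal sketch, assuming an inline sample instead of the live feed; SAMPLE_RSS and the helper name paragraphs_from_feed are made up for illustration:

```python
import xml.etree.ElementTree as ET

from bs4 import BeautifulSoup

# Hypothetical sample standing in for the real feed.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <item>
      <title>Example</title>
      <description><![CDATA[<p>Paragraph 1</p> <a href="#">link</a> <p>Paragraph 2</p>]]></description>
    </item>
  </channel>
</rss>"""


def paragraphs_from_feed(xml_text):
    """Return, for each item, the text of every <p> in its description."""
    root = ET.fromstring(xml_text)
    result = []
    for item in root.iter("item"):
        # findtext returns the CDATA content as a plain string.
        html = item.findtext("description", default="")
        fragment = BeautifulSoup(html, "html.parser")
        result.append([p.get_text(strip=True) for p in fragment.find_all("p")])
    return result


print(paragraphs_from_feed(SAMPLE_RSS))
```

With the real feed, the same helper should accept the downloaded bytes, for example paragraphs_from_feed(requests.get(url).content), assuming the feed is well-formed XML.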
Edit:
OP needed to extract the link text and found this was only possible inside the item loop using link = item.link.nextSibling, because the link content was jumping outside of its tag, like so: </link>http://www.... In XML tree view, this particular XML document showed a drop-down for the link element, which is likely the cause.
To get content from other tags in the document that don't show a drop-down in XML tree view and don't have nested CData, use the lowercase tag name and read the text as usual:
item.pubdate.get_text() # Gets the contents of the tag <pubDate>
item.author.get_text() # Gets contents of the tag <author>
It should look like this:
for item in items:
    html_text=item.description #??
    #!! don't use html_text.find_all !!
    for p in item.find_all('p'):
        print(p)
Related
I am using BeautifulSoup to delete an element from an XML document. It deletes the required tag, but it also removes some other information from the document that is not related to that element. How can I stop this?
Code to reproduce:
import requests
from bs4 import BeautifulSoup
text_file = open(r'C:\Ashok\sample.xml', 'r')  # raw string keeps the backslashes literal
s = text_file.read()
soup = BeautifulSoup(s, 'xml')
u = soup.find('Version', text='29.2.3')
fed = u.findParent()
fed.decompose()
f = open(r'C:\Ashok\sample.xml', "w")  # raw string keeps the backslashes literal
f.write(str(soup))
f.close()
A comparison is attached; the other information that was deleted is shown in red rectangles.
It is updating the Header and Footer tags, which I did not ask the code to do.
What happens?
The empty elements are not deleted; only their notation is transformed.
Empty elements in XML
An element with no content is empty and in XML, you can indicate an empty element like this:
<element></element>
An alternative notation is the so-called self-closing tag:
<element />
Both forms have identical results in XML readers, parsers,...
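To check that equivalence yourself, here is a small sketch using the standard library's xml.etree.ElementTree rather than BeautifulSoup; the element names are just the placeholders from above:

```python
import xml.etree.ElementTree as ET

# The same empty element written in both notations.
long_form = ET.fromstring("<root><element></element></root>")
short_form = ET.fromstring("<root><element /></root>")

for doc in (long_form, short_form):
    element = doc.find("element")
    # Either way the element has no text and no children.
    print(element.text, len(element))

# Both documents serialize to the same bytes.
print(ET.tostring(long_form) == ET.tostring(short_form))
```

So a diff that only shows <element></element> turning into <element/> reflects a change in notation, not lost data.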
I'm attempting to pull sporting data from a URL using Beautiful Soup in Python. The issue I'm having with this data source is that the data appears inside an HTML tag. Specifically, this tag is named "match".
I'm after the players data, which seems to be in XML format. However, this data appears inside the "match" tag itself, rather than as content between the start and end tags.
So like this:
print(soup.match)
Returns: (not going to include all the text):
<match :matchdata='{"match":{"id":"5dbb8e20-6f37-11eb-924a-1f6b8ad68.....ALL DATA HERE....>
</match>
Because of this when I try to output the contents as text it returns empty.
print(soup.match.text)
Returns: nothing
How would I extract this data from within the "match" HTML tag? After this, I would like to save it as an XML file or, even better, as a CSV file.
My python program from the beginning is:
from bs4 import BeautifulSoup
import requests
url="___MY_URL_HERE___"
# Make a GET request for html content
html_content = requests.get(url).text
# Parse the html content
soup = BeautifulSoup(html_content, "lxml")
## type(soup)
## <class 'bs4.BeautifulSoup'>
print(soup.match)
Thanks a lot!
A tag may have any number of attributes. The tag <b id="boldest"> has an attribute “id” whose value is “boldest”. You can access a tag’s attributes by treating the tag like a dictionary.
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#attributes
So in your case
print(soup.match[":matchdata"])
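Since the attribute value in the question looks like JSON, it can probably be decoded with json.loads and written out with the csv module. A rough sketch under stated assumptions: the markup below is invented, and the "players" key with its "name" and "score" fields is hypothetical, so adjust the keys to whatever the real data contains:

```python
import csv
import io
import json

from bs4 import BeautifulSoup

# Hypothetical markup: the real attribute is far larger and its layout may differ.
html = """<match :matchdata='{"match": {"id": "m1", "players": [{"name": "A", "score": 3}, {"name": "B", "score": 5}]}}'></match>"""

soup = BeautifulSoup(html, "html.parser")

# The attribute value is a JSON string, so decode it into Python objects.
match_data = json.loads(soup.match[":matchdata"])
players = match_data["match"]["players"]  # assumed key names

# Write the rows to an in-memory buffer; swap in open("players.csv", "w", newline="") for a file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["name", "score"], lineterminator="\n")
writer.writeheader()
writer.writerows(players)
print(buffer.getvalue())
```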
I have a text file whose items all have the same structure, and I would like to parse them with Beautiful Soup.
An extract:
data = """<text id="1" sig="prenatfra-camppres-2017-part01-viewEvent-1&docRefId-0&docName-news%C2%B720170425%C2%B7LC%C2%B7assignment_862852&docIndex-3_1" title="Éditorial élection présidentielle" author="NULL" year="2017" date="25/04/2017" section="NULL" sourcename="La Croix" sourcesig="LC" polarity="Positif" about="Le Pen|Macron">
<p type="title">Éditorial élection présidentielle</p>
</text>"""
soup = BeautifulSoup(data, 'html.parser') #
I know the data is not really pure HTML.
I would like to extract every "about" value, for example.
print(soup.find_all('about')) => it returns an empty array!
Perhaps I use a wrong parser?
Thanks a lot.
Best regards.
Théo
If you check the documentation carefully for find_all, it looks for tags with the specified name.
So in this case, you should look for the text tag(s) and then retrieve the about attribute from them.
A working example would look like this:
from bs4 import BeautifulSoup
data = """<text id="1" sig="prenatfra-camppres-2017-part01-viewEvent-1&docRefId-0&docName-news%C2%B720170425%C2%B7LC%C2%B7assignment_862852&docIndex-3_1" title="Éditorial élection présidentielle" author="NULL" year="2017" date="25/04/2017" section="NULL" sourcename="La Croix" sourcesig="LC" polarity="Positif" about="Le Pen|Macron">
<p type="title">Éditorial élection présidentielle</p>
</text>"""
soup = BeautifulSoup(data, 'html.parser')
# to get the 'about' attribute from the first text element
print(soup.find_all('text')[0]['about'])
# to get the 'about' attributes from all the text elements, as a list
print([text['about'] for text in soup.find_all('text')])
Output:
Le Pen|Macron
['Le Pen|Macron']
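If the attribute could sit on tags with different names, find_all can also filter on the attribute itself. A small sketch, assuming a made-up two-element sample in the same shape as the extract; splitting on "|" is an assumption about how the values are delimited:

```python
from bs4 import BeautifulSoup

# Hypothetical two-element sample in the same shape as the extract.
data = """<text id="1" about="Le Pen|Macron"><p type="title">First</p></text>
<text id="2" about="Macron"><p type="title">Second</p></text>"""

soup = BeautifulSoup(data, "html.parser")

# attrs={"about": True} matches any tag that carries the attribute, whatever its value.
tagged = soup.find_all(attrs={"about": True})

# Split each value into the individual names it lists.
about_values = [tag["about"].split("|") for tag in tagged]
print(about_values)
```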
I am trying to build a function in a python webscraper that moves to the next page in a list of results. I am having trouble locating the element in beautiful soup as the link is found at the end of many other tags, and doesn't have any attributes such as class or ID.
Here is a snippet of the html:
<a href="http://www.url?&=page=2">
Next
</a>
I have been reading the bs4 documentation trying to understand how I can extract the URL, but I am coming up stumped. I am thinking that it could be done by either:
1. finding the last .a['href'] in the parent element, as it is always the last one.
2. finding the href based on the fact that it always has text of 'Next'
I don't know how to write something that would solve either 1. or 2.
Am I along the right lines? Does anyone have any suggestions to achieve my goal? Thanks
To find the <a> tag that contains the text Next, you can do:
from bs4 import BeautifulSoup
txt = '''
<a href="http://www.url?&=page=2">
Next
</a>'''
soup = BeautifulSoup(txt, 'html.parser')
print(soup.select_one('a:contains("Next")')['href'])
Prints:
http://www.url?&=page=2
Or:
print(soup.find('a', text=lambda t: t is not None and t.strip() == 'Next')['href'])  # the guard skips <a> tags without a single text node
To get the last <a> tag inside some element, you can index the ResultSet with [-1]:
from bs4 import BeautifulSoup
txt = '''
<div id="block">
<a href="#">Some other link</a>
<a href="http://www.url?&=page=2">Next</a>
</div>
'''
soup = BeautifulSoup(txt, 'html.parser')
print(soup.select('div#block > a')[-1]['href'])
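Wrapped into the kind of helper the question asks for, this could look roughly like the sketch below. The function name next_page_url and the example.com URLs are made up for illustration. It compares the stripped link text, so surrounding whitespace and newlines do not matter, and it returns None when there is no Next link:

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def next_page_url(html, base_url):
    """Return the absolute URL of the 'Next' link, or None if there is none."""
    soup = BeautifulSoup(html, "html.parser")
    for link in soup.find_all("a", href=True):
        if link.get_text(strip=True) == "Next":
            # Resolve relative links against the page they came from.
            return urljoin(base_url, link["href"])
    return None


page = '<div><a href="/list?page=1">Prev</a><a href="/list?page=2">\nNext\n</a></div>'
print(next_page_url(page, "http://example.com/list"))
print(next_page_url("<div>No more pages</div>", "http://example.com/list"))
```

A scraper loop can then keep requesting pages until the helper returns None.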
I am trying to parse a webpage (forums.macrumors.com) and get a list of all the threads posted.
So I have got this so far:
import urllib2
import re
from BeautifulSoup import BeautifulSoup
address = "http://forums.macrumors.com/forums/os/"
website = urllib2.urlopen(address)
website_html = website.read()
text = urllib2.urlopen(address).read()
soup = BeautifulSoup(text)
Now the webpage source has this code at the start of each thread:
<li id="thread-1880" class="discussionListItem visible sticky WikiPost "
data-author="ABCD">
How do I parse this so I can then get to the thread link within this li tag? Thanks for the help.
From your code, soup is the BeautifulSoup object built from your HTML. The question is which part of the tag you're looking for is static. Is the id always the same? The class?
Finding by the id:
my_li = soup.find('li', {'id': 'thread-1880'})
Finding by the class:
my_li = soup.find('li', {'class': 'discussionListItem visible sticky WikiPost'})
Ideally you would figure out the unique class you can check for and use that instead of a list of classes.
If you are expecting an a tag inside of this object, you can do this to check:
if my_li and my_li.a:
    print(my_li.a.attrs.get('href'))
I always recommend checking, though, because if my_li ends up being None or there is no a tag inside of it, your code will fail.
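As a sketch of matching on a single class rather than the full class string, assuming bs4 (not the older BeautifulSoup 3 import from the question) and some invented markup modelled on the li tag in the question:

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the <li> from the question.
html = '''<ul>
<li id="thread-1880" class="discussionListItem visible sticky WikiPost " data-author="ABCD"><a href="/threads/1880/">Sticky</a></li>
<li id="thread-1881" class="discussionListItem visible " data-author="EFGH"><a href="/threads/1881/">Normal</a></li>
</ul>'''

soup = BeautifulSoup(html, "html.parser")

# class_ matches a tag when any one of its classes equals the given name.
items = soup.find_all("li", class_="discussionListItem")
print([li["id"] for li in items])
```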
For more details, check out the BeautifulSoup documentation
http://www.crummy.com/software/BeautifulSoup/bs4/doc/
The idea here would be to use CSS selectors and to get the a elements inside the h3 with class="title" inside the div with class="titleText" inside the li element having the id attribute starting with "thread":
for link in soup.select("div.discussionList li[id^=thread] div.titleText h3.title a[href]"):
    print(link["href"])
You can tweak the selector further, but this should give you a good starting point.
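For example, a shortened variant of that selector can be tried against a small inline document. This is only a sketch: the markup is invented to mirror the structure described above, and select needs bs4 rather than BeautifulSoup 3:

```python
from bs4 import BeautifulSoup

# Hypothetical markup mirroring the structure described in the answer.
html = '''<div class="discussionList"><ol>
<li id="thread-1880" class="discussionListItem"><div class="titleText"><h3 class="title"><a href="threads/first.1880/">First</a></h3></div></li>
<li id="thread-1881" class="discussionListItem"><div class="titleText"><h3 class="title"><a href="threads/second.1881/">Second</a></h3></div></li>
<li id="footer-note"><a href="/help">Help</a></li>
</ol></div>'''

soup = BeautifulSoup(html, "html.parser")

# [id^="thread-"] keeps only the <li> elements whose id starts with "thread-".
links = [a["href"] for a in soup.select('li[id^="thread-"] h3.title a[href]')]
print(links)
```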