BeautifulSoup: how to find all the "about" attributes from an HTML string - Python

In a text file, these items all have the same structure, and I would like to parse them with Beautiful Soup.
An extract:
data = """<text id="1" sig="prenatfra-camppres-2017-part01-viewEvent-1&docRefId-0&docName-news%C2%B720170425%C2%B7LC%C2%B7assignment_862852&docIndex-3_1" title="Éditorial élection présidentielle" author="NULL" year="2017" date="25/04/2017" section="NULL" sourcename="La Croix" sourcesig="LC" polarity="Positif" about="Le Pen|Macron">
<p type="title">Éditorial élection présidentielle</p>
</text>"""
soup = BeautifulSoup(data, 'html.parser')
I know the data is not really pure HTML.
I would like to extract all the "about" attributes, for example.
print(soup.find_all('about')) => it returns an empty list!
Perhaps I am using the wrong parser?
Thanks a lot.
Best regards.
Théo

If you check the documentation for find_all carefully, you'll see that it looks for tags with the specified name.
So in this case, you should look for the text tag(s) and then retrieve the about attribute from them.
A working example would look like this:
from bs4 import BeautifulSoup
data = """<text id="1" sig="prenatfra-camppres-2017-part01-viewEvent-1&docRefId-0&docName-news%C2%B720170425%C2%B7LC%C2%B7assignment_862852&docIndex-3_1" title="Éditorial élection présidentielle" author="NULL" year="2017" date="25/04/2017" section="NULL" sourcename="La Croix" sourcesig="LC" polarity="Positif" about="Le Pen|Macron">
<p type="title">Éditorial élection présidentielle</p>
</text>"""
soup = BeautifulSoup(data, 'html.parser')
# to get the 'about' attribute from the first text element
print(soup.find_all('text')[0]['about'])
# to get the 'about' attributes from all the text elements, as a list
print([text['about'] for text in soup.find_all('text')])
Output:
Le Pen|Macron
['Le Pen|Macron']
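If you want to search by the attribute itself rather than by tag name, find_all also accepts an attrs filter that matches any tag carrying that attribute. A short sketch against the same soup:
# match any tag that has an "about" attribute, whatever the tag's name
tags_with_about = soup.find_all(attrs={"about": True})
print([tag["about"] for tag in tags_with_about])
# ['Le Pen|Macron']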

How to get a html text inside tag using BeautifulSoup

How can I extract data from example HTML with beautifulsoup?
<Tag1>
<message code="able to extract text from here"/>
<text value="able to extract text that is here"/>
<htmlText><![CDATA[<p>some thing <lite>OR</lite>get exact data from here</p>]]></htmlText>
</Tag1>
I tried both .find_all and .get_text, but I am not able to extract the text values from the htmlText element.
Expected output:
some thing ORget exact data from here
You could use BeautifulSoup twice: first extract the htmlText element and then parse its contents. For example:
from bs4 import BeautifulSoup
import lxml
html = """
<Tag1>
<message code="able to extract text from here"/>
<text value="able to extract text that is here"/>
<htmlText><![CDATA[<p>some thing <lite>OR</lite>get exact data from here</p>]]></htmlText>
</Tag1>
"""
soup = BeautifulSoup(html, "lxml")
for tag1 in soup.find_all("tag1"):
    cdata_html = tag1.htmltext.text
    cdata_soup = BeautifulSoup(cdata_html, "lxml")
    print(cdata_soup.p.text)
Which would display:
some thing ORget exact data from here
Note: lxml also needs to be installed (pip install lxml); BeautifulSoup will use it automatically when you pass "lxml" as the parser.
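As an alternative that avoids lxml, a hedged sketch using html.parser, which exposes the CDATA section as a bs4 CData node (note that html.parser lowercases tag names, hence "htmltext"), on the same html string:
from bs4 import BeautifulSoup, CData

soup = BeautifulSoup(html, "html.parser")
# html.parser keeps the CDATA section as a CData node inside <htmlText>
cdata = soup.find("htmltext").find(text=lambda t: isinstance(t, CData))
print(BeautifulSoup(cdata, "html.parser").p.text)
# some thing ORget exact data from here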
Here are the steps you need to take:
# firstly, select all "htmlText" elements
soup.select("htmlText")

# secondly, iterate over all of them
for result in soup.select("htmlText"):
    ...  # further code

# thirdly, use another BeautifulSoup() object to parse the data,
# otherwise you can't access <p>, <lite> elements data
# since they are unreachable to the first BeautifulSoup() object
for result in soup.select("htmlText"):
    final = BeautifulSoup(result.text, "lxml")

# fourthly, grab all <p> elements AND their .text -> "p.text"
for result in soup.select("htmlText"):
    final = BeautifulSoup(result.text, "lxml").p.text
Code and full example (use whichever version is most readable):
from bs4 import BeautifulSoup
import lxml
html = """
<Tag1>
<message code="able to extract text from here"/>
<text value="able to extract text that is here"/>
<htmlText><![CDATA[<p>some thing <lite>OR</lite>get exact data from here</p>]]></htmlText>
</Tag1>
"""
soup = BeautifulSoup(html, "lxml")
# BeautifulSoup inside BeautifulSoup
unreadable_soup = BeautifulSoup(BeautifulSoup(html, "lxml").select_one('htmlText').text, "lxml").p.text
print(unreadable_soup)
example_1 = BeautifulSoup(soup.select_one('htmlText').text, "lxml").p.text
print(example_1)

# without hardcoded list slices
for result in soup.select("htmlText"):
    example_2 = BeautifulSoup(result.text, "lxml").p.text
    print(example_2)
# or one liner
example_3 = ''.join([BeautifulSoup(result.text, "lxml").p.text for result in soup.select("htmlText")])
print(example_3)
# output
'''
some thing ORget exact data from here
some thing ORget exact data from here
some thing ORget exact data from here
some thing ORget exact data from here
'''

Python: BeautifulSoup Pulling/Parsing data from within html tag

I'm attempting to pull sporting data from a URL using Beautiful Soup in Python. The issue I'm having with this data source is that the data appears inside the html tag itself. Specifically, this tag is titled "match".
I'm after the players data, which appears to be in JSON format. However, this data is appearing within the "match" tag's attributes rather than as the content between the start/end tags.
So like this:
print(soup.match)
Returns: (not going to include all the text):
<match :matchdata='{"match":{"id":"5dbb8e20-6f37-11eb-924a-1f6b8ad68.....ALL DATA HERE....>
</match>
Because of this, when I try to output the contents as text it returns nothing.
print(soup.match.text)
Returns: nothing
How would I extract this data from within the "match" html tag? After this I would like to either save it as an XML file or, even better, a CSV file.
My python program from the beginning is:
from bs4 import BeautifulSoup
import requests
url="___MY_URL_HERE___"
# Make a GET request for html content
html_content = requests.get(url).text
# Parse the html content
soup = BeautifulSoup(html_content, "lxml")
## type(soup)
## <class 'bs4.BeautifulSoup'>
print(soup.match)
Thanks a lot!
A tag may have any number of attributes. The tag <b id="boldest"> has an attribute "id" whose value is "boldest". You can access a tag's attributes by treating the tag like a dictionary:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#attributes
So in your case
print(soup.match[":matchdata"])
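Since the attribute value is a JSON string, you could then parse it with the json module and write the part you need to CSV. A minimal sketch, assuming the players sit under a "players" key; the real key names depend on the feed, so adjust them to the actual payload:
import json
import csv

match_data = json.loads(soup.match[":matchdata"])["match"]
players = match_data.get("players", [])  # hypothetical key, check the JSON structure

with open("players.csv", "w", newline="") as f:
    # collect every field name that appears across the player dicts
    fieldnames = sorted({key for player in players for key in player})
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(players)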

How to extract last modified date of a link using beautiful soup

Using this code:
soup = BeautifulSoup(requests.get(url).content, "html.parser")
Here is the data I extracted with Beautiful Soup:
<pre>Name Last modified Size</pre>
<hr/>
<pre>
../
0.1.0/
21-Oct-2020 14:06 -
</pre>
I am trying to get the 'Last Modified' data associated with the 'a' tag. In this example, I want to make a dict with the key being '0.1.0' (I know how to extract this) and the value being '21-Oct-2020 14:06'.
EDIT
OK, so after playing around I was able to get the text:
(Pdb) soup.findAll("pre")[1].get_text()
'../\n0.1.0/ 21-Oct-2020 14:06 -\n'
I guess just iterating over each 'pre' tag will do it
thx
You can use a regex for that:
import re
import requests

data = re.findall(r'\d{2}-\w{3}-\d{4} \d{2}:\d{2}', requests.get(url).text)[0]
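To build the dict you describe, one option is to pair each name with the date that follows it. A hedged sketch, assuming each entry in the second pre block keeps the name and its date on the same line:
import re
import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get(url).content, "html.parser")
listing = soup.findAll("pre")[1].get_text()
# pair every "name/" with the "dd-Mon-yyyy hh:mm" date that follows it
versions = dict(re.findall(r'(\S+)/\s+(\d{2}-\w{3}-\d{4} \d{2}:\d{2})', listing))
print(versions)  # {'0.1.0': '21-Oct-2020 14:06'}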

Parsing the html of the child element [BeautifulSoup]

I have only been learning Python for two weeks.
I'm scraping an XML file, and one of the elements in the loop, item -> description, has HTML inside it. How could I get the text inside the p tags?
url="https://www.milenio.com/rss"
source=requests.get(url)
soup=BeautifulSoup(source.content, features="xml")
items=soup.findAll('item')
for item in items:
html_text=item.description
# This returns HTML code: <p>Paragraph 1</p> <p>Paragraph 2</p>
This next line could work, BUT it also returns some internal/external links and images, which I don't need.
desc=item.description.get_text()
So, if I make a loop trying to get all the p tags, it doesn't work:
for p in html_text.find_all('p'):
    print(p)
AttributeError: 'NoneType' object has no attribute 'find_all'
Thank you so much!
The issue is how bs4 processes CData (it's pretty well documented, but there isn't a neat built-in solution).
You'll need to import CData from bs4, which helps extract the CData as a string, and use the html.parser library. From there, create a new bs4 object from that string to give it a findAll attribute, and iterate over its contents.
from bs4 import BeautifulSoup, CData
import requests
url="https://www.milenio.com/rss"
source=requests.get(url)
soup = BeautifulSoup(source.content, 'html.parser')
items=soup.findAll('item')
for item in items:
    html_text = item.description
    findCdata = html_text.find(text=lambda tag: isinstance(tag, CData))
    newSoup = BeautifulSoup(findCdata, 'html.parser')
    paragraphs = newSoup.findAll('p')
    for p in paragraphs:
        print(p.get_text())
Edit:
OP needed to extract the link text and found that to only be possible inside the item loop, using link = item.link.nextSibling, because the link content was jumping outside of its tag, like so: </link>http://www.... In the XML tree view, this particular XML doc showed a drop-down for the link element, which is likely the cause.
To get content from other tags inside the document that don't show a drop-down in the XML tree view and don't have nested CData, convert the tag name to lowercase and get the text as usual:
item.pubdate.get_text() # Gets the contents of the tag <pubDate>
item.author.get_text() # Gets the contents of the tag <author>
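Putting the edit together with the CData approach above, a hedged sketch of the full item loop; exactly which fields exist depends on the feed:
for item in items:
    link = item.link.nextSibling          # the link text escapes its tag, as noted above
    published = item.pubdate.get_text()   # <pubDate>, lowercased by html.parser
    author = item.author.get_text()       # <author>
    cdata = item.description.find(text=lambda tag: isinstance(tag, CData))
    paragraphs = [p.get_text() for p in BeautifulSoup(cdata, 'html.parser').findAll('p')]
    print(link, published, author, paragraphs[:1])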
Alternatively, it should look like this:
for item in items:
    html_text=item.description #??
    # !! don't use html_text.find_all !!
    for p in item.find_all('p'):
        print(p)

Get all HTML tags with Beautiful Soup

I am trying to get a list of all html tags from beautiful soup.
I see find_all, but I have to know the name of the tag before I search.
If there is text like
html = """<div>something</div>
<div>something else</div>
<div class='magical'>hi there</div>
<p>ok</p>"""
How would I get a list like
list_of_tags = ["<div>", "<div>", "<div class='magical'>", "<p>"]
I know how to do this with regex, but am trying to learn BS4
You don't have to specify any arguments to find_all() - in this case, BeautifulSoup will find every tag in the tree, recursively.
Sample:
from bs4 import BeautifulSoup
html = """<div>something</div>
<div>something else</div>
<div class='magical'>hi there</div>
<p>ok</p>
"""
soup = BeautifulSoup(html, "html.parser")
print([tag.name for tag in soup.find_all()])
# ['div', 'div', 'div', 'p']
print([str(tag) for tag in soup.find_all()])
# ['<div>something</div>', '<div>something else</div>', '<div class="magical">hi there</div>', '<p>ok</p>']
Please try the below--
for tag in soup.findAll(True):
    print(tag.name)
I thought I'd share my solution to a very similar question for those that find themselves here later.
Example
I needed to find all tags quickly but only wanted unique values. I'll use the Python calendar module to demonstrate.
We'll generate an html calendar then parse it, finding all and only those unique tags present.
The structure below is very similar to the above, collecting the unique tag names into a set:
from bs4 import BeautifulSoup
import calendar
html_cal = calendar.HTMLCalendar().formatmonth(2020, 1)
set(tag.name for tag in BeautifulSoup(html_cal, 'html.parser').find_all())
# Result
# {'table', 'td', 'th', 'tr'}
If you want to find some specific HTML tags then try this:
html = driver.page_source
# driver.page_source: "<div>something</div>\n<div>something else</div>\n<div class='magical'>hi there</div>\n<p>ok</p>\n"
soup = BeautifulSoup(html, "html.parser")
for tag in soup.find_all(['a','div']): # Mention HTML tag names here.
    print(tag.text)
# Result:
# something
# something else
# hi there
Here is an efficient function that I use to parse different HTML and text documents:
from bs4 import BeautifulSoup
from tqdm import tqdm

def parse_docs(path, format, tags):
    """
    Parse the different files in path, having html or txt format, and extract the text content.
    Returns a list of strings, where every string is a text document content.
    :param path: str
    :param format: str
    :param tags: list
    :return: list
    """
    docs = []
    if format == "html":
        # get_list_of_files is a helper assumed to return the file paths under path
        for document in tqdm(get_list_of_files(path)):
            # print(document)
            soup = BeautifulSoup(open(document, encoding='utf-8').read())
            text = '\n'.join([''.join(s.findAll(text=True)) for s in
                              soup.findAll(tags)])  # parse all <p>, <div>, and <h> tags
            docs.append(text)
    else:
        for document in tqdm(get_list_of_files(path)):
            text = open(document, encoding='utf-8').read()
            docs.append(text)
    return docs
a simple call: parse_docs('/path/to/folder', 'html', ['p', 'h', 'div']) will return a list of text strings.
