Extract Page Intro Info with Beautiful Soup - python

I am new to Beautiful Soup and I am trying to extract the information that appears in a page's intro section. This info is contained in div elements with class="_50f3", and depending on the user there can be several entries (studies, studied, works, worked, lives, etc.). So far I have managed to select the divs with the following code, but I don't know how to extract the information I want from them.
table = soup.findAll('div', {'class': '_50f3'})
[<div class="_50f3">Lives in <a class="profileLink" data-hovercard="/ajax/hovercard/page.php?id=114148045261892" href="/Fort-Worth-Texas/114148045261892?ref=br_rs">Fort Worth, Texas</a></div>,
<div class="_50f3">From <a class="profileLink" data-hovercard="/ajax/hovercard/page.php?id=111762725508574" href="/Dallas-Texas/111762725508574?ref=br_rs">Dallas, Texas</a></div>]
For example, in the above I would like to store "Lives in" : "Fort Worth, Texas" and "From": "Dallas, Texas". But in the most general case I would like to store whatever info there is in there.
Any help greatly appreciated!

In the general case, get_text() is all you need: it walks the child nodes recursively and builds a single text string for each element:
table = soup.find_all('div', {'class': '_50f3'})
print([item.get_text(strip=True) for item in table])
But, you can also extract the labels and values separately:
d = {}
for item in table:
    # first text node inside the div, e.g. "Lives in "
    label = item.find(text=True)
    # the <a> element that follows the label
    value = label.next_sibling
    d[label.strip()] = value.get_text()
print(d)
Prints:
{'From': 'Dallas, Texas', 'Lives in': 'Fort Worth, Texas'}
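As a self-contained sketch of the same idea, the version below uses stripped_strings instead of next_sibling, so a div without a link is skipped rather than raising an error. The HTML is a trimmed copy of the question's sample; the third div is a hypothetical case added only to show the skip.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the sample markup; the last div is a made-up edge case.
html = '''
<div class="_50f3">Lives in <a class="profileLink" href="/Fort-Worth-Texas/114148045261892?ref=br_rs">Fort Worth, Texas</a></div>
<div class="_50f3">From <a class="profileLink" href="/Dallas-Texas/111762725508574?ref=br_rs">Dallas, Texas</a></div>
<div class="_50f3">No link here</div>
'''

soup = BeautifulSoup(html, "html.parser")

info = {}
for item in soup.find_all("div", class_="_50f3"):
    # stripped_strings yields every text fragment with surrounding whitespace removed
    parts = list(item.stripped_strings)
    if len(parts) < 2:
        continue  # no label/value pair in this div
    info[parts[0]] = " ".join(parts[1:])

print(info)
```

The first fragment is treated as the label and everything after it as the value, which is an assumption about how these divs are laid out.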

for item in table:
    print(item.text)
This should work as well.

Related

How to remove the span tag and class name after scraping, when I only want to scrape the text, using Python

for link in soup.findAll('li'):
    if "c-listing__authors-list" in str(link):
        # theAuthor = link.string
        theAuthor = str(link).replace("</p>","")
        theAuthor = theAuthor.split("</span>")[1]
        listAuthor.append(theAuthor)
Use get_text(strip=True) to get just the text:
for e in soup.select('li span.c-listing__authors-list'):
    theAuthor = e.get_text(strip=True)
or to get a list in one line:
theAuthor = [e.get_text(strip=True) for e in soup.select('li span.c-listing__authors-list')]
Example
from bs4 import BeautifulSoup
html='''
<ul>
<li><span class="c-listing__authors-list">a</span></li>
<li><span class="c-listing__authors-list">b</span></li>
<li><span>no list</span></li>
</ul>
'''
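# name the parser explicitly to avoid the "no parser was explicitly specified" warning
soup = BeautifulSoup(html, 'html.parser')
theAuthor = []
for e in soup.select('li span.c-listing__authors-list'):
    theAuthor.append(e.get_text(strip=True))
print(theAuthor)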
Output
['a', 'b']
This answer is Microsoft (.NET) centric, but I'm hoping it may help point you in the right direction.
It's been a while since I've created a scraper, but I think this is possible if you know XPath: I recall reading a web page into an HTMLDocument, selecting the required element with XPath, and then obtaining its text value.

Best way to pair a list of titles with a separate list of their corresponding links? (bs4)

Final edit: here's the solution:
list_c = [[x, y] for x, y in zip(titleList, linkList)]
Original post: I used bs4 to scrape a recipe website where the title of each recipe is not stored inside the link tag. So I extracted the recipe titles from one part of the markup and the links from another, and now I have two lists (recipes, links), but I'm not sure of the best way to pair each title with its corresponding link.
(The end goal is to have the titles hyperlinked in an HTML file that I will put on my eventual recipe aggregator website.)
I was considering saving them to a dictionary as key-value pairs, or something else, so that I can pull them into the HTML doc later on.
Suggestions?
EDIT:
Here's the code, which works fine:
soup = BeautifulSoup(htmlText, 'lxml')
links = soup.find_all('article')
linkList = []
titleList = []
for link in links[0:12]:
    hyperL = link.find('header', class_ = 'entry-header').a['href']
    linkList.append(hyperL)
for title in links:
    x = title.get('aria-label')
    titleList.append(x)
linkList prints out something like
['www.recipe.com/ham', 'www.recipe.com/curry', 'www.recipe.com/etc']
and
titleList is ['Ham', 'Curry', 'etc']
I want to print a list from these 2 like this:
[['Ham', 'www.recipe.com/ham'],['Curry', 'www.recipe.com/curry']]
As the final goal for my website, I want to have the following for each pair:
<a href='www.recipe.com/ham'>Ham</a>
If you only anticipate looking up titles, and then using the result links, dictionaries are great for that.
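A minimal sketch of both options, using the sample values from the question: zip() pairs the lists positionally, dict() turns the pairs into a title-to-link lookup, and a small loop renders the anchors. The variable names pairs, lookup and anchors are illustrative, and escaping with html.escape is an added precaution rather than something the question asked for.

```python
from html import escape

titleList = ['Ham', 'Curry']
linkList = ['www.recipe.com/ham', 'www.recipe.com/curry']

# List of [title, link] pairs, as in the accepted solution
pairs = [[title, link] for title, link in zip(titleList, linkList)]

# Dictionary keyed by title, handy when you look links up by recipe name
lookup = dict(zip(titleList, linkList))

# Render one anchor tag per pair, escaping text that ends up in the HTML
anchors = [
    '<a href="{}">{}</a>'.format(escape(link, quote=True), escape(title))
    for title, link in pairs
]

print(pairs)
print(lookup['Ham'])
print('\n'.join(anchors))
```

Note that zip() stops at the shorter list, so both lists need to be built from the same set of articles for the pairing to line up.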

Getting the text value from cells of a table when scraping recursive structure

I'm looking for a function that, given a pair of HTML tags, returns the text inside them. Ideally I would like it to be recursive:
Examples:
Given
Asset management
returns
Asset management
Given
<p>Recursive Asset management</p>
returns
Recursive Asset management
Given
<p>Again Asset management</p>
returns
Again Asset management
Here is the code I have:
list_of_table_rows = tbl.findAll('tr')
for tr in list_of_table_rows[1:]:
    th_list = tr.find("th")
    td_list = tr.find("td")
    if th_list is None or td_list is None:
        continue
    th_str = th_list.text
    td_str = td_list.contents
    # Now the problem is that td_str is a list of a bunch of things:
    # plain text, br tags, links, paragraphs, etc.
    # I want to be able to get the plain text for links and paragraphs.
    for element in td_str:
        # skip <br/> tags (plain strings have no tag name)
        if getattr(element, "name", None) == "br":
            continue
        # here...
The input should be a String, not a Tag or any other object. My trouble is the recursion.
UPDATE: This is an example of the data I am actually working with. The goal is to pull information from Wikipedia Infoboxes. The problem is some of the information in the Infobox are links or paragraphs. For example, in this page: https://en.wikipedia.org/wiki/Goldman_Sachs
<tr><th scope="row" style="padding-right:0.5em;">Founders</th><td
class="agent" style="line-height:1.35em;">Marcus Goldman .
<br /><a href="/wiki/Samuel_Sachs" title="Samuel Sachs">Samuel
Sachs</a></td></tr><tr>
Let's say we want to find who the Founders are. I only want the text in the elements. In this case a list containing Marcus Goldman and Samuel Sachs. I have also tried read_html from Pandas, but that concatenates the strings together and I don't want that to happen (its output is "Marcus GoldmanSamuel Sachs")
Here's an example of using .findChildren. It's not the full solution, but you can possibly use it to build on @Bitto Bennichan's solution:
import bs4
html = '''<tr><th scope="row" style="padding-right:0.5em;">Founders</th><td
class="agent" style="line-height:1.35em;">Marcus Goldman .
<br /><a href="/wiki/Samuel_Sachs" title="Samuel Sachs">Samuel
Sachs</a></td></tr><tr>'''
soup = bs4.BeautifulSoup(html,'html.parser')
rows = soup.find_all('tr')
founders = []
for row in rows:
    children = row.findChildren("a" , recursive=True, text=True)
    for child in children:
        # collapse the line break inside the link text
        child_text = child.text.split('\n')
        child_text = [ x.strip() for x in child_text ]
        child_text = ' '.join(child_text)
        founders.append(child_text)
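The snippet above only collects text from links, so "Marcus Goldman" is missed. One way to keep every fragment separate, sketched here under the assumption that each row has one th and one td, is to iterate over the cell's stripped_strings, which descends through nested tags. The helper name cell_texts and the shortened attributes in the sample HTML are illustrative.

```python
from bs4 import BeautifulSoup

# Shortened version of the Goldman Sachs infobox row from the question
html = '''<tr><th scope="row">Founders</th><td class="agent">Marcus Goldman .
<br /><a href="/wiki/Samuel_Sachs" title="Samuel Sachs">Samuel
Sachs</a></td></tr>'''

soup = BeautifulSoup(html, "html.parser")


def cell_texts(cell):
    """Return each text fragment in the cell as its own string.

    stripped_strings walks all descendants, so text inside links or
    paragraphs is included; split/join collapses inner line breaks.
    """
    return [" ".join(s.split()) for s in cell.stripped_strings]


info = {}
for row in soup.find_all("tr"):
    th, td = row.find("th"), row.find("td")
    if th is None or td is None:
        continue
    info[th.get_text(strip=True)] = cell_texts(td)

print(info)
```

Because each fragment stays a separate list item, the names are not concatenated the way the read_html output was.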

BeautifulSoup find text through 2 terms in html tag - Python 3

I am trying to scrape some text from an HTML file; however, I need two values whose tags differ only in one attribute (contextref), for example:
1) <ix:nonfraction contextref="cfwd_30_04_2016" name="ns5:TangibleFixedAssets" unitref="GBP" decimals="0" format="ixt2:numdotdecimal" scale="0" xmlns:ix="http://www.xbrl.org/2008/inlineXBRL">180,649</ix:nonfraction>
2) <ix:nonfraction contextref="cfwd_30_04_2015" name="ns5:TangibleFixedAssets" unitref="GBP" decimals="0" format="ixt2:numdotdecimal" scale="0" xmlns:ix="http://www.xbrl.org/2008/inlineXBRL">200,395</ix:nonfraction>
At the moment my code to find the text is: var1 = soup.find('ix:nonfraction', {'name': 'uk-gaap:{}'.format(variable)}).text, which for the examples above gives 180,649.
To get both values I would need to filter on another attribute (contextref) along with name. I've played around with different combinations but can't seem to make it work.
Any help would be great, thanks.
import bs4
html = '''<ix:nonfraction contextref="cfwd_30_04_2016" name="ns5:TangibleFixedAssets" unitref="GBP" decimals="0" format="ixt2:numdotdecimal" scale="0" xmlns:ix="http://www.xbrl.org/2008/inlineXBRL">180,649</ix:nonfraction>
<ix:nonfraction contextref="cfwd_30_04_2015" name="ns5:TangibleFixedAssets" unitref="GBP" decimals="0" format="ixt2:numdotdecimal" scale="0" xmlns:ix="http://www.xbrl.org/2008/inlineXBRL">200,395</ix:nonfraction>'''
soup = bs4.BeautifulSoup(html, 'lxml')
var1, var2 = [i.text for i in soup.find_all('ix:nonfraction')]
out:
('180,649', '200,395')
You can use contextref as a keyword argument in find_all():
soup.find_all('ix:nonfraction', contextref=True)
This keeps only the tags that have a contextref attribute.
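To pick one specific value, both attributes can be matched at once. Since name is already the first parameter of find(), it cannot be passed as a keyword, so the sketch below puts both filters in the attrs dictionary. The helper value_for is a hypothetical name, and html.parser is used here instead of lxml only to keep the example dependency-free.

```python
from bs4 import BeautifulSoup

html = '''<ix:nonfraction contextref="cfwd_30_04_2016" name="ns5:TangibleFixedAssets" unitref="GBP" decimals="0" format="ixt2:numdotdecimal" scale="0" xmlns:ix="http://www.xbrl.org/2008/inlineXBRL">180,649</ix:nonfraction>
<ix:nonfraction contextref="cfwd_30_04_2015" name="ns5:TangibleFixedAssets" unitref="GBP" decimals="0" format="ixt2:numdotdecimal" scale="0" xmlns:ix="http://www.xbrl.org/2008/inlineXBRL">200,395</ix:nonfraction>'''

soup = BeautifulSoup(html, "html.parser")


def value_for(soup, name, contextref):
    """Return the text of the tag matching both attributes, or None."""
    tag = soup.find(
        "ix:nonfraction",
        attrs={"name": name, "contextref": contextref},
    )
    return tag.get_text(strip=True) if tag is not None else None


value_2016 = value_for(soup, "ns5:TangibleFixedAssets", "cfwd_30_04_2016")
value_2015 = value_for(soup, "ns5:TangibleFixedAssets", "cfwd_30_04_2015")

# Alternatively, collect every context for one name in a single pass
by_context = {
    tag["contextref"]: tag.get_text(strip=True)
    for tag in soup.find_all("ix:nonfraction", attrs={"name": "ns5:TangibleFixedAssets"})
}

print(value_2016, value_2015)
print(by_context)
```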

How to print multiple values from BeautifulSoup with Python

I'm trying to scrape two values from a webpage using BeautifulSoup. When printing only one value, the content looks good. However, when printing two values (on the same line), HTML code is displayed around one of the values.
Here is my code:
from bs4 import BeautifulSoup
import urllib.request as urllib2
list_open = open("source.txt")
read_list = list_open.read()
line_in_list = read_list.split("\n")
i = 0
for url in line_in_list:
    soup = BeautifulSoup(urllib2.urlopen(url).read(), 'html.parser')
    sku = soup.find_all(attrs={'class': "identifier"})
    description = soup.find_all(attrs={'class': "description"})
    for text in description:
        print((sku), text.getText())
    i += 1
And the output looks like this:
[<span class="identifier">112404</span>] A natural for...etc
[<span class="identifier">110027</span>] After what...etc
[<span class="identifier">03BA5730</span>] Argentina is know...etc
[<span class="identifier">090030</span>] To be carried...etc
The output should preferably be without the [<span class="identifier">-thing around the numbers...
I guess the problem is in the last for-loop, but I have no idea how to correct it. All help is appreciated. Thanks! -Espen
It looks like you need to zip() identifiers and descriptions and call getText() for every tag found in the loop:
identifiers = soup.find_all(attrs={'class': "identifier"})
descriptions = soup.find_all(attrs={'class': "description"})
for identifier, description in zip(identifiers, descriptions):
    print(identifier.getText(), description.getText())
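A runnable sketch of the zip() approach follows. The real pages were not posted, so the HTML here is hypothetical markup that merely reuses the class names and the first two values from the sample output.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment; only the class names come from the question.
html = '''
<div><span class="identifier">112404</span><p class="description">A natural for...</p></div>
<div><span class="identifier">110027</span><p class="description">After what...</p></div>
'''

soup = BeautifulSoup(html, "html.parser")

identifiers = soup.find_all(class_="identifier")
descriptions = soup.find_all(class_="description")

# Pair the n-th identifier with the n-th description and keep only the text
rows = [
    (identifier.get_text(strip=True), description.get_text(strip=True))
    for identifier, description in zip(identifiers, descriptions)
]

for sku, text in rows:
    print(sku, text)
```

This relies on identifiers and descriptions appearing in the same order and in equal numbers; if a page can omit one of them, the pairing would drift.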
find_all() returns a ResultSet, which is more or less a fancy list. Printing a ResultSet will include the surrounding left and right square brackets that typically denote a list, and the items (tags) will be displayed within.
Your sample output suggests that the HTML for each URL contains one SKU and one description. If that is correct, your code could just pick off the first item in each ResultSet like this:
sku = soup.find_all(attrs={'class': "identifier"})
description = soup.find_all(attrs={'class': "description"})
print(sku[0].get_text(), description[0].get_text())
Or, you could just find the first of each using find():
sku = soup.find(attrs={'class': "identifier"})
description = soup.find(attrs={'class': "description"})
print(sku.get_text(), description.get_text())
However, your code suggests that there can be multiple descriptions for each SKU, because you are iterating over the description result set. Perhaps there can be multiple SKUs and descriptions per page (in which case see @alecxe's answer)? It's difficult to tell.
If you could update your question by adding live URLs or sample HTML we could offer better advice.
