BeautifulSoup deleting first half of HTML? - python

I'm practicing with BeautifulSoup and HTML requests in general for the first time. The goal of the program is to load a webpage and its HTML, then search through it (in this case a recipe, to get a substring of its ingredients). I've managed to get it working with the following code:
url = "https://www.bbcgoodfood.com/recipes/healthy-tikka-masala"
result = requests.get(url)
myHTML = result.text
index1 = myHTML.find("recipeIngredient")
index2 = myHTML.find("recipeInstructions")
ingredients = myHTML[index1:index2]
But when I try and use BeautifulSoup here:
url = "https://www.bbcgoodfood.com/recipes/healthy-tikka-masala"
result = requests.get(url)
doc = BeautifulSoup(result.text, "html.parser")
ingredients = doc.find(text = "recipeIngredient")
print(ingredients)
I understand that the code above (even if I could get it working) would produce a different output of just ["recipeIngredient"], but that's all I'm focused on for now whilst I get to grips with BS. Instead, the code above just outputs None. I printed doc to the terminal and it would only output what appears to be the second half of the HTML (or at least not all of it), whereas the raw result.text does contain all of the HTML. So I assume that's where the problem lies, but I'm not sure how to fix it.
Thank you.

You need to use:
class_="recipe__ingredients"
For example:
import requests
from bs4 import BeautifulSoup
url = "https://www.bbcgoodfood.com/recipes/healthy-tikka-masala"
doc = (
    BeautifulSoup(requests.get(url).text, "html.parser")
    .find(class_="recipe__ingredients")
)
ingredients = "\n".join(
    ingredient.getText() for ingredient in doc.find_all("li")
)
print(ingredients)
Output:
1 large onion , chopped
4 large garlic cloves
thumb-sized piece of ginger
2 tbsp rapeseed oil
4 small skinless chicken breasts, cut into chunks
2 tbsp tikka spice powder
1 tsp cayenne pepper
400g can chopped tomatoes
40g ground almonds
200g spinach
3 tbsp fat-free natural yogurt
½ small bunch of coriander , chopped
brown basmati rice , to serve

It outputs None because it's looking for an element whose text content is 'recipeIngredient', which does not exist: that string never appears as the text of a tag, only as part of an attribute or script.
What you are actually trying to do with bs4 is find specific tags and/or attributes that hold the data/content you want. For example, as @baduker points out, the ingredients in the HTML are within the tag with the class attribute "recipe__ingredients".
The string 'recipeIngredient' that you pull out in that first block of code actually comes from a <script> tag in the HTML, which holds the ingredients in JSON format.
from bs4 import BeautifulSoup
import requests
import json
url = "https://www.bbcgoodfood.com/recipes/healthy-tikka-masala"
result = requests.get(url)
doc = BeautifulSoup(result.text, "html.parser")
ingredients = doc.find('script', {'type':'application/ld+json'}).text
jsonData = json.loads(ingredients)
print(jsonData['recipeIngredient'])
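To see concretely why the text= search comes back None, here is a minimal self-contained example (the HTML snippet is invented for illustration): a string that only occurs as an attribute value is invisible to a text search, while a class search finds the tag.

```python
from bs4 import BeautifulSoup

# "recipeIngredient" appears only as an attribute value, never as tag text
html = '<p class="recipeIngredient">flour</p>'
soup = BeautifulSoup(html, "html.parser")

# A text search looks at text nodes only, so it misses attribute values
print(soup.find(text="recipeIngredient"))  # None

# Searching by the attribute itself finds the tag and its text
print(soup.find(class_="recipeIngredient").get_text())  # flour
```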

Related

How can I split the text elements out from this HTML String? Python

Good Morning,
I'm doing some HTML parsing in Python and I've run across the following which is a time & name pairing in a single table cell. I'm trying to extract each piece of info separately and have tried several different approaches to split the following string.
HTML String:
<span><strong>13:30</strong><br/>SecondWord</span></a>
My output would hopefully be:
text1 = 13:30
text2 = "SecondWord"
I'm currently using a loop through all the rows in the table, where I'm taking the text and splitting it by a newline. I noticed the HTML has a line-break tag in between, so it renders on separate lines on the web. I was trying to replace this with a newline and run my split on that; however, my string.replace() and re.sub() approaches don't seem to be working.
I'd love to know what I'm doing wrong.
Latest Approach:
resub_pat = r'<br/>'
rows = list()
for row in table.findAll("tr"):
    a = re.sub(resub_pat, "\n", row.text).split("\n")
This is a bit hashed together, but I hope I've captured my problem! I wasn't able to find any similar issues.
You could try:
from bs4 import BeautifulSoup
import re
# the soup
soup = BeautifulSoup("<span><strong>13:30</strong><br/>SecondWord</span></a>", 'lxml')
# the regex object
rx = re.compile(r'(\d+:\d+)(.+)')
# time, text
text = soup.find('span').get_text()
x,y = rx.findall(text)[0]
print(x)
print(y)
Use recursive=False to get only the direct text of the <span>, and strong.text to get the other value.
Ex:
from bs4 import BeautifulSoup
soup = BeautifulSoup("<span><strong>13:30</strong><br/>SecondWord</span></a>", 'lxml')
# text1
print(soup.find("span").strong.text) # --> 13:30
# text2
print(soup.find("span").find(text=True, recursive=False)) # --> SecondWord
from bs4 import BeautifulSoup
txt = '''<span><strong>13:30</strong><br/>SecondWord</span></a>'''
soup = BeautifulSoup(txt, 'html.parser')
text1, text2 = soup.span.get_text(strip=True, separator='|').split('|')
print(text1)
print(text2)
Prints:
13:30
SecondWord

Scraping images related to a specific content in websites

I have a list of websites and I would need to collect images possibly related to a specific article/news.
For example:
Link: https://www.bbc.com/sport/football/34893228
The news I am interested in begins with: Manchester United manager Louis van Gaal confirms he wants to bring Cristiano Ronaldo back to Old Trafford.
Link: https://www.bbc.co.uk/sport/football/34923104
News begins with: BBC Sport looks at the best quotes from legendary Manchester United wingers George Best and Cristiano Ronaldo.
I would need to get all the images related to those news.
I specified the news beginning because sometimes in the same page there can be other articles/news that we are not interested in.
I have tried as follows:
import pandas as pd
from urllib.request import urlopen
from bs4 import BeautifulSoup
from IPython.core.display import HTML

df = pd.DataFrame({'Website': ['https://www.bbc.com/sport/football/34893228', 'https://www.bbc.co.uk/sport/football/34923104']})
img = []
for x in df.Website:
    print(x)
    html = urlopen(x)
    bs = BeautifulSoup(html, 'html.parser')
    images = bs.find_all('img', {})
    for image in images:
        print(image['src'] + '\n')
        img.append(image['src'])
print(img)

def path_to_image_html(path):
    return '<img src="' + path + '" width="60" >'

pd.set_option('display.max_colwidth', -1)
HTML(df.to_html(escape=False, formatters=dict(image=path_to_image_html)))
The code does not work. What I am trying to do there is to get all the images and store them as small pictures to include in the dataframe/dataset. Each image should be stored in the same row as the website it was taken from.
I do not know how to use the article text as the input/search to run against the webpage. With two links it would be easy, but unfortunately I have a dataset of thousands of links to scrape.
I hope you could help me.
Thank you.
If it's a simple image source, you can get it with the following code; but if you look at the website source, the image sources are complex, and you will need code that handles that. I can only help you so far.
import pandas as pd
import urllib
from bs4 import BeautifulSoup
from IPython.core.display import HTML
df = pd.DataFrame(index=[], columns=['Website', 'contents'])
Website = ['https://www.bbc.com/sport/football/34893228', 'https://www.bbc.co.uk/sport/football/34923104']
loc = 0
for x in Website:
    html = urllib.request.urlopen(x)
    bs = BeautifulSoup(html, 'html.parser')
    txts = bs.find_all('div', attrs={'id': 'story-body'})
    for i, txt in enumerate(txts):
        df.loc[i+loc, ['Website']] = x
        df.loc[i+loc, ['contents']] = txt.text.replace('\n', '')
    loc += 1
    #print(i, x, txt.text.replace('\n', ''))
df.to_html('bbc.html', escape=False)
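To tie images back to one particular article, one approach is to check whether the story text starts with the known opening sentence before collecting img tags. A minimal sketch on a static snippet (the HTML and image paths are invented for illustration; story-body is the div id the answer's code targets):

```python
from bs4 import BeautifulSoup

# Invented stand-in for a fetched page
html = """
<div id="story-body">
  <p>Manchester United manager Louis van Gaal confirms he wants to bring
  Cristiano Ronaldo back to Old Trafford.</p>
  <img src="/images/ronaldo.jpg"/>
  <img src="/images/van-gaal.jpg"/>
</div>
"""
lead = "Manchester United manager Louis van Gaal confirms"

soup = BeautifulSoup(html, "html.parser")
story = soup.find("div", id="story-body")

# Only collect images when the article opens with the expected sentence
srcs = []
if story and story.get_text(strip=True).startswith(lead):
    srcs = [img["src"] for img in story.find_all("img")]
print(srcs)  # ['/images/ronaldo.jpg', '/images/van-gaal.jpg']
```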

Remove <a> HTML tag from beautifulsoup results

Using beautifulsoup I'm able to scrape a web page with this code:
import requests
from bs4 import BeautifulSoup
page = requests.get("http://www.acbbroker.it/soci_dettaglio.php?r=3")
page
soup = BeautifulSoup(page.content, 'html.parser')
test = soup.find(id="paginainterna-content")
test_items = test.find_all(class_="entry-content")
tonight = test_items[0]
names = []
for x in tonight.find_all('a', itemprop="url"):
names.append(str(x))
print(names)
but I'm not able to clean the results and obtain only the text inside the <a> tag (removing also the href).
Here is a small snap of my result:
'A&B; Insurance e Reinsurance Brokers Srl', 'A.B.A. BROKERS SRL', 'ABC SRL BROKER E CONSULENTI DI ASSI.NE', 'AEGIS INTERMEDIA SAS',
What is the proper way to handle this kind of data and obtain a clean result?
Thank you
If you want only the text from the tag, use the get_text() method:
for x in tonight.find_all('a', itemprop="url"):
    names.append(x.get_text())
print(names)
Better yet, use a list comprehension, which is also faster:
names = [x.get_text() for x in tonight.find_all('a', itemprop='url')]
I don't know exactly what output you want, but you can get just the text by changing this line:
names.append(str(x.get_text()))
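The difference between appending str(x) and x.get_text() is easy to see on a single tag (the tag below is a made-up example in the same shape as the page's links):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a href="/x" itemprop="url">A.B.A. BROKERS SRL</a>',
                     'html.parser')
tag = soup.find('a', itemprop='url')

print(str(tag))        # the whole element, markup and href included
print(tag.get_text())  # just the visible text: A.B.A. BROKERS SRL
```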

Python - converting to list

import requests
from bs4 import BeautifulSoup
webpage = requests.get("http://www.nytimes.com/")
soup = BeautifulSoup(requests.get("http://www.nytimes.com/").text, "html.parser")
for story_heading in soup.find_all(class_="story-heading"):
    articles = story_heading.text.replace('\n', '').replace(' ', '')
    print(articles)
Here is my code; it prints out all the article titles on the website. I get strings:
Looking Back: 1980 | Funny, but Not Fit to Print
Brooklyn Studio With Room for Family and a Dog
Search for Homes for Sale or Rent
Sell Your Home
So, I want to convert this to a list = ['Search for Homes for Sale or Rent', 'Sell Your Home', ...], which will allow me to make some other manipulations like random.choice etc.
I tried:
alist = articles.split("\n")
print (alist)
['Looking Back: 1980 | Funny, but Not Fit to Print']
['Brooklyn Studio With Room for Family and a Dog']
['Search for Homes for Sale or Rent']
['Sell Your Home']
That is not the list I need. I'm stuck. Can you please help me with this part of the code?
You are constantly overwriting articles with the next value in each loop iteration. What you want to do instead is make articles a list and append in each iteration:
import requests
from bs4 import BeautifulSoup
webpage = requests.get("http://www.nytimes.com/")
soup = BeautifulSoup(requests.get("http://www.nytimes.com/").text, "html.parser")
articles = []
for story_heading in soup.find_all(class_="story-heading"):
    articles.append(story_heading.text.replace('\n', '').replace(' ', ''))
print(articles)
The output is huge, so this is a small sample of what it looks like:
['Global Deal Reached to Curb Chemical That Warms Planet', 'Accord Could Push A/C Out of Sweltering India’s Reach ',....]
Furthermore, you only need to strip the surrounding whitespace in each iteration; you don't need those replacements. So, you can do this with your story_heading.text instead:
articles.append(story_heading.text.strip())
Which, can now give you a final solution looking like this:
import requests
from bs4 import BeautifulSoup
webpage = requests.get("http://www.nytimes.com/")
soup = BeautifulSoup(requests.get("http://www.nytimes.com/").text, "html.parser")
articles = [story_heading.text.strip() for story_heading in soup.find_all(class_="story-heading")]
print (articles)
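To see why strip() is enough here, this is what it does to a typical heading string (the exact whitespace is illustrative): it removes leading and trailing whitespace, newlines included, in one call.

```python
# A heading as it might come out of .text, padded with newlines and spaces
raw = "\n        Sell Your Home\n    "
print(repr(raw.strip()))  # 'Sell Your Home'
```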

Python BeautifulSoup extracting text from result

I am trying to get the text from the content attribute, but when I try BeautifulSoup functions on the result variable it results in errors.
from bs4 import BeautifulSoup as bs
import requests
webpage = 'http://www.dictionary.com/browse/coypu'
r = requests.get(webpage)
page_text = r.text
soup = bs(page_text, 'html.parser')
result = soup.find_all('meta', attrs={'name':'description'})
print (result.get['contents'])
I am trying to get the result to read;
"Coypu definition, a large, South American, aquatic rodent, Myocastor (or Myopotamus) coypus, yielding the fur nutria. See more."
soup.find_all() returns a ResultSet (a list of tags). Since in your case it contains only one element, you can index it:
>>> type(result)
<class 'bs4.element.ResultSet'>
>>> type(result[0])
<class 'bs4.element.Tag'>
>>> result[0].get('content')
Coypu definition, a large, South American, aquatic rodent, Myocastor (or Myopotamus) coypus, yielding the fur nutria. See more.
When you only want the first or a single tag, use find; find_all returns a list/ResultSet:
result = soup.find('meta', attrs={'name':'description'})["content"]
You can also use a CSS selector with select_one:
result = soup.select_one('meta[name=description]')["content"]
You need not use find_all; using find alone gets the desired output:
from bs4 import BeautifulSoup as bs
import requests
webpage = 'http://www.dictionary.com/browse/coypu'
r = requests.get(webpage)
page_text = r.text
soup = bs(page_text, 'html.parser')
result = soup.find('meta', {'name':'description'})
print(result.get('content'))
it will print:
Coypu definition, a large, South American, aquatic rodent, Myocastor (or Myopotamus) coypus, yielding the fur nutria. See more.
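The find/find_all distinction can be checked offline on a tiny document (the meta tag below is invented):

```python
from bs4 import BeautifulSoup

html = '<meta name="description" content="A large aquatic rodent.">'
soup = BeautifulSoup(html, 'html.parser')

# find_all returns a list-like ResultSet; find returns the first Tag (or None)
tags = soup.find_all('meta', attrs={'name': 'description'})
tag = soup.find('meta', attrs={'name': 'description'})

print(len(tags))       # 1
print(tag['content'])  # A large aquatic rodent.
```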
