I have a list of websites, and I need to collect the images possibly related to a specific article/news item on each page.
For example:
Link: https://www.bbc.com/sport/football/34893228
The news item I am interested in begins with: Manchester United manager Louis van Gaal confirms he wants to bring Cristiano Ronaldo back to Old Trafford.
Link: https://www.bbc.co.uk/sport/football/34923104
News begins with: BBC Sport looks at the best quotes from legendary Manchester United wingers George Best and Cristiano Ronaldo.
I need to get all the images related to those news items.
I specified how each news item begins because the same page can sometimes contain other articles/news that we are not interested in.
I have tried as follows:
import pandas as pd
from urllib.request import urlopen
from bs4 import BeautifulSoup
from IPython.core.display import HTML

df = pd.DataFrame({'Website': ['https://www.bbc.com/sport/football/34893228', 'https://www.bbc.co.uk/sport/football/34923104']})

img = []
for x in df.Website:
    print(x)
    html = urlopen(x)
    bs = BeautifulSoup(html, 'html.parser')
    images = bs.find_all('img', {})
    for image in images:
        print(image['src'] + '\n')
        img.append(image['src'])
print(img)

def path_to_image_html(path):
    return '<img src="' + path + '" width="60" >'

pd.set_option('display.max_colwidth', -1)
HTML(df.to_html(escape=False, formatters=dict(image=path_to_image_html)))
The code does not work. What I am trying to do is get all the images and store them as small pictures to include in the dataframe/dataset. The images should be stored in the same row as the website they were taken from.
I do not know how to use the article's opening text as a search input on the webpage. With two links it would be easy, but unfortunately I have a dataset of thousands of links to scrape.
I hope you could help me.
Thank you.
If the image source is a simple attribute, you can get it with the following code. If you look at the website source, though, the image markup is more complex, so you will need code adapted to it; I can only help you this far.
import pandas as pd
import urllib.request
from bs4 import BeautifulSoup
from IPython.core.display import HTML

df = pd.DataFrame(index=[], columns=['Website', 'contents'])
Website = ['https://www.bbc.com/sport/football/34893228', 'https://www.bbc.co.uk/sport/football/34923104']

loc = 0
for x in Website:
    html = urllib.request.urlopen(x)
    bs = BeautifulSoup(html, 'html.parser')
    txts = bs.find_all('div', attrs={'id': 'story-body'})
    for i, txt in enumerate(txts):
        df.loc[i+loc, ['Website']] = x
        df.loc[i+loc, ['contents']] = txt.text.replace('\n', '')
        loc += 1
        # print(i, x, txt.text.replace('\n', ''))

df.to_html('bbc.html', escape=False)
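To also collect the images belonging to each story, the same 'story-body' lookup can be extended and, since only one article per page matters, gated on the opening text the asker already knows. A minimal sketch against a stand-in snippet (the 'story-body' id follows the answer above; the sample markup and image paths are invented for illustration):

```python
from bs4 import BeautifulSoup

# Minimal stand-in for a BBC story page; the real pages are fetched
# with urllib as in the code above.
sample = """
<div id="story-body">
  <p>Manchester United manager Louis van Gaal confirms he wants to bring
  Cristiano Ronaldo back to Old Trafford.</p>
  <img src="/images/ronaldo.jpg">
  <img src="/images/van-gaal.jpg">
</div>
<div id="other-story"><img src="/images/unrelated.jpg"></div>
"""

lead = "Manchester United manager Louis van Gaal"
bs = BeautifulSoup(sample, "html.parser")
story = bs.find("div", attrs={"id": "story-body"})

# Keep the images only if the story actually starts with the expected lead,
# so unrelated articles on the same page are skipped.
images = []
if story is not None and story.get_text(strip=True).startswith(lead):
    images = [img["src"] for img in story.find_all("img")]

print(images)
```

Against the real pages, `sample` would be replaced by the fetched HTML, and the matched `src` values could then be appended to the dataframe row for that website.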
I'm practicing with BeautifulSoup and HTML requests in general for the first time. The goal of the program is to load a webpage and its HTML, then search through the webpage (in this case a recipe, to get a substring containing its ingredients). I've managed to get it working with the following code:
import requests

url = "https://www.bbcgoodfood.com/recipes/healthy-tikka-masala"
result = requests.get(url)
myHTML = result.text

index1 = myHTML.find("recipeIngredient")
index2 = myHTML.find("recipeInstructions")
ingredients = myHTML[index1:index2]
But when I try and use BeautifulSoup here:
import requests
from bs4 import BeautifulSoup

url = "https://www.bbcgoodfood.com/recipes/healthy-tikka-masala"
result = requests.get(url)
doc = BeautifulSoup(result.text, "html.parser")
ingredients = doc.find(text="recipeIngredient")
print(ingredients)
I understand that the code above (even if I could get it working) would produce a different output of just ["recipeIngredient"], but that's all I'm focused on for now while I get to grips with BS. Instead, the code above just outputs None. I printed "doc" to the terminal and it only output what appears to be the second half of the HTML (or at least not all of it), whereas the text file does contain all the HTML, so I assume that's where the problem lies, but I'm not sure how to fix it.
Thank you.
You need to use:
class_="recipe__ingredients"
For example:
import requests
from bs4 import BeautifulSoup
url = "https://www.bbcgoodfood.com/recipes/healthy-tikka-masala"
doc = (
    BeautifulSoup(requests.get(url).text, "html.parser")
    .find(class_="recipe__ingredients")
)

ingredients = "\n".join(
    ingredient.getText() for ingredient in doc.find_all("li")
)
print(ingredients)
Output:
1 large onion , chopped
4 large garlic cloves
thumb-sized piece of ginger
2 tbsp rapeseed oil
4 small skinless chicken breasts, cut into chunks
2 tbsp tikka spice powder
1 tsp cayenne pepper
400g can chopped tomatoes
40g ground almonds
200g spinach
3 tbsp fat-free natural yogurt
½ small bunch of coriander , chopped
brown basmati rice , to serve
It outputs None because find(text="recipeIngredient") looks for a tag whose text is exactly 'recipeIngredient', which does not exist: that string never appears as the text content of an HTML tag.
What you are actually trying to do with bs4 is find the specific tags and/or attributes that hold the data/content you want. For example, as @baduker points out, the ingredients in the HTML are inside the tag with the class attribute "recipe__ingredients".
The string 'recipeIngredient' that you pull out in the first block of code actually comes from a <script> tag in the HTML, which holds the ingredients in JSON format.
from bs4 import BeautifulSoup
import requests
import json
url = "https://www.bbcgoodfood.com/recipes/healthy-tikka-masala"
result = requests.get(url)
doc = BeautifulSoup(result.text, "html.parser")
ingredients = doc.find('script', {'type':'application/ld+json'}).text
jsonData = json.loads(ingredients)
print(jsonData['recipeIngredient'])
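The same JSON-LD lookup can be demonstrated offline. The sketch below parses a minimal stand-in for the page's script block (the two ingredients are copied from the output above; the rest of the real JSON is trimmed away):

```python
from bs4 import BeautifulSoup
import json

# Minimal stand-in for the page's JSON-LD <script> block.
sample = '''<script type="application/ld+json">
{"@type": "Recipe", "recipeIngredient": ["1 large onion , chopped", "200g spinach"]}
</script>'''

doc = BeautifulSoup(sample, "html.parser")
# The script tag's text is plain JSON, so it can be handed to json.loads.
data = json.loads(doc.find('script', {'type': 'application/ld+json'}).text)
print(data['recipeIngredient'])
```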
I'm fairly new to web scraping in Python, and after reading most of the tutorials on the topic online, I decided to give it a shot. I finally got one site working, but the output is not formatted properly.
import requests
import bs4
from bs4 import BeautifulSoup
import pandas as pd
import time
page = requests.get("https://leeweebrothers.com/our-food/lunch-boxes/#")
soup = BeautifulSoup(page.text, "html.parser")
for div in soup.find_all('h2'):  # prints the name of the food
    print(div.text)

for a in soup.find_all('span', {'class': 'amount'}):  # prints the price of the food
    print(a.text)
Output: all the food names print first, followed by all the prices in a separate block.
I want the name of each food printed side by side with its corresponding price, joined by a "-" ... Would appreciate any help given, thanks!
Edit: After @Reblochon Masque's comments below, I've run into another problem. As you can see, there is a $0.00, which is a value from the built-in shopping cart on the website. How would I exclude this as an outlier and continue down the loop, while ensuring that the other prices "move up" to correspond to the correct food?
Best practice is to use the zip function in the for loop, but we can also do it this way. This is just to show that it can be done by indexing into the two lists.
names = soup.find_all('h2')
rest = soup.find_all('span', {'class': 'amount'})

for index in range(len(names)):
    print('{} - {}'.format(names[index].text, rest[index].text))
You could maybe zip the two results:
names = soup.find_all('h2')
rest = soup.find_all('span', {'class': 'amount'})

for div, a in zip(names, rest):
    print('{} - {}'.format(div.text, a.text))
    # print(f"{div.text} - {a.text}")  # for python >= 3.6
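As for the $0.00 mentioned in the edit: since the cart widget reuses the same span markup as the prices, one option is to filter that value out before pairing. A sketch against a stand-in snippet (the cart markup and the prices here are invented; on the real page you may prefer to scope the search to the menu container instead, since a genuinely free item would also be dropped by this filter):

```python
from bs4 import BeautifulSoup

# Stand-in for the page: the cart widget also uses <span class="amount">,
# which is what produces the stray $0.00 among the scraped prices.
sample = """
<div class="cart"><span class="amount">$0.00</span></div>
<h2>Lunch Box A</h2><span class="amount">$5.00</span>
<h2>Lunch Box B</h2><span class="amount">$6.50</span>
"""

soup = BeautifulSoup(sample, "html.parser")
names = soup.find_all("h2")
prices = [a for a in soup.find_all("span", {"class": "amount"})
          if a.text != "$0.00"]  # drop the cart total before zipping

pairs = [f"{h.text} - {p.text}" for h, p in zip(names, prices)]
print(pairs)
```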
Using beautifulsoup I'm able to scrape a web page with this code:
import requests
from bs4 import BeautifulSoup

page = requests.get("http://www.acbbroker.it/soci_dettaglio.php?r=3")
soup = BeautifulSoup(page.content, 'html.parser')

test = soup.find(id="paginainterna-content")
test_items = test.find_all(class_="entry-content")
tonight = test_items[0]

names = []
for x in tonight.find_all('a', itemprop="url"):
    names.append(str(x))

print(names)
but I'm not able to clean the results and obtain only the text inside the <a> tag (also removing the href).
Here is a small snap of my result:
'A&B; Insurance e Reinsurance Brokers Srl', 'A.B.A. BROKERS SRL', 'ABC SRL BROKER E CONSULENTI DI ASSI.NE', 'AEGIS INTERMEDIA SAS',
What is the proper way to handle this kind of data and obtain a clean result?
Thank you
If you want only the text from a tag, use the get_text() method:
for x in tonight.find_all('a', itemprop="url"):
    names.append(x.get_text())

print(names)
Better still, use a list comprehension; it is also the fastest option:
names = [x.get_text() for x in tonight.find_all('a', itemprop='url')]
I don't know exactly what output you want, but you can get the text by changing this line:
names.append(str(x.get_text()))
I have the following Python code running in my Jupyter notebook:

from lxml.html import parse

tree = parse('http://www.imdb.com/chart/top')
movies = tree.findall('.//table[@class="chart full-width"]//td[@class="titleColumn"]//a')
movies[0].text_content()
The above code gives me the following output:
'The Shawshank Redemption'
Basically, it is the content of the first row of the column named 'titleColumn' on that webpage. The same table has another column called 'posterColumn' which contains a thumbnail image.
Now I want my code to retrieve those images and show them in the output.
Do I need another package to achieve this? Can the image be shown in a Jupyter notebook?
To get the associated images, you need to get the posterColumn. From this you can extract the img src entry and pull the jpg images. Each file can then be saved under the movie title, taking care to remove any characters that are invalid in filenames, such as ':':
from lxml.html import parse
import requests
import string

valid_chars = "-_.() " + string.ascii_letters + string.digits

tree = parse('http://www.imdb.com/chart/top')
movies = tree.findall('.//table[@class="chart full-width"]//td[@class="titleColumn"]//a')
posters = tree.findall('.//table[@class="chart full-width"]//td[@class="posterColumn"]//a')

for p, m in zip(posters, movies):
    for element, attribute, link, pos in p.iterlinks():
        if attribute == 'src':
            print("{:50} {}".format(m.text_content(), link))
            poster_jpg = requests.get(link, stream=True)
            valid_filename = ''.join(c for c in m.text_content() if c in valid_chars)
            with open('{}.jpg'.format(valid_filename), 'wb') as f_jpg:
                for chunk in poster_jpg:
                    f_jpg.write(chunk)
So currently you would see something starting as:
The Shawshank Redemption https://images-na.ssl-images-amazon.com/images/M/MV5BODU4MjU4NjIwNl5BMl5BanBnXkFtZTgwMDU2MjEyMDE@._V1_UY67_CR0,0,45,67_AL_.jpg
The Godfather https://images-na.ssl-images-amazon.com/images/M/MV5BZTRmNjQ1ZDYtNDgzMy00OGE0LWE4N2YtNTkzNWQ5ZDhlNGJmL2ltYWdlL2ltYWdlXkEyXkFqcGdeQXVyNjU0OTQ0OTY@._V1_UY67_CR1,0,45,67_AL_.jpg
The Godfather: Part II https://images-na.ssl-images-amazon.com/images/M/MV5BMjZiNzIxNTQtNDc5Zi00YWY1LThkMTctMDgzYjY4YjI1YmQyL2ltYWdlL2ltYWdlXkEyXkFqcGdeQXVyNjU0OTQ0OTY@._V1_UY67_CR1,0,45,67_AL_.jpg
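To actually show a poster inside the notebook, as the question asks, IPython's display machinery can render an image straight from its URL. A small sketch (the URL below is a placeholder, not a real poster link):

```python
from IPython.display import Image, display

# Placeholder for a poster link obtained from the loop above.
poster_url = "https://images-na.ssl-images-amazon.com/images/M/example.jpg"

img = Image(url=poster_url, width=100)  # width in pixels, optional
display(img)  # renders the image inline in a Jupyter cell
```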
I tried to write a simple Python program with the imdb package to extract movie information from their database, but I do not know why the code returns an empty list. My guess is that the way I extract the url info from the website (by using (.*?)) is wrong. I want to extract a url link from the webpage. Here's the code. Thanks!
import urllib
import re
import imdb

imdb_access = imdb.IMDb()
top_num = 5
movie_list = ["The Matrix", "The Matrix", "The Matrix", "The Matrix", "The Matrix"]

for x in xrange(0, top_num):
    contain = imdb_access.search_movie(movie_list[x])
    ID = contain[0].movieID  # str type
    htmltext = (urllib.urlopen("http://www.imdb.com/title/tt0133093/?ref_=nv_sr_1")).read()

    # a pattern in the website
    regex = '<img alt="The Matrix Poster" title="The Matrix Poster" src="(.*?)" itemprop="image">'
    pattern = re.compile(regex)
    # print(str(pattern))
    result = re.findall(pattern, htmltext)
    print result
    # print type(htmltext)
I think the problem is with the newlines; can you try (.*\n*.*?) instead?
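The same idea can be expressed with re.DOTALL, which makes '.' match newlines so the pattern can span lines. A sketch against an inline snippet (the tag layout and URL here are invented; the real IMDb markup has changed since, and an HTML parser such as BeautifulSoup is generally more robust than a regex for this):

```python
import re

# Stand-in for a fetched page where the img tag wraps across two lines.
html = '''<img alt="The Matrix Poster" title="The Matrix Poster"
src="https://example.com/matrix.jpg" itemprop="image">'''

# Without re.DOTALL the '.*?' before src= cannot cross the newline,
# so findall would return an empty list.
pattern = re.compile(
    r'<img alt="The Matrix Poster".*?src="(.*?)" itemprop="image">',
    re.DOTALL,
)
print(pattern.findall(html))
```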