Having issues scraping the image from a website using bs4 - python

Hey, I can't seem to scrape the images from this website:
https://www.nike.com/gb/w/new-mens-shoes-3n82yznik1zy7ok
I am using the following code:
product.find('img', {'class': 'css-1fxh5tw product-card__hero-image'})['src']
It returns this:
data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7

Your code is not wrong. That base64 string is the 1x1 transparent GIF the site uses as a lazy-loading placeholder; the real image URLs are on the img tags whose src is not a data URI. I extracted the images like this:
import requests
from bs4 import BeautifulSoup

url = "https://www.nike.com/gb/w/new-mens-shoes-3n82yznik1zy7ok"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml')
images = soup.find_all('img', {'class': 'css-1fxh5tw product-card__hero-image'}, src=True)
for i in images:
    # skip the inline base64 placeholders and keep the real image URLs
    if 'data:image' not in i['src']:
        print(i['src'])
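If the goal is to save the pictures rather than just print their URLs, a minimal follow-up sketch could look like the following (the shoe_<n>.jpg filename scheme is just an assumption made for illustration):
import requests
from bs4 import BeautifulSoup

url = "https://www.nike.com/gb/w/new-mens-shoes-3n82yznik1zy7ok"
soup = BeautifulSoup(requests.get(url).text, 'lxml')
images = soup.find_all('img', {'class': 'css-1fxh5tw product-card__hero-image'}, src=True)

for n, img in enumerate(images):
    src = img['src']
    if 'data:image' in src:  # skip the base64 lazy-loading placeholders
        continue
    # fetch the image bytes and write them to disk; the filename is arbitrary
    data = requests.get(src).content
    with open(f'shoe_{n}.jpg', 'wb') as f:
        f.write(data)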

Related

how to put web scraped data into a list

This is the code I used to get the data from a website that lists all the possible Wordle words. I'm trying to put them in a list so I can create a Wordle clone, but I get a weird output when I do this. Please help.
import requests
from bs4 import BeautifulSoup
url = "https://raw.githubusercontent.com/tabatkins/wordle-list/main/words"
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
word_list = list(soup)
You do not need BeautifulSoup; simply split the text of the response:
import requests
url = "https://raw.githubusercontent.com/tabatkins/wordle-list/main/words"
requests.get(url).text.split()
Or if you would like to do it with BeautifulSoup anyway:
import requests
from bs4 import BeautifulSoup
url = "https://raw.githubusercontent.com/tabatkins/wordle-list/main/words"
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
soup.text.split()
Output:
['women',
'nikau',
'swack',
'feens',
'fyles',
'poled',
'clags',
'starn',...]
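Since the stated goal is a Wordle clone, here is a minimal sketch of how that list might be used; the guess-checking logic is only an illustrative assumption, not part of the original question:
import random
import requests

url = "https://raw.githubusercontent.com/tabatkins/wordle-list/main/words"
word_list = requests.get(url).text.split()

# pick a secret word and validate a guess against the list
secret = random.choice(word_list)
guess = "crane"
if guess in word_list:
    print("valid guess; correct?", guess == secret)
else:
    print("not in the word list")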

How to scrape main headings of a website using python in colab?

Hi, I am a beginner and would like to get the list of all datasets from the website 'https://www.kaggle.com/datasets', using the filters 'csv' and 'only datasets with tasks'.
I applied the filters and inspected the element, but my attempt returns an empty list. This is my code:
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = 'https://www.kaggle.com/datasets?sort=usability&fileType=csv&tasks=true'
html = urlopen(url)
soup = BeautifulSoup(html.read(), 'lxml')
titles = soup.find_all('li')
print(titles)
Can anyone help?
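For what it's worth, even with the parser call made consistent, the Kaggle dataset listing is very likely rendered client-side with JavaScript, so the static HTML returned to urlopen or requests may simply not contain the dataset cards. A quick hedged check (the 'li' selector comes from the question and is not verified against Kaggle's markup):
import requests
from bs4 import BeautifulSoup

url = 'https://www.kaggle.com/datasets?sort=usability&fileType=csv&tasks=true'
html = requests.get(url).text
soup = BeautifulSoup(html, 'lxml')

titles = soup.find_all('li')
# if this prints 0, the listing is built by JavaScript after page load and a
# plain HTTP fetch will never see it; a browser-automation tool or Kaggle's
# own API would be needed instead
print(len(titles))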

How to get specific urls from a website in a class tag with beautiful soup? (Python)

I'm trying to get the URLs of the main articles from a news outlet using Beautiful Soup. Since I do not want to get ALL of the links on the entire page, I specified the class. My code only manages to display the titles of the news articles, not the links. This is the website: https://www.reuters.com/news/us
Here is what I have so far:
import requests
from bs4 import BeautifulSoup
req = requests.get('https://www.reuters.com/news/us').text
soup = BeautifulSoup(req, 'html.parser')
links = soup.findAll("h3", {"class": "story-title"})
for i in links:
    print(i.get_text().strip())
    print()
Any help is greatly appreciated!
To get links to all the articles you can use the following code:
import requests
from bs4 import BeautifulSoup

req = requests.get('https://www.reuters.com/news/us').text
soup = BeautifulSoup(req, 'html.parser')
links = soup.findAll("div", {"class": "story-content"})
for i in links:
    print(i.a.get('href'))
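If the hrefs turn out to be relative paths, urljoin can resolve them into absolute URLs while keeping the title alongside each link. This is a sketch that assumes each story-title heading sits inside its story-content div (not verified against the live markup):
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

base = 'https://www.reuters.com/news/us'
soup = BeautifulSoup(requests.get(base).text, 'html.parser')

for story in soup.findAll("div", {"class": "story-content"}):
    title = story.find("h3", {"class": "story-title"})
    link = story.a.get('href') if story.a else None
    if title and link:
        # urljoin resolves relative hrefs like "/article/..." against the base URL
        print(title.get_text().strip(), urljoin(base, link))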

Download flickr images using Python

from urllib.request import urlopen
from bs4 import BeautifulSoup
import random
google = "https://www.flickr.com/search/?text=Nike"
page = urlopen(google).read()
soup = BeautifulSoup(page, "html.parser")
img_tags = soup.find_all('img')
Now I'm stuck.
I'm trying to download all the images on the page; can someone help me?
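Picking up from img_tags, a minimal sketch of the download step; the handling of protocol-relative "//" URLs and the flickr_<n>.jpg filename scheme are assumptions made for illustration, not verified against Flickr's markup:
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "https://www.flickr.com/search/?text=Nike"
page = urlopen(url).read()
soup = BeautifulSoup(page, "html.parser")
img_tags = soup.find_all('img')

for n, img in enumerate(img_tags):
    src = img.get('src')
    if not src or src.startswith('data:'):
        continue
    # Flickr often serves protocol-relative URLs ("//live.staticflickr.com/...");
    # prepend a scheme so urlopen accepts them (assumption about the markup)
    if src.startswith('//'):
        src = 'https:' + src
    with open(f'flickr_{n}.jpg', 'wb') as f:
        f.write(urlopen(src).read())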

Extracting image caption and image url using BeautifulSoup

I am trying to extract the image URL and image caption from an article using BeautifulSoup. I can separate the article's image URL and image caption from the preceding and following HTML, but I cannot figure out how to separate these two from their HTML tags. Here is my code:
from bs4 import BeautifulSoup
import requests
url = 'http://www.prnewswire.com/news-releases/dutch-philosopher-koert-van-mensvoort-founder-of-the-next-nature-network-writes-a-letter-to-humanity-619925063.html'
r = requests.get(url)
html = r.text
soup = BeautifulSoup(html, 'lxml')
links = soup.find_all('div', {'class': 'image'})
The two sections I am trying to extract are the src= and the title= sections. Any ideas on how to accomplish these two parses would be appreciated.
from bs4 import BeautifulSoup
import requests
url = 'http://www.prnewswire.com/news-releases/dutch-philosopher-koert-van-mensvoort-founder-of-the-next-nature-network-writes-a-letter-to-humanity-619925063.html'
r = requests.get(url)
html = r.text
soup = BeautifulSoup(html, 'lxml')
links = soup.find_all('div', {'class': 'image'})
print([i.find('img')['src'] for i in links])
print([i.find('img')['title'] for i in links])
Try the following to extract all the image tags:
imgs = soup.findAll('img')
# findAll returns a list of tags, so loop through it rather than calling
# .get() on the list itself
for img in imgs:
    src = img.get('src')
    title = img.get('title')
Late answer, but you can use:
from bs4 import BeautifulSoup
import requests
url = 'http://www.prnewswire.com/news-releases/dutch-philosopher-koert-van-mensvoort-founder-of-the-next-nature-network-writes-a-letter-to-humanity-619925063.html'
r = requests.get(url)
html = r.text
soup = BeautifulSoup(html, "html5lib")
links = soup.find_all('div', {'class': 'image'})
if links:
    print(links[0].find('img')['src'])
    print(links[0].find('img')['title'])
Output:
http://mma.prnewswire.com/media/491859/Koert_van_Mensvoort.jpg?w=950
Dutch philosopher Koert van Mensvoort – founder of the Next Nature
Network and Fellow of ‘Next Nature’ at the University of Technology in
Eindhoven – has written a ‘Letter to Humanity’ in support of
International Earth Day. (PRNewsfoto/Next Nature Network)
