html content changes when using beautifulSoup

I am trying to extract the value of the src attribute from a block of HTML. The HTML block is:
<img class="product-image first-image" src="https://cache.net-a-porter.com/images/products/1083507/1083507_in_pp.jpg">
My code is:
import requests
import json
from bs4 import BeautifulSoup
import re
headers = {'User-agent': 'Mozilla/5.0'}
url = 'https://www.net-a-porter.com/us/en/product/1083507/maje/layered-plaid-twill-and-stretch-cotton-jersey-top'
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
if url.find('net-a-porter') != -1:
    i = soup.find_all('img', class_="product-image first-image")[0]["src"]
    print(i)
The result I get:
//cache.net-a-porter.com/images/products/1083507/1083507_in_xs.jpg
But I want to get exactly what is in the original HTML, which should be:
https://cache.net-a-porter.com/images/products/1083507/1083507_in_pp.jpg
My result differs from the original src value: the https: prefix is gone, and 1083507_in_pp has changed to 1083507_in_xs. I don't know why this happens. Does anyone know how to solve it? Thanks!

You are close. However, you need to access the "src" key from the tag's built-in attrs dictionary:
if url.find('net-a-porter') != -1:
    i = soup.find_all('img', class_="product-image first-image")[0]
    print(i['src'])
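As a quick check of what indexing a tag does, here is a minimal sketch that parses the img block from the question as a literal string. It makes no network request, so it says nothing about what the live page returns. Indexing the tag with ["src"] and reading attrs["src"] give the same value, copied unchanged from the HTML that was parsed. The variable names are illustrative.

```python
from bs4 import BeautifulSoup

# The HTML block quoted in the question, used here as a literal string.
html = (
    '<img class="product-image first-image" '
    'src="https://cache.net-a-porter.com/images/products/1083507/1083507_in_pp.jpg">'
)

soup = BeautifulSoup(html, "html.parser")
img = soup.find("img", class_="product-image first-image")

# Both lookups read the same entry of the tag's attrs dictionary.
src_by_index = img["src"]
src_by_attrs = img.attrs["src"]
print(src_by_index)
```

If the value printed for the live page differs from what the browser shows, printing page.content would show which HTML the server actually sent to the script.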

Related

Can't scrape <h3> tag from page

It seems I can scrape any tag and class except h3 on this page. It keeps returning None or an empty list. I'm trying to get this h3 tag:
...on the following webpage:
https://www.empireonline.com/movies/features/best-movies-2/
And this is the code I use:
import requests
from pprint import pprint
from bs4 import BeautifulSoup
URL = "https://www.empireonline.com/movies/features/best-movies-2/"
response = requests.get(URL)
web_html = response.text
soup = BeautifulSoup(web_html, "html.parser")
movies = soup.findAll(name="h3", class_="jsx-4245974604")
movies_text = []
for item in movies:
    result = item.getText()
    movies_text.append(result)
print(movies_text)
Can you please help with the solution for this problem?

As other people mentioned, this is dynamic content, which is generated only when the page runs in a browser. That is why you can't find the class "jsx-4245974604" with BS4.
If you print out your "soup" variable, you can see that the class is not there. But if you simply want the names of the movies, you can use another part of the HTML in this case.
The movie name is in the alt attribute of the picture (and also in several other parts of the HTML).
import requests
from pprint import pprint
from bs4 import BeautifulSoup
URL = "https://www.empireonline.com/movies/features/best-movies-2/"
response = requests.get(URL)
web_html = response.text
soup = BeautifulSoup(web_html, "html.parser")
movies = soup.findAll("img", class_="jsx-952983560")
movies_text=[]
for item in movies:
    result = item.get('alt')
    movies_text.append(result)
print(movies_text)
If you run into this issue in the future, print out the initial HTML you get in soup and check by eye whether the information you need is in it.
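The check can also be done in code rather than by eye. The sketch below is only an illustration: web_html is a hypothetical stand-in for response.text (the real page's markup will differ), and appears_in_html is a helper name invented for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for response.text; the real page's markup will differ.
web_html = """
<div id="root">
  <img class="jsx-952983560" alt="Movie A">
  <img class="jsx-952983560" alt="Movie B">
</div>
"""


def appears_in_html(html, needle):
    """Return True if the raw HTML text contains the given string."""
    return needle in html


# In this sample the h3 class is absent and the img class is present.
h3_class_present = appears_in_html(web_html, "jsx-4245974604")
img_class_present = appears_in_html(web_html, "jsx-952983560")

soup = BeautifulSoup(web_html, "html.parser")
titles = [img.get("alt") for img in soup.find_all("img", class_="jsx-952983560")]
print(h3_class_present, img_class_present, titles)
```

If a class name is missing from the raw text, no BeautifulSoup query on that text will find it.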

Not getting json when using .text in bs4

I think I made a mistake somewhere in this code, because I'm not getting the JSON when I print it. When I index the script list I get the script tag containing the JSON, but when I use .text nothing appears. I want the JSON alone.
Code:
from bs4 import BeautifulSoup
from urllib.parse import quote_plus
import requests
import selenium.webdriver as webdriver
base_url = 'https://www.instagram.com/{}'
search = input('Enter the instagram account: ')
final_url = base_url.format(quote_plus(search))
response = requests.get(final_url)
print(response.status_code)
if response.ok:
    html = response.text
    bs_html = BeautifulSoup(html)
    scripts = bs_html.select('script[type="application/ld+json"]')
    print(scripts[0].text)

Change the line print(scripts[0].text) to print(scripts[0].string).
scripts[0] is a Beautiful Soup Tag object, and its string contents can be accessed through the .string property.
Source: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#string
If you then want to parse the string as JSON so that you can access the data, you can do something like this:
import json  # needed for json.loads
...
if response.ok:
    html = response.text
    bs_html = BeautifulSoup(html)
    scripts = bs_html.select('script[type="application/ld+json"]')
    json_output = json.loads(scripts[0].string)
Then, for example, if you run print(json_output['name']) you should be able to access the name on the account.
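A self-contained sketch of the same steps, run against a literal HTML string instead of a live request. The markup and the "name" value are hypothetical; the real page's script contents may look different.

```python
import json

from bs4 import BeautifulSoup

# Hypothetical page source; the real markup and fields may differ.
html = """
<html><head>
<script type="application/ld+json">{"name": "Example Account", "@type": "Person"}</script>
</head><body></body></html>
"""

bs_html = BeautifulSoup(html, "html.parser")
scripts = bs_html.select('script[type="application/ld+json"]')

# Guard against a missing script tag or an empty one before parsing.
json_output = {}
if scripts and scripts[0].string:
    json_output = json.loads(scripts[0].string)

print(json_output.get("name"))
```

The guard is a design choice for this sketch: select() returns an empty list when nothing matches, so indexing scripts[0] directly would raise an IndexError.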

BeautifulSoup findAll returns empty list when selecting class

findAll() returns an empty list when I specify a class.
Specifying only the tag works fine.
import urllib2
from bs4 import BeautifulSoup
url = "https://www.reddit.com/r/Showerthoughts/top/?sort=top&t=week"
hdr = { 'User-Agent' : 'tempro' }
req = urllib2.Request(url, headers=hdr)
htmlpage = urllib2.urlopen(req).read()
BeautifulSoupFormat = BeautifulSoup(htmlpage,'lxml')
name_box = BeautifulSoupFormat.findAll("a",{'class':'title'})
for data in name_box:
    print(data.text)
I'm trying to get only the text of the post. The current code prints out nothing. If I remove the {'class':'title'} it prints out the post text as well as the username and comments of the post, which I don't want.
I'm using Python 2 with the latest versions of BeautifulSoup and urllib2.

To get all the comments you will need a tool such as Selenium that can scroll the page. Without that, to get just the initial results, you can extract them from a script tag in the requests response:
import requests
from bs4 import BeautifulSoup as bs
import re
import json
headers = {'User-Agent' : 'Mozilla/5.0'}
r = requests.get('https://www.reddit.com/r/Showerthoughts/top/?sort=top&t=week', headers = headers)
soup = bs(r.content, 'lxml')
script = soup.select_one('#data').text
p = re.compile(r'window.___r = (.*); window')
data = json.loads(p.findall(script)[0])
for item in data['posts']['models']:
    print(data['posts']['models'][item]['title'])
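The extraction step can be tried without a network request. In this sketch the script text is a made-up sample shaped like the assignment the regex above targets; the keys t3_a and t3_b and the titles are invented, and the pattern here escapes the dot and uses a non-greedy group, which is a small variation on the one above.

```python
import json
import re

# Hypothetical script text; the real page embeds a much larger object.
script = (
    'window.___r = {"posts": {"models": {"t3_a": {"title": "First"}, '
    '"t3_b": {"title": "Second"}}}}; window.___prefetches = [];'
)

# Capture everything between the assignment and the next "; window".
pattern = re.compile(r"window\.___r = (.*?); window")
match = pattern.search(script)

titles = []
if match:
    data = json.loads(match.group(1))
    titles = [post["title"] for post in data["posts"]["models"].values()]

print(titles)
```

This approach depends on the exact variable name and layout of the embedded script, so it can break whenever the site changes its markup.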

The selector you are using does not work because those posts do not have class="title". Try this instead:
name_box = BeautifulSoupFormat.select('a[data-click-id="body"] > h2')
This selects each <h2> tag that is a direct child of an <a data-click-id="body"> element; those <h2> tags contain the post text you need.
You can read more about CSS selectors in BeautifulSoup here:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors
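To see what the child selector matches, here is a small sketch on hypothetical markup. The attribute values and link targets are invented for illustration and are not taken from the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; only the data-click-id values matter for the selector.
html = """
<div>
  <a data-click-id="body" href="/r/x/1"><h2>First post</h2></a>
  <a data-click-id="comments" href="/r/x/1#c"><h2>12 comments</h2></a>
  <a data-click-id="body" href="/r/x/2"><h2>Second post</h2></a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# "a[...] > h2" keeps only h2 tags whose parent is a matching <a> element.
name_box = soup.select('a[data-click-id="body"] > h2')
titles = [h2.get_text(strip=True) for h2 in name_box]
print(titles)
```

The h2 under the "comments" link is skipped because its parent's data-click-id does not match.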

Web scraping with beautifulsoup not finding anything

I'm trying to scrape coinmarketcap.com just to get an update of a certain currency price, also just to learn how to web scrape. I'm still a beginner and can't figure out where I'm going wrong, because whenever I try to run it, it just tells me there are none. Although I know that line does exist. Any help is appreciated!
import requests
from bs4 import BeautifulSoup
url = 'https://coinmarketcap.com/currencies/electroneum/'
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html, 'html.parser')
price = soup.find('data-currency-price data-usd=')
print (price)

If you are going to be doing a lot of this, consider making a single call to the official API to get all the prices, then extracting what you want. The following is from the site, with an amendment by me to show the desired value for Electroneum. The API guidance also shows how to retrieve one currency at a time, though that requires a higher plan than the basic one.
from requests import Request, Session
from requests.exceptions import ConnectionError, Timeout, TooManyRedirects
import json
url = 'https://pro-api.coinmarketcap.com/v1/cryptocurrency/listings/latest'
parameters = {
    'start': '1',
    'limit': '5000',
    'convert': 'USD',
}
headers = {
    'Accepts': 'application/json',
    'X-CMC_PRO_API_KEY': 'yourKey',
}
session = Session()
session.headers.update(headers)
try:
    response = session.get(url, params=parameters)
    # print(response.text)
    data = json.loads(response.text)
    print(data['data'][64]['quote']['USD']['price'])
except (ConnectionError, Timeout, TooManyRedirects) as e:
    print(e)
You can always use a loop and check against a list of names you are interested in, e.g.
interested = ['Electroneum', 'Ethereum']
for item in data['data']:
    if item['name'] in interested:
        print(item)
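Looking items up by name also avoids a hard-coded position such as index 64, which depends on the order of the listing. A minimal sketch, assuming a payload shaped like the one accessed above; the sample prices are made up.

```python
# Hypothetical payload shaped like the 'data' list used above; values are made up.
data = {
    "data": [
        {"name": "Bitcoin", "quote": {"USD": {"price": 100.0}}},
        {"name": "Electroneum", "quote": {"USD": {"price": 0.5}}},
        {"name": "Ethereum", "quote": {"USD": {"price": 10.0}}},
    ]
}

interested = {"Electroneum", "Ethereum"}

# Map each requested name to its USD price, ignoring everything else.
prices = {
    item["name"]: item["quote"]["USD"]["price"]
    for item in data["data"]
    if item["name"] in interested
}
print(prices)
```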
For your current example:
You can use an attribute selector for data-currency-value
import requests
from bs4 import BeautifulSoup
url = 'https://coinmarketcap.com/currencies/electroneum/'
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html, 'html.parser')
print(soup.select_one('[data-currency-value]').text)

You can use the class attribute to get the value.
import requests
from bs4 import BeautifulSoup
url = 'https://coinmarketcap.com/currencies/electroneum/'
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html, 'html.parser')
price = soup.find('span', attrs={"class": "h2 text-semi-bold details-panel-item--price__value"})
print(price.text)
Output :
0.006778

You can get the value like this:
import requests
from bs4 import BeautifulSoup
url = 'https://coinmarketcap.com/currencies/electroneum/'
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html, 'html.parser')
price = soup.find("span", id="quote_price").get('data-usd')
print (price)

You should be more specific about how you want to find the item.
You are currently calling soup.find() with the string data-currency-price data-usd=, and it is not clear what that is meant to match.
Is it an ID or a class name?
Why not try finding the item by its ID:
soup.find(id="link3")
or by tag name:
soup.find("relevant tag name like div or a")
or by both:
find_this = soup.find("a", id="ID HERE")
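The difference between these lookups can be seen on a small sample. The markup below is hypothetical, modeled on the attribute names and id quoted in this thread, and the price value is only a placeholder.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modeled on the names quoted in this thread.
html = '<span id="quote_price" data-currency-price data-usd="0.006778">0.006778</span>'
soup = BeautifulSoup(html, "html.parser")

# A bare string is treated as a tag name, so this finds nothing.
by_name = soup.find("data-currency-price data-usd=")

# Searching by id, or by the presence of an attribute, finds the span.
by_id = soup.find("span", id="quote_price")
by_attr = soup.find(attrs={"data-usd": True})

price = by_id.get("data-usd")
print(by_name, price)
```

Passing True as an attribute value in attrs matches any tag that has that attribute, whatever its value.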

import requests
from bs4 import BeautifulSoup
url = 'https://coinmarketcap.com/currencies/electroneum/'
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html, 'html.parser')
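# soup(...) is shorthand for find_all() and returns a list, so use find() to get a single tag
x = soup.find(id="quote_price").text
print(x)
Searching by ID is the better option; alternatively, search through soup.find_all(text="data-currency-price data-usd")[1].text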

Beautiful Soup leaves out tags

I would like to parse an HTML file with python, but BeautifulSoup leaves out some key tags.
The part of the HTML file on the website looks like this, with all of the children divs.
HTML snippet
But when using the beautifulsoup prettify function, it looks like this, without any of the children divs.
HTML snippet from python
The code I used is here:
from bs4 import BeautifulSoup
import urllib.request
#A random plus code, the %2B is just a +
PLUS_CODE = "792F7C4F%2B54"
url = "https://www.plus.codes/" + PLUS_CODE
hdr = {"User-Agent" : "Mozilla/5.0"}
req = urllib.request.Request(url, headers=hdr)
r = urllib.request.urlopen(req)
r_tags = r.read().decode('utf-8')
soup = BeautifulSoup(r_tags, "lxml")
print(soup.prettify())
What ends up happening is that I can't reach the children div and extract the text that I need.

Try 'lxml' instead of 'html.parser' in the BeautifulSoup method. Maybe that will solve the problem. If not, share some code.
