I am trying to get the body text of news articles like this one:
https://elpais.com/espana/2022-07-01/yolanda-diaz-lanza-su-proyecto-politico-sumar-y-convoca-el-primer-acto.html
In the source code, it can be found after "articleBody".
I've tried using bs4's BeautifulSoup, but it seems it cannot access the window object where the article body information lives. I am able to get the text using string functions:
text = re.search('"articleBody":"(.*)","keywords"', source_code).group(1)
Where source_code is a string containing the page source of the URL. However, this method looks pretty inefficient compared to using the bs4 methods when the page allows it. Any advice, please?
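For what it's worth, re.search returns a match object, so .group(1) is needed to pull out the captured text, and a non-greedy (.*?) guards against overshooting if "keywords" appears more than once. A minimal sketch on a synthetic snippet (not the real page source):

```python
import re

# Synthetic stand-in for the page source (assumption: the real page
# embeds a JSON blob containing these keys)
source_code = '{"@type":"NewsArticle","articleBody":"Example body text","keywords":"politica"}'

# Non-greedy capture stops at the first following "keywords"
match = re.search(r'"articleBody":"(.*?)","keywords"', source_code)
body = match.group(1) if match else None
print(body)  # Example body text
```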
You're right that BeautifulSoup cannot execute JavaScript, so it never sees anything that is only created at runtime in the window object. One option is to drive a real browser with Selenium. Here's an example of how to do so with Python 3:
from selenium import webdriver
from selenium.webdriver.common.by import By
import time

# Create a new instance of Chrome and go to the website we want to scrape
browser = webdriver.Chrome()
browser.get("http://www.elpais.com/")
time.sleep(5)  # Let the browser load

# Find the div element containing the article content
# (note: 'articleContent' is an assumed class name; verify it against the page's markup)
div = browser.find_element(By.CLASS_NAME, 'articleContent')

# Print out all the text inside the div
print(div.text)
Hope this helps!
Try:
import json
import requests
from bs4 import BeautifulSoup
url = "https://elpais.com/espana/2022-07-01/yolanda-diaz-lanza-su-proyecto-politico-sumar-y-convoca-el-primer-acto.html"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
for ld_json in soup.select('[type="application/ld+json"]'):
    data = json.loads(ld_json.text)
    if "@type" in data and "NewsArticle" in data["@type"]:
        break

print(data["articleBody"])
Prints:
A una semana de que arranque Sumar ...
Or:
text = soup.select_one('[data-dtm-region="articulo_cuerpo"]').get_text(strip=True)
print(text)
I was trying to parse an image link from a website.
When I inspect the link on the website, it is this one: https://static.nike.com/a/images/c_limit,w_592,f_auto/t_product_v1/df7c2668-f714-4ced-9f8f-1f0024f945a9/chaussure-de-basketball-zoom-freak-3-MZpJZF.png but when I parse it with my code, the output is data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7.
from bs4 import BeautifulSoup
import requests
source = requests.get('https://www.nike.com/fr/w/hommes-chaussures-nik1zy7ok').text
soup = BeautifulSoup(source, 'lxml')
pair = soup.find('div', class_='product-card__body')
image_scr = pair.find('img', class_='css-1fxh5tw product-card__hero-image')['src']
print(image_scr)
I don't think the code itself is the issue, but I don't know what's causing the link to come out in base64 format. So how could I get the code to return the link as a .png?
What happens?
First of all, take a look into your soup - there is the truth. The website does not provide all of its information statically; a lot of things are loaded dynamically and rendered by the browser, so requests won't get this info this way.
Workaround
Take a look at the <noscript> tag next to your selection; it holds a smaller version of the image and provides the src.
Example
from bs4 import BeautifulSoup
import requests
source = requests.get('https://www.nike.com/fr/w/hommes-chaussures-nik1zy7ok').content
soup = BeautifulSoup(source, 'lxml')
pair = soup.find('div', class_='product-card__body')
image_scr = pair.select_one('noscript img.css-1fxh5tw.product-card__hero-image')['src']
print(image_scr)
Output
https://static.nike.com/a/images/c_limit,w_318,f_auto/t_product_v1/df7c2668-f714-4ced-9f8f-1f0024f945a9/chaussure-de-basketball-zoom-freak-3-MZpJZF.png
If you would like a "big picture", just replace the parameter w_318 with w_1000.
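Since the width is just a URL parameter, a plain string replacement on the extracted src gives the larger rendition; a quick sketch using the URL from the output above:

```python
# src extracted in the snippet above (the w_318 segment controls the width)
small = ("https://static.nike.com/a/images/c_limit,w_318,f_auto/t_product_v1/"
         "df7c2668-f714-4ced-9f8f-1f0024f945a9/chaussure-de-basketball-zoom-freak-3-MZpJZF.png")

# Swap the width parameter for a larger one
big = small.replace("w_318", "w_1000")
print(big)
```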
Edit
Concerning your comment - there are a lot more solutions, but it still depends on what you want to do with the information and what you are going to work with.
The following approach uses Selenium, which, unlike requests, renders the website and gives you the "right" page source back, but it also needs more resources than requests:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

driver = webdriver.Chrome(service=Service(r'C:\Program Files\ChromeDriver\chromedriver.exe'))
driver.get('https://www.nike.com/fr/w/hommes-chaussures-nik1zy7ok')

soup = BeautifulSoup(driver.page_source, 'html.parser')
pair = soup.find('div', class_='product-card__body')
image_scr = pair.select_one('img.css-1fxh5tw.product-card__hero-image')['src']
print(image_scr)
Output
https://static.nike.com/a/images/c_limit,w_592,f_auto/t_product_v1/df7c2668-f714-4ced-9f8f-1f0024f945a9/chaussure-de-basketball-zoom-freak-3-MZpJZF.png
As you want to grab the src, meaning image data, and download it from the server using requests, you need to use the .content (bytes) form as follows:
source = requests.get('https://www.nike.com/fr/w/hommes-chaussures-nik1zy7ok').content
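To actually save the image to disk, the raw bytes from .content are what you want; decoding via .text would corrupt the binary data. A minimal sketch, assuming the src URL has already been extracted into image_scr:

```python
import requests

def save_image(url: str, path: str) -> None:
    # .content is bytes; an image file must be written in binary mode
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    with open(path, "wb") as f:
        f.write(resp.content)

# Usage (network required):
# save_image(image_scr, "shoe.png")
```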
This question follows this previous question. I want to scrape data from a betting site using Python. I first tried to follow this tutorial, but the problem is that the site tipico is not available from Switzerland. I thus chose another betting site: Winamax. In the tutorial, the tipico webpage is first inspected in order to find where the betting rates are located in the html file. In the tipico webpage, they were stored in buttons of class "c_but_base c_but". By writing the following lines, the rates could therefore be saved and printed using the BeautifulSoup module:
from bs4 import BeautifulSoup
import urllib.request
import re
url = "https://www.tipico.de/de/live-wetten/"
try:
    page = urllib.request.urlopen(url)
except:
    print("An error occurred.")

soup = BeautifulSoup(page, 'html.parser')
regex = re.compile('c_but_base c_but')
content_lis = soup.find_all('button', attrs={'class': regex})
print(content_lis)
I thus tried to do the same with the webpage Winamax. I inspected the page and found that the betting rates were stored in buttons of class "ui-touchlink-needsclick price odd-price". See the code below:
from bs4 import BeautifulSoup
import urllib.request
import re
url = "https://www.winamax.fr/paris-sportifs/sports/1/7/4"
try:
    page = urllib.request.urlopen(url)
except Exception as e:
    print(f"An error occurred: {e}")
soup = BeautifulSoup(page, 'html.parser')
regex = re.compile('ui-touchlink-needsclick price odd-price')
content_lis = soup.find_all('button', attrs={'class': regex})
print(content_lis)
The problem is that it prints nothing: Python does not find any elements of that class (right?). I thus tried to print the soup object in order to see what the BeautifulSoup function was exactly doing. I added this line:
print(soup)
When printing it (I do not show the print of soup because it is too long), I notice that it is not the same text as what appears when I right-click "inspect" on the Winamax webpage. So what is the BeautifulSoup function exactly doing? How can I store the betting rates from the Winamax website using BeautifulSoup?
EDIT: I have never coded in html and I'm a beginner in Python, so some terminology might be wrong, that's why some parts are in italics.
That's because the website is using JavaScript to display these details, and BeautifulSoup does not interact with JS on its own.
First, try to find out whether the element you want to scrape is present in the page source; if it is, you can scrape pretty much everything! In your case, the button/span tags were not in the page source (meaning they are hidden or pulled in through a script).
There is no <button> tag in the page source.
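A quick way to run that check yourself is a plain substring test on the raw HTML. Here it is shown on a synthetic page source; for the real site you would use requests.get(url).text instead:

```python
# Synthetic stand-in for a JS-rendered page's initial HTML
page_source = "<html><body><div id='app'></div><script src='bundle.js'></script></body></html>"

# If the tag is absent from the static source, it is injected by JavaScript
print("<button" in page_source)  # False
```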
So I suggest using Selenium as the solution, and I tried a basic scrape of the website.
Here is the code I used :
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

option = webdriver.ChromeOptions()
option.add_argument('--headless')
option.binary_location = r'Your chrome.exe file path'

browser = webdriver.Chrome(service=Service(r'Your chromedriver.exe file path'), options=option)
browser.get(r"https://www.winamax.fr/paris-sportifs/sports/1/7/4")

span_tags = browser.find_elements(By.TAG_NAME, 'span')
for span_tag in span_tags:
    print(span_tag.text)

browser.quit()
The output contains some junk data as well, but it's up to you to figure out what you need and what you don't!
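One way to separate the odds from the junk is to keep only the span texts that look like decimal numbers. A sketch on hypothetical span texts (in practice these would come from span_tags above):

```python
import re

# Hypothetical span texts; in practice collect them from the browser
span_texts = ["Winamax", "1.85", "Paris Sportifs", "2.40", "3,10", "Live"]

# Betting odds look like "1.85" or "3,10" (dot or comma decimal separator)
odds_pattern = re.compile(r"^\d+[.,]\d{1,2}$")
odds = [t for t in span_texts if odds_pattern.match(t)]
print(odds)  # ['1.85', '2.40', '3,10']
```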
I am trying to get a value from a webpage. In the source code of the webpage, the data is in CDATA format and also comes from jQuery. I have managed to write the code below, which gets a large amount of text, where index 21 contains the information I need. However, this output is large and not in a format I understand. Within the output I need to isolate and output "redshift":"0.06" but don't know how. What is the best way to solve this?
import requests
from bs4 import BeautifulSoup
link = "https://wis-tns.weizmann.ac.il/object/2020aclx"
html = requests.get(link).text
soup = BeautifulSoup(html, "html.parser")
res = soup.findAll('b')
print(soup.find_all('script')[21])
It can be done using the current approach you have. However, I'd advise against it. There's a neater way to do it by observing that the redshift value is present in a few convenient places on the page itself.
The following approach should work for you. It looks for tables on the page with the class "atreps-results-table" -- of which there are two. We take the second such table and look for the table cell with the class "cell-redshift". Then, we just print out its text content.
from bs4 import BeautifulSoup
import requests
link = 'https://wis-tns.weizmann.ac.il/object/2020aclx'
html = requests.get(link).text
soup = BeautifulSoup(html, 'html.parser')
tab = soup.find_all('table', {'class': 'atreps-results-table'})[1]
redshift = tab.find('td', {'class': 'cell-redshift'})
print(redshift.text)
Try simply:
soup.select_one('div.field-redshift > div.value>b').text
If you view the Page Source of the URL, you will find that there are two script elements that are having CDATA. But the script element in which you are interested has jQuery in it. So you have to select the script element based on this knowledge. After that, you need to do some cleaning to get rid of CDATA tags and jQuery. Then with the help of json library, convert JSON data to Python Dictionary.
import requests
from bs4 import BeautifulSoup
import json
page = requests.get('https://wis-tns.weizmann.ac.il/object/2020aclx')
htmlpage = BeautifulSoup(page.text, 'html.parser')
scriptelements = htmlpage.find_all('script')
for script in scriptelements:
    if 'CDATA' in script.text and 'jQuery' in script.text:
        scriptcontent = script.text.replace('<!--//--><![CDATA[//>', '').replace('<!--', '').replace('//--><!]]>', '').replace('jQuery.extend(Drupal.settings,', '').replace(');', '')
        break
jsondata = json.loads(scriptcontent)
print(jsondata['objectFlot']['plotMain1']['params']['redshift'])
I need to get the part after window.open ('/echipa/lok-moscova/Sjs63WfK') as a string
from this web element with Selenium, and I don't really know how to do it, or whether I can:
<a class="participant-imglink" onclick="window.open('/echipa/lok-moscova/Sjs63WfK')">Lok. Moscova</a>
Here is an example with BeautifulSoup (you can create the soup object from the Selenium page source):
import re
from bs4 import BeautifulSoup
txt = '''
<a class="participant-imglink" onclick="window.open('/echipa/lok-moscova/Sjs63WfK')">Lok. Moscova</a>
'''
soup = BeautifulSoup(txt, 'html.parser')
link = soup.select_one('a.participant-imglink[onclick]')
url = re.search(r"window\.open\('(.*?)'\)", link['onclick']).group(1)
print(url)
Prints:
/echipa/lok-moscova/Sjs63WfK
You need to find the element in Selenium. The easiest way is by id, but you can search by a lot of things (check more here).
from selenium.webdriver.common.by import By

link_element = driver.find_element(By.ID, "id")
Next, you can extract the attribute as a string:
text = link_element.get_attribute("onclick")
and remove the obsolete parts:
text = text.replace("window.open('", "").replace("')", "")
and that would be your "/echipa/lok-moscova/Sjs63WfK".
I am trying to see how many elements of a given class there are on this page, but the output is 0. I have been using BeautifulSoup for a while but have never seen such an error.
from bs4 import BeautifulSoup
import requests
result = requests.get("https://www.holonis.com/motivationquotes")
c = result.content
soup = BeautifulSoup(c, "html.parser")
samples = soup.findAll("div", {"class": "ng-scope"})
print(len(samples))
Output
0
and I expect the correct output to be more than 25.
This is a "dynamic" Angular-based page which needs a Javascript engine or a browser to be fully loaded. To put it differently - the HTML source code you see in the browser developer tools is not the same as you would see in the result.content - the latter is a non-rendered initial HTML of the page which does not contain the desired data.
You can use things like selenium to have the page rendered and loaded and then HTML-parse it, but why not make a direct request to the site API instead:
import requests
result = requests.get("https://www.holonis.com/api/v2/activities/motivationquotes/all?limit=15&page=0")
data = result.json()

for post in data["items"]:
    print(post["body"]["description"])
Post descriptions are retrieved and printed for example-purposes only - the post dictionaries contain all the other relevant post data that is displayed on the web-page.
Basically, result.content does not contain any divs with the ng-scope class. As stated in one of the comments, the HTML you are trying to get is added there by the JavaScript running in the browser.
I recommend the requests-html package, created by the author of the very popular requests library.
You can try to use the code below to build on that.
from requests_html import HTMLSession
session = HTMLSession()
r = session.get('https://www.holonis.com/motivationquotes')
r.html.render()
To see how many ng-scope classes are there just do this:
>>> len(r.html.find('.ng-scope'))
302
I assume you want to scrape all the hrefs from the a tags that are children of the divs you gave the image to. You can obtain them this way:
from itertools import chain

divs = r.html.find('[ng-if="!isVideo"]')
link_sets = (div.absolute_links for div in divs)
>>> list(set(chain.from_iterable(link_sets)))
['https://www.holonis.com/motivationquotes/o/byiqe-ydm',
'https://www.holonis.com/motivationquotes/o/rkhv0uq9f',
'https://www.holonis.com/motivationquotes/o/ry7ra2ycg',
...
'https://www.holonis.com/motivationquotes/o/sydzfwgcz',
'https://www.holonis.com/motivationquotes/o/s1eidcdqf']
There's nothing wrong with BeautifulSoup; in fact, the result of your GET request does not contain any ng-scope text.
You can see the output here:
>>> from bs4 import BeautifulSoup
>>> import requests
>>>
>>> result = requests.get("https://www.holonis.com/motivationquotes")
>>> c = result.content
>>>
>>> print(c)
**Verify the output yourself**
You only have the ng-cloak class, as you can see from the regex result:
import re
regex = re.compile('ng.*')
samples = soup.findAll("div", {"class": regex})
samples
#[<div class="ng-cloak slideRoute" data-ui-view="main" fixed="400" main-content="" ng-class="'{{$state.$current.animateClass}}'"></div>]
To get the content of that webpage, it is wise either to use their API or to use a browser simulator like Selenium. That webpage loads its content using lazyload: when you scroll down, you will see more content. The webpage also expands its content through pagination, like https://www.holonis.com/excellenceaddiction/1. However, you can give this a go. I've created this script to parse the content displayed within the first few pages; you can always change the page range to fit your requirement.
from selenium import webdriver
from selenium.webdriver.common.by import By

URL = "https://www.holonis.com/excellenceaddiction/{}"
driver = webdriver.Chrome()  # If necessary, define the driver path

for link in [URL.format(page) for page in range(1, 4)]:
    driver.get(link)
    for items in driver.find_elements(By.CSS_SELECTOR, '.tile-content .tile-content-text'):
        print(items.text)

driver.quit()
Btw, the above script parses the description of each post.