Hi, I want to get the like count (the number 18) from the em tag on the page.
When I run my code it does not work and only gives me an empty list. Can anyone help me? Thank you~
Here is my code:
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = 'https://blog.naver.com/kwoohyun761/221945923725'
html = urlopen(url)
soup = BeautifulSoup(html, 'lxml')
likes = soup.find_all('em', class_='u_cnt _count')
print(likes)
When you disable javascript you'll see that the like count is loaded dynamically, so you have to use something that renders the website before you can parse the content.
You can use a rendering API such as: https://www.scraperapi.com/
Or run your own renderer, for example Splash: https://github.com/scrapinghub/splash
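For example, if you run Splash locally (it listens on port 8050 by default), a minimal sketch could look like the following; the target URL and the CSS class come from your snippet, while the local Splash endpoint and the wait value are assumptions:
# Minimal sketch, assuming a local Splash instance on the default port 8050.
# Splash fetches and renders the page, then returns the resulting HTML.
import requests
from bs4 import BeautifulSoup

target = 'https://blog.naver.com/kwoohyun761/221945923725'
rendered = requests.get(
    'http://localhost:8050/render.html',
    params={'url': target, 'wait': 2},  # wait a couple of seconds for scripts to run
)

soup = BeautifulSoup(rendered.text, 'lxml')
likes = soup.find_all('em', class_='u_cnt _count')
print(likes)
If the count still comes back empty, the page may need a longer wait or further interaction before the element is populated.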
EDIT:
First of all, I missed that you were using urlopen incorrectly; the correct way is described here: https://docs.python.org/3/howto/urllib2.html (assuming you are using Python 3, which seems to be the case judging by the print statement).
Furthermore, looking at the issue again, it is a bit more complicated. When you look at the source code of the page, it actually loads an iframe, and that iframe contains the actual content. Hit Ctrl+U to see the source code of the original URL, since the site seems to block the browser context menu.
So in order to achieve your crawling objective you have to first grab the initial page and then grab the page you are interested in:
from urllib.request import urlopen
from bs4 import BeautifulSoup

# original url
url = "https://blog.naver.com/kwoohyun761/221945923725"

with urlopen(url) as response:
    html = response.read()

soup = BeautifulSoup(html, 'lxml')
iframe = soup.find('iframe')

# iframe grabbed, construct the real url
print(iframe['src'])
real_url = "https://blog.naver.com" + iframe['src']

# do your crawling
with urlopen(real_url) as response:
    html = response.read()

soup = BeautifulSoup(html, 'lxml')
likes = soup.find_all('em', class_='u_cnt _count')
print(likes)
You might be able to avoid one round trip by analyzing the original url and the URL in the iframe. At first glance it looked like the iframe url can be constructed from the original url.
You'll still need a rendered version of the iframe url to grab your desired value.
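For illustration only, here is how one might construct the iframe URL straight from the original URL; this assumes Naver blog URLs have the form https://blog.naver.com/<blogId>/<logNo> and that the iframe points at PostView.nhn, neither of which is guaranteed, and as said above you would still need a rendered version to read the count:
# Sketch, assuming the blog URL is https://blog.naver.com/<blogId>/<logNo>
# and the iframe src follows the PostView.nhn?blogId=...&logNo=... pattern.
from urllib.parse import urlparse

url = "https://blog.naver.com/kwoohyun761/221945923725"
blog_id, log_no = urlparse(url).path.strip('/').split('/')
real_url = f"https://blog.naver.com/PostView.nhn?blogId={blog_id}&logNo={log_no}"
print(real_url)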
I don't know what this site is about, but it seems they do not want to be crawled, so maybe you should respect that.
Related
I was trying to parse an image link from a website.
When I inspect the link on the website, it is this one: https://static.nike.com/a/images/c_limit,w_592,f_auto/t_product_v1/df7c2668-f714-4ced-9f8f-1f0024f945a9/chaussure-de-basketball-zoom-freak-3-MZpJZF.png. But when I parse it with my code, the output is data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7.
from bs4 import BeautifulSoup
import requests
source = requests.get('https://www.nike.com/fr/w/hommes-chaussures-nik1zy7ok').text
soup = BeautifulSoup(source, 'lxml')
pair = soup.find('div', class_='product-card__body')
image_scr = pair.find('img', class_='css-1fxh5tw product-card__hero-image')['src']
print(image_scr)
I think the code isn't the issue, but I don't know what's causing the link to come out in base64 format. So how could I get the code to return the link as a .png?
What happens?
First of all, take a look into your soup - that is where the truth is. The website does not provide all information statically; a lot of things are rendered dynamically by the browser, so requests won't get that information this way.
Workaround
Take a look at the <noscript> tag next to your selection; it holds a smaller version of the image and provides the src.
Example
from bs4 import BeautifulSoup
import requests
source = requests.get('https://www.nike.com/fr/w/hommes-chaussures-nik1zy7ok').content
soup = BeautifulSoup(source, 'lxml')
pair = soup.find('div', class_='product-card__body')
image_scr = pair.select_one('noscript img.css-1fxh5tw.product-card__hero-image')['src']
print(image_scr)
Output
https://static.nike.com/a/images/c_limit,w_318,f_auto/t_product_v1/df7c2668-f714-4ced-9f8f-1f0024f945a9/chaussure-de-basketball-zoom-freak-3-MZpJZF.png
If you want a bigger picture, just replace the parameter w_318 with w_1000, for example...
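For instance, a plain string replacement on the src you already extracted would do; w_318 is simply the width parameter from the URL above:
# swap the width parameter in the CDN URL for a larger rendition
big_image_src = image_scr.replace('w_318', 'w_1000')
print(big_image_src)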
Edit
Concerning your comment - there are a lot more solutions, depending on what you want to do with the information and what you are going to work with.
The following approach uses selenium, which, unlike requests, renders the website and gives you the "right" page source back, but it also needs more resources than requests:
from bs4 import BeautifulSoup
from selenium import webdriver

# use a raw string (or forward slashes) for the Windows path to the chromedriver binary
driver = webdriver.Chrome(r'C:\Program Files\ChromeDriver\chromedriver.exe')
driver.get('https://www.nike.com/fr/w/hommes-chaussures-nik1zy7ok')

soup = BeautifulSoup(driver.page_source, 'html.parser')
pair = soup.find('div', class_='product-card__body')
image_scr = pair.select_one('img.css-1fxh5tw.product-card__hero-image')['src']
print(image_scr)
Output
https://static.nike.com/a/images/c_limit,w_592,f_auto/t_product_v1/df7c2668-f714-4ced-9f8f-1f0024f945a9/chaussure-de-basketball-zoom-freak-3-MZpJZF.png
As you want to grab the src, meaning image data downloaded from the server using requests, you need to use the .content form of the response as follows:
source = requests.get('https://www.nike.com/fr/w/hommes-chaussures-nik1zy7ok').content
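To be precise, .text and .content only differ in whether the body is returned as a decoded string or as raw bytes, so for parsing the HTML either one works with BeautifulSoup. Where .content really matters is when you download the image file itself; a small sketch (the output filename is just an example) might look like this:
import requests

# the w_592 product image URL from the question
image_url = ('https://static.nike.com/a/images/c_limit,w_592,f_auto/t_product_v1/'
             'df7c2668-f714-4ced-9f8f-1f0024f945a9/'
             'chaussure-de-basketball-zoom-freak-3-MZpJZF.png')

response = requests.get(image_url)

# .content holds the raw PNG bytes, which is what you want to write to disk
with open('zoom-freak-3.png', 'wb') as f:
    f.write(response.content)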
I am trying to get the links to the individual search results on a website (National Gallery of Art). But the link to the search doesn't load the search results. Here is how I try to do it:
url = 'https://www.nga.gov/collection-search-result.html?artist=C%C3%A9zanne%2C%20Paul'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
I would expect the links to the individual results to show up under soup.findAll('a'), but they do not appear; instead, the last output is a link to an empty search result:
https://www.nga.gov/content/ngaweb/collection-search-result.html
How could I get a list of links, the first of which is the first search result (https://www.nga.gov/collection/art-object-page.52389.html), the second is the second search result (https://www.nga.gov/collection/art-object-page.52085.html) etc?
Actually, the data is generated from an API call that returns a JSON response. Here is the desired list of links.
Code:
import requests

url = 'https://www.nga.gov/collection-search-result/jcr:content/parmain/facetcomponent/parList/collectionsearchresu.pageSize__30.pageNumber__1.json?artist=C%C3%A9zanne%2C%20Paul&_=1634762134895'

r = requests.get(url)
for item in r.json()['results']:
    url = item['url']
    abs_url = f'https://www.nga.gov{url}'
    print(abs_url)
Output:
https://www.nga.gov/content/ngaweb/collection/art-object-page.52389.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.52085.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.46577.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.46580.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.46578.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.136014.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.46576.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.53120.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.54129.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.52165.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.46575.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.53122.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.93044.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.66405.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.53119.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.53121.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.46579.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.66406.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.45866.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.53123.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.45867.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.45986.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.45877.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.136025.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.74193.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.74192.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.66486.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.76288.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.76223.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.76268.html
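If you need more than the first 30 results, the pageSize__30.pageNumber__1 part of the URL looks like plain pagination. A sketch that walks the first few pages, assuming that pattern holds and stopping once a page comes back empty, could look like this:
import requests

# assumed URL pattern, derived from the request above
base = ('https://www.nga.gov/collection-search-result/jcr:content/parmain/'
        'facetcomponent/parList/collectionsearchresu.'
        'pageSize__30.pageNumber__{page}.json')
params = {'artist': 'Cézanne, Paul'}

links = []
for page in range(1, 6):  # first five pages; adjust as needed
    r = requests.get(base.format(page=page), params=params)
    results = r.json().get('results', [])
    if not results:
        break
    links.extend(f"https://www.nga.gov{item['url']}" for item in results)

print(len(links))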
This seems to work for me:
from bs4 import BeautifulSoup
import requests

url = 'https://www.nga.gov/collection-search-result.html?artist=C%C3%A9zanne%2C%20Paul'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

for a in soup.findAll('a'):
    print(a['href'])
It returns all of the a href links in the HTML.
For the links from the search results specifically, those are loaded via AJAX, so you would need something that renders the javascript, like headless Chrome. You can read about one way to implement this here, which fits your use case very closely: http://theautomatic.net/2019/01/19/scraping-data-from-javascript-webpage-python/
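As a rough illustration of that approach, a headless Chrome version with selenium might look like the sketch below; driver setup and the wait time depend on your selenium version and local install, so treat both as assumptions:
import time
from bs4 import BeautifulSoup
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)

driver.get('https://www.nga.gov/collection-search-result.html'
           '?artist=C%C3%A9zanne%2C%20Paul')
time.sleep(5)  # crude wait for the AJAX results; WebDriverWait would be cleaner

# now the search results have been filled in by javascript
soup = BeautifulSoup(driver.page_source, 'html.parser')
for a in soup.find_all('a', href=True):
    print(a['href'])

driver.quit()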
If you want to ask how to render javascript from python and then parse the result, you would need to close this question and open a new one, as it is not scoped correctly as is.
I am having trouble figuring out how to use BeautifulSoup to scrape all 100 link titles on the page, since each one is under an "a href = ...". I have tried the code below, but it returns a blank result.
from bs4 import BeautifulSoup
from urllib.request import urlopen
import bs4
url = 'https://www150.statcan.gc.ca/n1/en/type/data?count=100'
page = urlopen(url)
soup = bs4.BeautifulSoup(page,'html.parser')
title = soup.find_all('a')
Additionally, is there a way to ensure I am scraping everything under the "Tables (8898)" tab? Thanks in advance!
Link:
https://www150.statcan.gc.ca/n1/en/type/data?count=100
The link you provided loads its contents with async javascript requests. So when you execute page = urlopen(url), it only fetches the empty HTML shell and the javascript blocks.
You need to use a browser to execute the js and load the page contents. You can check out this link to learn how to do it: https://towardsdatascience.com/web-scraping-using-selenium-python-8a60f4cf40ab
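A minimal selenium sketch for that page could look like the following; the wait time is arbitrary and the anchor filter is only an assumption, so inspect the rendered page to pick a more precise selector (for example one scoped to the "Tables" tab):
import time
from bs4 import BeautifulSoup
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)

driver.get('https://www150.statcan.gc.ca/n1/en/type/data?count=100')
time.sleep(5)  # crude wait for the async requests to finish

soup = BeautifulSoup(driver.page_source, 'html.parser')
for a in soup.find_all('a', href=True):
    title = a.get_text(strip=True)
    if title:
        print(title, a['href'])

driver.quit()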
I am trying to see how many elements with the class ng-scope are on this page, but the output is 0. I have been using BeautifulSoup for a while and never saw such an error.
from bs4 import BeautifulSoup
import requests
result = requests.get("https://www.holonis.com/motivationquotes")
c = result.content
soup = BeautifulSoup(c)
samples = soup.findAll("div", {"class": "ng-scope"})
print(len(samples))
Output
0
I want the correct output, which should be at least 25.
This is a "dynamic" Angular-based page which needs a Javascript engine or a browser to be fully loaded. To put it differently - the HTML source code you see in the browser developer tools is not the same as you would see in the result.content - the latter is a non-rendered initial HTML of the page which does not contain the desired data.
You can use things like selenium to have the page rendered and loaded and then HTML-parse it, but why not make a direct request to the site's API:
import requests

result = requests.get("https://www.holonis.com/api/v2/activities/motivationquotes/all?limit=15&page=0")
data = result.json()

for post in data["items"]:
    print(post["body"]["description"])
Post descriptions are retrieved and printed for example purposes only - the post dictionaries contain all the other relevant post data that is displayed on the web page.
Basically, result.content does not contain any divs with the ng-scope class. As stated in one of the comments, the HTML you are trying to get is added by the javascript running in the browser.
I recommend the package requests-html, created by the author of the very popular requests.
You can try to use the code below to build on that.
from requests_html import HTMLSession
session = HTMLSession()
r = session.get('https://www.holonis.com/motivationquotes')
r.html.render()
To see how many ng-scope classes are there just do this:
>>> len(r.html.find('.ng-scope'))
302
I assume you want to scrape all the hrefs from the a tags that are children of the divs shown in your image. You can obtain them this way:
from itertools import chain  # needed for chain.from_iterable below

divs = r.html.find('[ng-if="!isVideo"]')
link_sets = (div.absolute_links for div in divs)
>>> list(set(chain.from_iterable(link_sets)))
['https://www.holonis.com/motivationquotes/o/byiqe-ydm',
'https://www.holonis.com/motivationquotes/o/rkhv0uq9f',
'https://www.holonis.com/motivationquotes/o/ry7ra2ycg',
...
'https://www.holonis.com/motivationquotes/o/sydzfwgcz',
'https://www.holonis.com/motivationquotes/o/s1eidcdqf']
There's nothing wrong with BeautifulSoup; in fact, the result of your GET request does not contain any ng-scope text.
You can see the output here:
>>> from bs4 import BeautifulSoup
>>> import requests
>>>
>>> result = requests.get("https://www.holonis.com/motivationquotes")
>>> c = result.content
>>>
>>> print(c)
(output omitted - verify it yourself; it contains no ng-scope)
You only have the ng-cloak class, as you can see from the regex result:
import re
regex = re.compile('ng.*')
samples = soup.findAll("div", {"class": regex})
samples
#[<div class="ng-cloak slideRoute" data-ui-view="main" fixed="400" main-content="" ng-class="'{{$state.$current.animateClass}}'"></div>]
To get the content of that webpage, it is wise either to use their API or to use a browser simulator like selenium. That webpage loads its content lazily: when you scroll down you will see more content, and it expands through pagination like https://www.holonis.com/excellenceaddiction/1. However, you can give this a go. I've created this script to parse the content displayed within the first few pages; you can always change the page number to suit your requirement.
from selenium import webdriver

URL = "https://www.holonis.com/excellenceaddiction/{}"

driver = webdriver.Chrome()  # if necessary, define the path to chromedriver

for link in [URL.format(page) for page in range(1, 4)]:
    driver.get(link)
    for items in driver.find_elements_by_css_selector('.tile-content .tile-content-text'):
        print(items.text)

driver.quit()
Btw, the above script parses the description of each post.
I am using python and beautifulsoup for html parsing.
I am using the following code :
from BeautifulSoup import BeautifulSoup
import urllib2
import re
url = "http://www.wikipathways.org//index.php?query=signal+transduction+pathway&species=Homo+sapiens&title=Special%3ASearchPathways&doSearch=1&ids=&codes=&type=query"
main_url = urllib2.urlopen(url)
content = url.read()
soup = BeautifulSoup(content)
for a in soup.findAll('a', href=True):
    print a[href]
But I am not getting output links like:
http://www.wikipathways.org/index.php/Pathway:WP26
Also, an important thing is that there are 107 pathways, but I will not get all the links, as the other links depend on the "show links" button at the bottom of the page.
So, how can I get all the links (107 links) from that URL?
Your problem is line 8, content = url.read(). You're not actually reading the webpage there; url is just a string, so calling .read() on it should give you an error.
main_url is what you want to read, so change line 8 to:
content = main_url.read()
You also have another error: print a[href]. The key href should be a string, so it should be:
print a['href']
I would suggest using lxml; it's faster and better for parsing HTML, and worth investing the time to learn.
from lxml.html import parse
dom = parse('http://www.wikipathways.org//index.php?query=signal+transduction+pathway&species=Homo+sapiens&title=Special%3ASearchPathways&doSearch=1&ids=&codes=&type=query').getroot()
links = dom.cssselect('a')
That should get you going.
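Building on that, a short follow-up that prints the href of each anchor and keeps only pathway links might look like the sketch below; the 'Pathway:' filter is an assumption based on the link format you posted, and this still only sees the anchors in the initial HTML, so the "show links" caveat from your question still applies:
from lxml.html import parse

url = ('http://www.wikipathways.org//index.php?query=signal+transduction+pathway'
       '&species=Homo+sapiens&title=Special%3ASearchPathways&doSearch=1'
       '&ids=&codes=&type=query')
dom = parse(url).getroot()
dom.make_links_absolute()  # turn relative hrefs into full URLs

for a in dom.cssselect('a'):
    href = a.get('href')
    if href and 'Pathway:' in href:
        print(href)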