Printing contents from class using BeautifulSoup - python

I want to print the text inside the class.
This is the HTML snippet (it is nested inside many other classes; visually, it sits next to "Prestige"):
<div class="sc-ikPAkQ ceimHt">
9882
</div>
This is my code:
from bs4 import BeautifulSoup
import requests
URL = "https://auntm.ai/champions/abomination/tier/6"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
for data in soup.findAll('div', attrs={"class": "sc-ikPAkQ ceimHt"}):
    print(data)
I want to print the integer 9882 from that class.
I tried but failed. How do I do it?

Unlike a typical static webpage, the main content of this page is loaded dynamically with JavaScript.
That is, the response body (page.content) does not contain all the content you eventually see. When you open the page in a web browser, the browser executes the JavaScript, which then fills in the HTML with data from other sources (typically another API call, or data hardcoded in the script itself). In other words, the final HTML shown in the browser's DOM inspector differs from what you get with requests.get. (You can verify this by printing page.content, or via "View Page Source" in the page's right-click menu.)
General ways to handle this case are either:
Turn to Selenium for help. Selenium is essentially a programmatically controlled web browser (optionally without a visible window) that lets the JS code execute and render the page as normal.
Inspect the JS code and/or the additional network requests the page makes to find the underlying data source. This requires some experience with web development or JS.
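To see the symptom concretely, here is a minimal offline sketch. The shell HTML below is illustrative, not the real page: against the initial, pre-JavaScript HTML, the original find call simply returns None.

```python
from bs4 import BeautifulSoup

# Illustrative shell page: what a JS-rendered site's initial HTML often looks like
shell = '<html><body><div id="root"></div><script src="/app.js"></script></body></html>'
soup = BeautifulSoup(shell, "html.parser")

# The target class only exists after the JavaScript runs, so this finds nothing
result = soup.find("div", attrs={"class": "sc-ikPAkQ ceimHt"})
print(result)  # None
```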

You can get the text by calling .get_text() (note: .get("text") would look up an attribute named "text", not the element's text), i.e.
for data in soup.findAll('div', attrs={"class": "sc-ikPAkQ ceimHt"}):
    print(data.get_text())
Check "getText() vs text() vs get_text()" for the different ways of getting the text (and "Get text of children in a div with beautifulsoup" for an answer to your question).
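Once the element is actually present in the HTML you parse (e.g. after rendering with Selenium), extracting and converting the number is straightforward; a minimal sketch using the snippet from the question:

```python
from bs4 import BeautifulSoup

html = '<div class="sc-ikPAkQ ceimHt">\n9882\n</div>'
soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", attrs={"class": "sc-ikPAkQ ceimHt"})
value = int(div.get_text(strip=True))  # strip=True removes the surrounding whitespace
print(value)  # 9882
```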

Related

How do I fix getting "None" as a response when web scraping?

So I am trying to write a small script that gets the view count from a YouTube video and prints it. However, when printing the text variable with this code, I just get "None" as the response. Is there a way to get the actual view count using these libraries?
import requests
from bs4 import BeautifulSoup
url = requests.get("https://www.youtube.com/watch?v=dQw4w9WgXcQ")
soup = BeautifulSoup(url.text, 'html.parser')
text = soup.find('span', {'class': "view-count style-scopeytd-video-view-count-renderer"})
print(text)
To see why, fetch a copy of that page with wget or curl and look at it, or use "View Source" in your browser. That is what requests sees. None of those classes appear in the HTML you get back. That's why you get None -- because there ARE none.
YouTube builds all of its pages dynamically, through JavaScript, and requests doesn't interpret JavaScript. If you need to do this, you'll need something like Selenium to run a real browser with a JavaScript interpreter built in.

How can I view web page content that is generated using angular JS?

I have an app that uses requests to analyze and act on web page text, but it does not seem to work on this page, which is likely built with Angular: https://bio.tools/bowtie. The source HTML is different from the rendered content. I am trying to collect the DOI referenced on the page (10.1186/gb-2009-10-3-r25), but when requests fetches the HTML source, the DOI is not there.
I've heard that Google is able to parse pages that are generated using javascript. How do they do it? Any tips on viewing the DOI information with python?
You probably need an engine that runs the JavaScript in the HTTP response for you (as a web browser does). You can use Selenium for this and then parse the HTML it returns with BeautifulSoup.
Example:
from selenium import webdriver
from bs4 import BeautifulSoup
url = "https://bio.tools/bowtie"
path = "path/to/chrome/webdriver"
browser = webdriver.Chrome(path)  # Can also be Firefox, etc.
browser.get(url)
html = browser.page_source  # the HTML after the JavaScript has run
soup = BeautifulSoup(html, 'html.parser')
...
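Continuing the sketch: once browser.page_source holds the rendered HTML, the DOI can be pulled out with a generic DOI regex. The HTML string below stands in for the rendered page (illustrative only):

```python
import re

# Stand-in for browser.page_source after rendering (illustrative)
html = '<a href="https://doi.org/10.1186/gb-2009-10-3-r25">Langmead et al. 2009</a>'

# Generic DOI pattern: "10." prefix, registrant code, "/", then the suffix
match = re.search(r'10\.\d{4,9}/[-._;()/:A-Za-z0-9]+', html)
print(match.group(0))  # 10.1186/gb-2009-10-3-r25
```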

How to take table data from a website using bs4

I'm trying to scrape a website that has a table in it using bs4, but the content I get back is not as complete as what I see in the browser's inspector. I cannot find the <tr> and <td> tags in it. How can I get the full content of that site, especially the table's tags?
Here's my code:
from bs4 import BeautifulSoup
import requests
link = requests.get("https://pemilu2019.kpu.go.id/#/ppwp/hitung-suara/", verify = False)
src = link.content
soup = BeautifulSoup(src, "html.parser")
print(soup)
I expect the content to have the <tr> and <td> tags in it because they do exist when I inspect the page, but I found none in the output.
Here's an image of the page where the <tr> and <td> tags appear.
You should dump the contents of the text you're trying to parse to a file and look at it. This will tell you for sure what is and isn't there. Like this:
from bs4 import BeautifulSoup
import requests
link = requests.get("https://pemilu2019.kpu.go.id/#/ppwp/hitung-suara/", verify=False)
src = link.content
with open("/tmp/content.html", "wb") as f:  # "wb" because src is bytes, not str
    f.write(src)
soup = BeautifulSoup(src, "html.parser")
print(soup)
Run this code, then look at the file /tmp/content.html (use a different path, obviously, if you're on Windows) and see what is actually in it. You could probably do this with your browser, but this is the way to be most sure of what you are getting. You could, of course, also just add print(src), but if it were me, I'd dump it to a file.
If the HTML you're looking for is not in the initial HTML that you're getting back, then that HTML is coming from somewhere else. The table could be being built dynamically by JavaScript, or coming from another URL reference, possibly one that calls an HTTP API to grab the table's HTML via parameters passed to the API endpoint.
You will have to reverse engineer the site's design to find where that HTML comes from. If it comes from JavaScript, you may be stuck short of scripting the execution of a browser so you can gain access programmatically to the DOM in the browser's memory.
I would recommend running a debugging proxy that shows you each HTTP request your browser makes; you'll be able to see the contents of each request and response. If you can do this, you can find the URL that actually returns the content you're looking for, if such a URL exists. You'll have to deal with SSL certificates and such because this is an https endpoint, but debugging proxies usually make that pretty easy. We use Charles. The standard browser dev tools may let you do this too, i.e. see each request and response generated by a particular page load.
If you can discover the URL that actually returns the table HTML, then you can use that URL to grab it and parse it with BS.
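For illustration: if the debugging proxy reveals that the table data comes from a JSON API, you can hit that endpoint directly with requests and skip HTML parsing entirely. The endpoint path and payload shape below are hypothetical, purely to show the pattern:

```python
import json

# Hypothetical response body from a discovered API endpoint, e.g.
# https://pemilu2019.kpu.go.id/static/json/hhcw/ppwp.json (made-up path)
body = '{"chart": {"ts": "2019-05-01", "rows": [{"name": "A", "votes": 123}, {"name": "B", "votes": 456}]}}'

data = json.loads(body)
for row in data["chart"]["rows"]:
    print(row["name"], row["votes"])
```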

Adblock blocker blocking urllib.request.urlopen()

I am using Python to scrape newspaper sites and collect the actual story in text after removing various HTML tags etc.
My simple code is as follows
import urllib.request
from bs4 import BeautifulSoup
#targetURL = 'http://indianexpress.com/article/india/mamata-banerjee-army-deployment-nh-2-in-west-bengal-military-coup-4405871'
targetURL = "http://timesofindia.indiatimes.com/india/Congress-Twitter-hacking-Police-form-cyber-team-launch-probe/articleshow/55737598.cms"
#targetURL = 'http://www.telegraphindia.com/1161201/jsp/nation/story_122343.jsp#.WEDzfXV948o'
with urllib.request.urlopen(targetURL) as url:
    html = url.read()
soup = BeautifulSoup(html, 'lxml')
for el in soup.find_all("p"):
    print(el.text)
When I access the indianexpress.com or telegraphindia.com URLs, the code works just fine and I get the story, by and large without junk, in plain text.
However, the timesofindia.com site has an ad-blocker detector, and in that case the output is as follows:
We have noticed that you have an ad blocker enabled which restricts ads served on the site.
Please disable to continue reading.
How can I bypass this ad-block detection and retrieve the page? I would be grateful for any suggestions.
It looks like the actual content you're trying to extract isn't inside <p> tags, while the ad-blocker warning is. That warning text is always part of the HTML document, but it is only made visible to users if the ads fail to load.
Try extracting the contents of the <arttextxml> tag instead.
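A sketch of that approach, assuming the story text really does live in an <arttextxml> tag (the HTML below is illustrative, not the real page):

```python
from bs4 import BeautifulSoup

# Illustrative page layout: ad-block warning in <p>, story text in <arttextxml>
html = '''
<p>We have noticed that you have an ad blocker enabled.</p>
<arttextxml>Congress Twitter hacking: Police form cyber team, launch probe.</arttextxml>
'''
soup = BeautifulSoup(html, 'html.parser')
story = soup.find('arttextxml')
if story is not None:  # the tag may not exist on every article page
    print(story.get_text(strip=True))
```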

CSS selectors to be used for scraping specific links

I am new to Python and working on a scraping project. I am using Firebug to copy the CSS path of the links I need. I am trying to collect the links under the "UPCOMING EVENTS" tab of http://kiascenehai.pk/, just to learn how to get specific links.
I am looking for a fix for this problem, and also for suggestions on how to retrieve specific links using CSS selectors.
from bs4 import BeautifulSoup
import requests
url = "http://kiascenehai.pk/"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data)
for link in soup.select("html body div.body-outer-wrapper div.body-wrapper.boxed-mode div.main-outer-wrapper.mt30 div.main-wrapper.container div.row.row-wrapper div.page-wrapper.twelve.columns.b0 div.row div.page-wrapper.twelve.columns div.row div.eight.columns.b0 div.content.clearfix section#main-content div.row div.six.columns div.small-post-wrapper div.small-post-content h2.small-post-title a"):
    print(link.get('href'))
First of all, that page requires a city selection to be made (in a cookie). Use a Session object to handle this:
s = requests.Session()
s.post('http://kiascenehai.pk/select_city/submit_city', data={'city': 'Lahore'})
response = s.get('http://kiascenehai.pk/')
Now the response gets the actual page content, not redirected to the city selection page.
Next, keep your CSS selector no larger than needed. In this page there isn't much to go on as it uses a grid layout, so we first need to zoom in on the right rows:
upcoming_events_header = soup.find('div', class_='featured-event')
upcoming_events_row = upcoming_events_header.find_next(class_='row')
for link in upcoming_events_row.select('h2 a[href]'):
    print(link['href'])
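As a quick offline check of the shorter selector, here it is applied to a stripped-down, illustrative reconstruction of the page's markup (not the real HTML):

```python
from bs4 import BeautifulSoup

# Simplified, illustrative version of the page structure
html = '''
<div class="featured-event">UPCOMING EVENTS</div>
<div class="row">
  <div class="small-post-content">
    <h2 class="small-post-title"><a href="/event/1">Event One</a></h2>
  </div>
  <div class="small-post-content">
    <h2 class="small-post-title"><a href="/event/2">Event Two</a></h2>
  </div>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')
header = soup.find('div', class_='featured-event')
row = header.find_next(class_='row')  # the grid row right after the header
links = [a['href'] for a in row.select('h2 a[href]')]
print(links)  # ['/event/1', '/event/2']
```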
This is the co-founder of KiaSceneHai.pk; please don't scrape websites, a lot of effort goes into collecting the data. We offer access through our API; you can use the contact form to request access. Thank you.

Categories

Resources