I am using Python to scrape newspaper sites and collect the actual story text after removing the various HTML tags, etc.
My simple code is as follows:
import urllib.request
from bs4 import BeautifulSoup

#targetURL = 'http://indianexpress.com/article/india/mamata-banerjee-army-deployment-nh-2-in-west-bengal-military-coup-4405871'
targetURL = "http://timesofindia.indiatimes.com/india/Congress-Twitter-hacking-Police-form-cyber-team-launch-probe/articleshow/55737598.cms"
#targetURL = 'http://www.telegraphindia.com/1161201/jsp/nation/story_122343.jsp#.WEDzfXV948o'

with urllib.request.urlopen(targetURL) as url:
    html = url.read()

soup = BeautifulSoup(html, 'lxml')
for el in soup.find_all("p"):
    print(el.text)
When I access the indianexpress.com URL or the telegraphindia.com URL, the code works just fine and I get the story, by and large without junk, in pure text form.
However, the timesofindia.com site has an ad-block detector, and in this case the output is as follows:
We have noticed that you have an ad blocker enabled which restricts ads served on the site.
Please disable to continue reading.
How can I bypass this ad-block detector and retrieve the page? I will be grateful for any suggestions.
It looks like the actual content you're trying to extract isn't inside <p> tags. However, the ad blocker warning is inside such tags. This text is always part of the HTML document, but it is only made visible to users if ads fail to load.
Try extracting the contents of the <arttextxml> tag instead.
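A minimal sketch of that suggestion, reusing the code from the question; the <arttextxml> tag name is taken from the answer above and assumes the page's markup still contains it:
import urllib.request
from bs4 import BeautifulSoup

targetURL = "http://timesofindia.indiatimes.com/india/Congress-Twitter-hacking-Police-form-cyber-team-launch-probe/articleshow/55737598.cms"
with urllib.request.urlopen(targetURL) as url:
    html = url.read()

soup = BeautifulSoup(html, 'lxml')
# Look for the tag that holds the story body instead of the <p> tags
art = soup.find("arttextxml")
if art is not None:
    print(art.text)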
Related
So I am trying to create a small script that gets the view count from a YouTube video and prints it. However, when I print the text variable with this code, I just get None. Is there a way to get the actual view count using these libraries?
import requests
from bs4 import BeautifulSoup
url = requests.get("https://www.youtube.com/watch?v=dQw4w9WgXcQ")
soup = BeautifulSoup(url.text, 'html.parser')
text = soup.find('span', {'class': "view-count style-scope ytd-video-view-count-renderer"})
print(text)
To see why, you should use wget or curl to fetch a copy of that page and look at it, or use "view source" from your browser. That's what requests sees. None of those classes appear in the HTML you get back. That's why you get None -- because there ARE none.
YouTube builds all of its pages dynamically, through Javascript. requests doesn't interpret Javascript. If you need to do this, you'll need to use something like Selenium to run a real browser with a Javascript interpreter built in.
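A hedged sketch of that route: drive a real browser so the JavaScript executes, then parse the rendered DOM. The view-count class is the one from the question and is an assumption, since YouTube's markup changes often:
import time
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()  # needs Chrome and a matching driver available
driver.get("https://www.youtube.com/watch?v=dQw4w9WgXcQ")
time.sleep(5)  # crude wait for the JS to render; a WebDriverWait is more robust

soup = BeautifulSoup(driver.page_source, 'html.parser')
# BeautifulSoup matches a single class against multi-valued class attributes
text = soup.find('span', {'class': 'view-count'})
print(text.get_text(strip=True) if text else None)
driver.quit()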
I want to print the text inside a class.
This is the HTML snippet (it is nested inside many classes; visually, it sits next to "Prestige"):
<div class="sc-ikPAkQ ceimHt">
9882
</div>
This is my code:
from bs4 import BeautifulSoup
import requests

URL = "https://auntm.ai/champions/abomination/tier/6"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
for data in soup.findAll('div', attrs={"class": "sc-ikPAkQ ceimHt"}):
    print(data)
I want to print the integer 9882 from that class.
I tried but failed. How do I do so?
Unlike a typical static webpage, the main content of this webpage is loaded dynamically with JavaScript.
That is, the response body (page.content) won't contain all the content you finally see. Instead, when you access the webpage in a web browser, the browser executes the JavaScript code, which then updates the HTML with data from other sources (typically another API call, or data hardcoded in the script itself). In other words, the final HTML shown in a browser's DOM inspector is different from what you get with requests.get. (You can verify this by printing page.content, or via the "View Page Source" entry in the page's right-click menu.)
General ways to handle this case are either:
Turn to Selenium for help. Selenium is essentially a programmatically controlled web browser (optionally headless, i.e. without a visible window) that lets the JS code execute and render the webpage as normal; see the sketch after this list.
Inspect the JS code and/or the additional network requests on that page to find the actual data source. This requires some experience and knowledge of web development or JS.
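A minimal sketch of the Selenium option, assuming the URL and class names from the question are still valid (selectors on that site may change):
import time
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()  # needs Chrome and a matching driver available
driver.get("https://auntm.ai/champions/abomination/tier/6")
time.sleep(5)  # crude wait for the page's JS to finish rendering

# Parse the rendered DOM, which now includes the JS-injected content
soup = BeautifulSoup(driver.page_source, "html.parser")
for data in soup.find_all('div', attrs={"class": "sc-ikPAkQ ceimHt"}):
    print(data.get_text(strip=True))  # expected to print values like 9882
driver.quit()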
You can get the text by calling .get_text() (note that .get("text") would look up an attribute named "text", not the text content), i.e.:
for data in soup.findAll('div', attrs={"class": "sc-ikPAkQ ceimHt"}):
    print(data.get_text())
Check "getText() vs text() vs get_text()" for the different ways of getting the text (and "Get text of children in a div with beautifulsoup" for an answer to your question).
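A quick illustration of those accessors, using the div from the question as a stand-in:
from bs4 import BeautifulSoup

html = '<div class="sc-ikPAkQ ceimHt"> 9882 </div>'
div = BeautifulSoup(html, "html.parser").div

print(div.get_text())            # ' 9882 '  (the canonical method)
print(div.getText())             # alias for get_text()
print(div.text)                  # property wrapper around get_text()
print(div.get_text(strip=True))  # '9882', surrounding whitespace stripped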
I'm trying to access a website at work, however it's username/password protected. The user/password pop-up looks as in the attached login image.
I attach my code for viewing the website.
I can see the HTML code, however with an error "401 Authorization Required".
Can you please help?
import requests
from bs4 import BeautifulSoup as bs
r = requests.get("http://10.75.19.101/mfgindex", auth=('root', 'password'))
# Convert to beautiful soup object
soup = bs(r.content, features="html.parser")
# print
print(soup.prettify())
Generally, if a site is password-protected, you obviously can't bypass the login procedure. That forces you to use an RPA-style process where your code controls the web browser, performs the login action with the real username and password, then browses the pages you need and extracts the required elements from the HTML using BeautifulSoup.
For that I suggest to try out Selenium (https://www.selenium.dev/)
A short tutorial is here:
https://medium.com/ymedialabs-innovation/web-scraping-using-beautiful-soup-and-selenium-for-dynamic-page-2f8ad15efe25
I tried it to solve a similar task some time ago and it worked well.
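A minimal sketch of that approach, assuming the pop-up is a browser HTTP auth dialog (which the 401 response suggests); the host and credentials are the ones from the question:
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
# Embedding credentials in the URL is one common way to satisfy an HTTP
# Basic Auth prompt from Selenium; note that some browsers block this form.
driver.get("http://root:password@10.75.19.101/mfgindex")

soup = BeautifulSoup(driver.page_source, "html.parser")
print(soup.prettify())
driver.quit()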
I'm trying to scrape a website that has a table in it using bs4, but the content I'm getting is not as complete as what I see in the browser inspector. I cannot find the <tr> and <td> tags in it. How can I get the full content of that site, especially the tags for the table?
Here's my code:
from bs4 import BeautifulSoup
import requests
link = requests.get("https://pemilu2019.kpu.go.id/#/ppwp/hitung-suara/", verify = False)
src = link.content
soup = BeautifulSoup(src, "html.parser")
print(soup)
I expect the content to have the <tr> and <td> tags in it because they do exist when I inspect the page, but I find none in the output.
Here's an image of the page showing where the <tr> and <td> tags appear.
You should dump the contents of the text you're trying to parse to a file and look at it. This will tell you for sure what is and isn't there. Like this:
from bs4 import BeautifulSoup
import requests

link = requests.get("https://pemilu2019.kpu.go.id/#/ppwp/hitung-suara/", verify=False)
src = link.content

# link.content is bytes, so open the file in binary mode
with open("/tmp/content.html", "wb") as f:
    f.write(src)

soup = BeautifulSoup(src, "html.parser")
print(soup)
Run this code and then look at the file "/tmp/content.html" (use a different path, obviously, if you're on Windows) to see what is actually in the file. You could probably do this with your browser, but this is the way to be most sure you know what you are getting. You could, of course, also just add print(src), but if it were me, I'd dump it to a file.
If the HTML you're looking for is not in the initial HTML that you're getting back, then that HTML is coming from somewhere else. The table could be being built dynamically by JavaScript, or coming from another URL reference, possibly one that calls an HTTP API to grab the table's HTML via parameters passed to the API endpoint.
You will have to reverse engineer the site's design to find where that HTML comes from. If it comes from JavaScript, you may be stuck short of scripting the execution of a browser so you can gain access programmatically to the DOM in the browser's memory.
I would recommend running a debugging proxy that will show you each HTTP request made by your browser, along with the contents of each request and response. If you can do this, you can find the URL that actually returns the content you're looking for, if such a URL exists. You'll have to deal with SSL certificates and such because this is an https endpoint, but debugging proxies usually make that pretty easy. We use Charles. The standard browser developer tools might do this too, letting you see each request and response generated by a particular page load.
If you can discover the URL that actually returns the table HTML, then you can use that URL to grab it and parse it with BS.
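A hedged sketch of that last step, once the network tab or a proxy has revealed the real data URL. The endpoint below is a placeholder, not a verified URL from the site:
import requests

# Hypothetical endpoint discovered via the browser's network tab
data_url = "https://pemilu2019.kpu.go.id/static/json/example.json"
resp = requests.get(data_url, verify=False)  # verify=False mirrors the question
# Endpoints like this often return JSON rather than rendered HTML
print(resp.json())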
I've been trying to create a simple web scraper to collect the book titles from a 100-bestseller list on Amazon. I've used this code before on another site with no problems. But for some reason, it scrapes the first page fine and then prints the same results for the following iterations.
I'm not sure if it's something to do with how Amazon creates its URLs. When I manually enter the "#2" (and beyond) at the end of the URL in the browser, it navigates fine.
(Once the scrape is working I plan on dumping the data into CSV files. But for now, printing to the terminal will do.)
import requests
from bs4 import BeautifulSoup

for i in range(5):
    url = "https://smile.amazon.com/Best-Sellers-Kindle-Store-Dystopian-Science-Fiction/zgbs/digital-text/6361470011/ref=zg_bs_nav_kstore_4_158591011#{}".format(i)
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "lxml")
    for book in soup.find_all('div', class_='zg_itemWrapper'):
        title = book.find('div', class_='p13n-sc-truncate')
        name = book.find('a', class_='a-link-child')
        price = book.find('span', class_='p13n-sc-price')
        print(title)
        print(name)
        print(price)
print("END")
This is a common problem you have to face: some sites load their data asynchronously (with AJAX). These are XMLHttpRequests, which you can see in the Network tab of your DOM inspector. Usually such websites load the data from a different endpoint with a POST method; to handle that you can use the urllib or requests library. Note also that the "#{}" you append is a URL fragment, which the browser handles locally and never sends to the server, so every iteration of your loop fetches the same page.
In this case the request goes through a GET method, and you can scrape the data from this URL with no need to extend your code: https://www.amazon.com/Best-Sellers-Kindle-Store-Dystopian-Science-Fiction/zgbs/digital-text/6361470011/ref=zg_bs_pg_3?_encoding=UTF8&pg=3&ajax=1 where you only change the pg parameter.
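A short sketch of that suggestion: request the AJAX endpoint directly and vary the pg parameter. The URL comes from the answer above; the class names are those from the question and may since have changed:
import requests
from bs4 import BeautifulSoup

base = ("https://www.amazon.com/Best-Sellers-Kindle-Store-Dystopian-Science-Fiction"
        "/zgbs/digital-text/6361470011/ref=zg_bs_pg_{pg}?_encoding=UTF8&pg={pg}&ajax=1")

for pg in range(1, 6):
    r = requests.get(base.format(pg=pg))
    soup = BeautifulSoup(r.content, "lxml")
    for book in soup.find_all('div', class_='zg_itemWrapper'):
        title = book.find('div', class_='p13n-sc-truncate')
        print(title.get_text(strip=True) if title else None)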