How to scrape a webpage using Beautiful Soup as if using Chrome? - python

I am getting certain elements in google chrome (Inspect) but not in internet explorer when I view the source of the same web page.
I assume Beautiful Soup uses internet explorer inside? Its results match IE more closely.
However when I use the Inspect feature of chrome, I see certain elements not listed in the source.
Is there a way I can emulate this in Python or using Beautiful Soup?

You can change your user agent's to one of the following:
https://webscraping.com/blog/User-agents/
A Snippet: changing User-Agent forces the page to open different contets (mobile vs Chrome)
from bs4 import BeautifulSoup
import requests
#headers = {'User-Agent': 'Mozilla/5.0 (X11; U; Linux i686; en-US) AppleWebKit/534.3 (KHTML, like Gecko) Chrome/6.0.472.63 Safari/534.3'}
headers = {'User-Agent': 'Mozilla/5.0 (Linux; Android 5.1.1; SM-G928X Build/LMY47X) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.83 Mobile Safari/537.36'}
result = requests.get("http://derstandard.at", headers=headers)
c = result.content
print result.request.headers
print len(c)
Note: Some websites are protecting them selfes for user-agent spoofing. So not all websites might respond to these frequent jumps.

Related

How to bypass Mod_Security while scraping

I tried running this Python script using BeautifulSoup and requests modules :
from bs4 import BeautifulSoup as bs
import requests
url = 'https://udemyfreecourses.org/
headers = {'UserAgent' : 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.193 Safari/537.36'
}
soup = bs(requests.get(url, headers= headers).text, 'lxml')
But when I send this line :
print(soup.get_text())
It doesn't scrape the text data but instead, It returns this output:
Not Acceptable!Not Acceptable!An appropriate representation of the requested resource could not be found on this server. This error was generated by Mod_Security.
I even even used headers when requesting the webpage, so It can looks like a normal navigator, but I'm still getting this message that's preventing me from accessing the real webpage
Note : The webpage is working perfectly on the navigator directly, but It doesn't show much info when I try to scrape it.
Is there any other way than the one I used with headers that can get a perfect valid request from the website and bypass this security called Mod_Security?
Any help would be very very helpful, Thanks.
EDIT: The Dash in "User-Agent" is essential.
Following this Answer https://stackoverflow.com/a/61968635/8106583
headers = {
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:55.0) Gecko/20100101 Firefox/55.0',
}
Your User-Agent is the problem. This User-Agent works for me.
Also: Your ip might be blocked by now :D

BeautifulSoup is returing none

I am trying to get a value from a website using beautiful soup but it keeps returning none. This is what I have for my code so far
def getmarketchange():
source = requests.get("https://www.coinbase.com/price").text
soup = bs4.BeautifulSoup(source, "lxml")
marketchange = soup.get("MarketHealth__Percent-sc-1a64a42-2.bEMszd")
print(marketchange)
getmarketchange()
and attached is a screenshot of html code I was trying to grab.
Thank you for your help in advance!
Have a look at the HTML source returned from your get() request - it's a CAPTCHA challenge. You won't be able to get to the Coinbase pricing without passing this challenge.
Excerpt:
<h2 class="cf-subheadline"><span data-translate="complete_sec_check">
Please complete the security check to access</span> www.coinbase.com
</h2>
<div class="cf-section cf-highlight cf-captcha-container">
Coinbase is recognizing that the HTTP request isn't coming from a standard browser-based user, and it is challenging the requester. BeautifulSoup doesn't have a way to pass this check on its own.
Passing in User-Agent headers (to mimic a browser request) also doesn't resolve this issue.
For example:
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
source = requests.get("https://www.coinbase.com/price", headers=headers).text
You might find better luck with Selenium, although I'm not sure about that.
To prevent a captcha page, try to specify User-Agent header:
import requests
from bs4 import BeautifulSoup
url = 'https://www.coinbase.com/price'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:80.0) Gecko/20100101 Firefox/80.0'}
soup = BeautifulSoup(requests.get(url, headers=headers).content, 'html.parser')
print(soup.select_one('[class^="MarketHealth__Percent"]').text)
Prints:
0.65%

BeautifulSoup isn't working while web scraping Amazon

I'm new to web scraping and i am trying to use basic skills on Amazon. I want to make a code for finding top 10 'Today's Greatest Deals' with prices and rating and other information.
Every time I try to find a specific tag using find() and specifying class it keeps saying 'None'. However the actual HTML has that tag.
On manual scanning i found out half the code of isn't being displayed in the output terminal. The code displayed is half but then the body and html tag do close. Just a huge chunk of code in body tag is missing.
The last line of code displayed is:
<!--[endif]---->
then body tag closes.
Here is the code that i'm trying:
from bs4 import BeautifulSoup as bs
import requests
source = requests.get('https://www.amazon.in/gp/goldbox?ref_=nav_topnav_deals')
soup = bs(source.text, 'html.parser')
print(soup.prettify())
#On printing this it misses some portion of html
article = soup.find('div', class_ = 'a-row dealContainer dealTile')
print(article)
#On printing this it shows 'None'
Ideally, this should give me the code within the div tag, so that i can continue further to get the name of the product. However the output just shows 'None'. And on printing the whole code without tags it is missing a huge chunk of html inside.
And of course the information needed is in the missing html code.
Is Amazon blocking my request? Please help.
The User-Agent request header contains a characteristic string that allows the network protocol peers to identify the application type, operating system, software vendor or software version of the requesting software user agent. Validating User-Agent header on server side is a common operation so be sure to use valid browser’s User-Agent string to avoid getting blocked.
(Source: http://go-colly.org/articles/scraping_related_http_headers/)
The only thing you need to do is to set a legitimate user-agent. Therefore add headers to emulate a browser. :
# This is a standard user-agent of Chrome browser running on Windows 10
headers = { 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36' }
Example:
from bs4 import BeautifulSoup
import requests
headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}
resp = requests.get('https://www.amazon.com', headers=headers).text
soup = BeautifulSoup(resp, 'html.parser')
...
<your code here>
Additionally, you can add another set of headers to pretend like a legitimate browser. Add some more headers like this:
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36',
'Accept' : 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language' : 'en-US,en;q=0.5',
'Accept-Encoding' : 'gzip',
'DNT' : '1', # Do Not Track Request Header
'Connection' : 'close'
}

Python3, beautifulsoup, return nothing in specific pages

In some pages, when I use beautifulsoup, return nothing...just blank pages.
from bs4 import BeautifulSoup
import urllib.request
Site = "http://gall.dcinside.com/board/lists/?id=parkbogum&page=2"
URL = Site
html = urllib.request.urlopen(URL).read()
soup = BeautifulSoup(html, "html.parser")
print(soup)
I can use beautifulsoup any other site except this site. and I dont know way...
This URL will require certain headers passed while requesting.
Pass this headers parameter while requesting the URL and you will get the HTML.
HTML = requests.get(URL , headers = headers).content
while
headers = {
"method":"GET",
"user-agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36
(KHTML, like Gecko) Chrome/60.0.3112.101 Safari/537.36",
"Host":"gall.dcinside.com",
"Pragma":"no-cache",
"Upgrade-Insecure-Requests":"1",
"Accept":"text/html,application/xhtml+xml,
application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8"
}
As I can see, this site is using cookies. You can see the headers in the browser's developer tool. You can get the cookie by following:
import urllib.request
r = urllib.request.urlopen(URL)
ck = r.getheader('Set-Cookie')
Now you can create the header like this and send it with subsequent requests.
headers = {
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
"Cookie": ck,
"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.79 Safari/537.36"
}
req = urllib.request.Request(URL, headers=headers)
html = urllib.request.urlopen(req).read()
Some website servers look for robot scripts trying to access their pages. One of the simpler methods of doing this is to check to see which User-Agent is being sent by the browser. In this case as you are using Python and not a web browser, the following is being sent:
python-requests/2.18.4
When it sees an agent it does not like, it will return nothing. To get around this, you need to change the User-Agent string in your request. There are hundreds to choose from, as the agent string changes with each release of a browser. For example see this list of Firefox User-Agent strings e.g.
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1
Mozilla/5.0 (Windows NT 6.3; rv:36.0) Gecko/20100101 Firefox/36.0
The trick is to try a few, and find one that the server is happy with. In your case, ONLY the header needs to be changed in order to get HTML to be returned from the website. In some cases, cookies will also need to be used.
The header can be easily changed by passing a dictionary. This could be done using requests as follows:
from bs4 import BeautifulSoup
import requests
url = "http://gall.dcinside.com/board/lists/?id=parkbogum&page=2"
html = requests.get(url, headers={'User-Agent': 'Mozilla/5.0 (iPad; U; CPU OS 3_2_1 like Mac OS X; en-us) AppleWebKit/531.21.10 (KHTML, like Gecko) Mobile/7B405'}).content
soup = BeautifulSoup(html, "html.parser")
print(soup)

Unable to scrape google news accurately

I'm trying to scrape google headlines for a given keyword (eg. Blackrock) for a given period (eg. 7-jan-2012 to 14-jan-2012).
I'm trying to do this by constructing the url and then using urllib2 as shown in the code below. if I put the constructed url in a browser, it gives me the correct result. however, if I use it through python, I get news results for the right keyword but for the current period.
here'e the code. Can someone tell me what I'm doing wrong and how I can correct it?
import urllib
import urllib2
import json
from bs4 import BeautifulSoup
import requests
url = 'https://www.google.com/search?q=Blackrock&hl=en&gl=uk&authuser=0&source=lnt&tbs=cdr%3A1%2Ccd_min%3A7%2F1%2F2012%2Ccd_max%3A14%2F1%2F2012&tbm=nws'
req = urllib2.Request(url)
req.add_header('User-Agent', 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.0.3 Gecko/2008092417 Firefox/3.0.3')
response = urllib2.urlopen(req)
html = response.read()
soup = BeautifulSoup(html)
text = soup.text
start = text.index('000 results')+11
end = text.index('NextThe selection')
text = text[start:end]
print text
The problem is with your user-agent, it works for me with:
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36')
You are using a user-agent for Firefox 3, which is about 6 years old.

Categories

Resources