Bypass cookie agreement page while web scraping using Python

I am running into Google's cookie agreement page after requesting a Google redirect URL.
I am trying to scrape different pages from Google News URIs, but when I run this code:
req = requests.get(url,headers=headers)
with "headers" = {'User-Agent': 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_8; en-US) AppleWebKit/534.1 (KHTML, like Gecko) Chrome/6.0.422.0 Safari/534.1', 'Upgrade-Insecure-Requests': '1', 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', 'DNT': '1', 'Accept-Encoding': 'gzip, deflate', 'Accept-Language': 'it-IT'}
and for example URL = https://news.google.com/./articles/CAIiEMb3PYSjFFVbudiidQPL79QqGQgEKhAIACoHCAow-ImTCzDRqagDMKiIvgY?hl=it&gl=IT&ceid=IT%3Ait
the "request.content" is the HTMLs code of agreement cookies page by Google.
I have tried also to convert the redirect link into a normal link but the response gives me the redirect link to this
I have the same problem related to this question (How can I bypass a cookie agreement page while web scraping using Python?).
However, the solution proposed there works only for that specific site.
Note: the entire code worked until a few weeks ago.

I solved the problem by adding the line
'Cookie':'CONSENT=YES+cb.20210418-17-p0.it+FX+917; '
to the request header.
The page returned by the request is still a Google page, but that page contains the link to the article the redirect points to.
So, once I had that page, I did some additional scraping to extract the link and then made the request I actually wanted.
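For reference, a minimal sketch of that workaround (the News URL and User-Agent below are placeholders, and the CONSENT cookie value is the one from the header above, so treat it as subject to change by Google):
import requests
# Cookie value taken from the workaround above; Google may change its format.
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0 Safari/537.36',
    'Accept-Language': 'it-IT',
    'Cookie': 'CONSENT=YES+cb.20210418-17-p0.it+FX+917; ',
}
url = 'https://news.google.com/articles/...'  # placeholder Google News redirect URL
resp = requests.get(url, headers=headers)
# The response is still a Google interstitial page, but it contains the link to
# the target article, which can be extracted with a second round of parsing.
print(resp.status_code)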

Related

BeautifulSoup returning 'None'

I am trying to get a list of reviews from the G2 website using BeautifulSoup. However, when I run the code below, it says that 'reviews' is 'NoneType'. I can't figure this out because the class name clearly appears in the site's HTML. I have used this exact syntax to scrape other sites and it has worked, so I have no idea why it is returning NoneType here. I also tried 'find_all' and printed the length of the resulting list (the number of reviews), but that shows NoneType as well. I am super confused. Please help!
response = requests.get('https://www.g2.com/products/mailchimp/reviews?filters%5Bcomment_answer_values%5D=&order=most_recent&page=1')
text = BeautifulSoup(response.text, 'html.parser')
num_reviews = 500
reviews = text.find('div', attrs={'class': 'paper paper--white paper--box mb-2 position-relative border-bottom '})
print(reviews)
You need to pass headers with the HTTP request. The site is detecting that you're not a browser; if you print out the text variable, you'll see that.
This is the parsed HTML you get:
...
<h1>Pardon Our Interruption...</h1>
<p>
As you were browsing something about your browser made us think you were a bot. There are a few reasons this might happen:
</p>
<ul>
<li>You're a power user moving through this website with super-human speed.</li>
<li>You've disabled JavaScript and/or cookies in your web browser.</li>
<li>A third-party browser plugin, such as Ghostery or NoScript, is preventing JavaScript from running. Additional information is available in this support article.</li>
...
So passing headers is enough to mimic browser activity.
To grab the headers, copy them from your browser's network tools, as described in the Explanation below.
Code Example
import requests
headers = {
'authority': 'www.g2.com',
'cache-control': 'max-age=0',
'upgrade-insecure-requests': '1',
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36',
'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'sec-fetch-site': 'none',
'sec-fetch-mode': 'navigate',
'sec-fetch-user': '?1',
'sec-fetch-dest': 'document',
'accept-language': 'en-US,en;q=0.9',
'cookie': '__cfduid=df6514ad701b86146978bf17180a5e6f01597144255; events_distinct_id=822bbff7-912d-4a5e-bd80-4364690b2e06; amplitude_session=1597144258387; _g2_session_id=424bfbe09b254b1a9484f50b70c3381c; reese84=3:BJ8QXTaIa+brQNrbReKzww==:n5v0tg/Q590u2q44+xAi7rnSO1i2Kn7Lp1Ar+2SCMJF5HiBJNqLVR3IPzPF0qIqgxpWjZ9veyhywY4JNSbBOtz5sJOwEecGJE9tT+NInof+vlP3hKTb6bqA3cvAf6cfDIrtEmhI0Dsjoe3ct3NtwvvcA9p8FXHPR7PAFP42nWqAAfDH88vj0hQwWlIjio/fT4g5iDsT1qZH3alC8ZbUhOURKNk9JUz2sBz+RjgkRyctO0VTGzjxmHCd2r40WJqWjVDwRmBl+/msW+/V0PW93vjFs45bMD63D5Q4JeRreBxkAN9ufIajaV0MmkYbxlFnwIZ3cEBHi/X76n+PvAobd5/UgCwgUIvt/P4pl7NEcDWR/ORaZ8gLPl4HbuQaRhEVd23Ez5OBnYFP1wjqLT/ECDkRzQq0Nn8U6qVbMO25Hp6U=:/JrPeXs0AKDQw5FlG3vKQX1dPIsF/TEXTLgQ+mktyAo=; ue-event-segment-983a43a0-1c10-4dfb-96d7-60049c0dcd62=W1siL3VzZXJzL2NvbnNlbnQvc2VsZWN0ZWQiLHsiY29uc2VudF90eXBlIjoi%0AY29va2llcyIsImdyYW50ZWQiOiJ0cnVlIn0sIjk4M2E0M2EwLTFjMTAtNGRm%0AYi05NmQ3LTYwMDQ5YzBkY2Q2MiIsIlVzZXIgQ29uc2VudCBTZWxlY3RlZCIs%0AWyJhbXBsaXR1ZGUiXV1d%0A',
'if-none-match': 'W/"3658e5098c91c183288fd70e6cfd9028"',
}
response = requests.get('https://www.g2.com/products/mailchimp/reviews', headers=headers)
text = BeautifulSoup(response.text, 'html.parser')
num_reviews = 500
reviews = text.select('div[class*="paper paper--white paper--box"]')
print(len(reviews))
Output
25
Explanation
Sometimes, in order to make an HTTP request, it's necessary to pass headers, a user agent, cookies, or parameters. You can play around with which of these are actually needed; I must admit I was lazy and just sent the entire set of headers. Essentially you're trying to mimic a browser request by using the requests package. Some sites are a bit more nuanced in detecting a bot.
Here I've inspected the page and gone to the browser's network tools. There's a tab called Doc. I then copied that request by right-clicking it and choosing Copy as cURL (bash). As I said, I'm lazy, so I pasted that into curl.trillworks.com, which converts it into a nice Python format along with the boilerplate for a request.
I've also modified your script slightly, since the class attribute you were matching on was quite long:
the CSS selector div[class*="..."] grabs any div whose class attribute contains the substring you specify.
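As an illustration of that selector (the HTML snippet here is made up purely to show the substring match; the real G2 page has many more classes on each review card):
from bs4 import BeautifulSoup
html = ('<div class="paper paper--white paper--box mb-2">review A</div>'
        '<div class="paper paper--white paper--box position-relative">review B</div>')
soup = BeautifulSoup(html, 'html.parser')
# div[class*="..."] matches any div whose class attribute contains the substring,
# so trailing utility classes (mb-2, position-relative, ...) no longer matter.
for div in soup.select('div[class*="paper paper--white paper--box"]'):
    print(div.get_text())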

How to do a reverse image search on Google by uploading an image URL?

My goal is to automate Google reverse image search.
I would like to upload an image URL and get all the website links that include the matching image.
So here is what I could produce so far:
import requests
import bs4
# Let's take a picture of Chicago
chicago = 'https://images.squarespace-cdn.com/content/v1/556e10f5e4b02ae09b8ce47d/1531155504475-KYOOS7EEGVDGMMUQQNX3/ke17ZwdGBToddI8pDm48kCf3-plT4th5YDY7kKLGSZN7gQa3H78H3Y0txjaiv_0fDoOvxcdMmMKkDsyUqMSsMWxHk725yiiHCCLfrh8O1z4YTzHvnKhyp6Da-NYroOW3ZGjoBKy3azqku80C789l0h8vX1l9k24HMAg-S2AFienIXE1YmmWqgE2PN2vVFAwNPldIHIfeNh3oAGoMooVv2g/Chi+edit-24.jpg'
# And let's take google image search uploader by url
googleimage = 'https://www.google.com/searchbyimage?&image_url='
# Here is our Chicago image url uploaded into google image search
url = googleimage+chicago
# And now let's request our Chicago Google image search
response = requests.get(url)
soup = bs4.BeautifulSoup(response.text,'html.parser')
# Here is the output
print(soup.prettify())
My problem is that I did not expect this print(soup.prettify()) output.
I am not including the output in the post because it's too long.
If you type in your browser:
https://www.google.com/searchbyimage?&image_url=https://images.squarespace-cdn.com/content/v1/556e10f5e4b02ae09b8ce47d/1531155504475-KYOOS7EEGVDGMMUQQNX3/ke17ZwdGBToddI8pDm48kCf3-plT4th5YDY7kKLGSZN7gQa3H78H3Y0txjaiv_0fDoOvxcdMmMKkDsyUqMSsMWxHk725yiiHCCLfrh8O1z4YTzHvnKhyp6Da-NYroOW3ZGjoBKy3azqku80C789l0h8vX1l9k24HMAg-S2AFienIXE1YmmWqgE2PN2vVFAwNPldIHIfeNh3oAGoMooVv2g/Chi+edit-24.jpg
You will see that the html code is way different from our output with soup.
I was expecting the soup code to have the final results so I can parse the links I need. Instead I only got some weird functions that I don't really understand.
It seems that google image search is a three step process: first you upload your image, then something happens with weird functions, then you get your final results.
How can I get my final results just like in my browser? So I can parse the html code like usual.
Let me explain.
Use print(response.history)
and print(response.url).
If the status is 200, you will get a URL such as https://www.google.com/search?tbs=sbi:
But if it's 302, you will get a URL such as https://www.google.com/webhp?tbs=sbi:
A 302 means that Google has detected you as a bot and therefore denied you via webhp (Web Hidden Path), which the request is converted to for robot detection and further analysis on Google's side.
You can confirm that by opening your request URL in the browser and checking what appears in the address bar.
This means you need to take care of the header part in order to be on the right track.
Use the following approach:
from bs4 import BeautifulSoup
import requests
headers = {
'Host': 'www.google.com',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:71.0) Gecko/20100101 Firefox/71.0',
'Accept': '*/*',
'Accept-Language': 'en-US,en;q=0.5',
'Accept-Encoding': 'gzip, deflate, br',
'Referer': 'https://www.google.com/',
'Origin': 'https://www.google.com',
'Connection': 'keep-alive',
'Content-Length': '0',
'TE': 'Trailers'
}
r = requests.get("https://www.google.com/searchbyimage?image_url=https://images.squarespace-cdn.com/content/v1/556e10f5e4b02ae09b8ce47d/1531155504475-KYOOS7EEGVDGMMUQQNX3/ke17ZwdGBToddI8pDm48kCf3-plT4th5YDY7kKLGSZN7gQa3H78H3Y0txjaiv_0fDoOvxcdMmMKkDsyUqMSsMWxHk725yiiHCCLfrh8O1z4YTzHvnKhyp6Da-NYroOW3ZGjoBKy3azqku80C789l0h8vX1l9k24HMAg-S2AFienIXE1YmmWqgE2PN2vVFAwNPldIHIfeNh3oAGoMooVv2g/Chi+edit-24.jpg&encoded_image=&image_content=&filename=&hl=en", headers=headers)
soup = BeautifulSoup(r.text, 'html.parser')
print(soup.prettify())
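As a quick sanity check tied to the explanation above (this just inspects the response object r from the request we already made):
# If Google redirected us to /webhp it flagged the request as a bot; with the
# headers above we expect to stay on the /search?tbs=sbi: results page.
print(r.status_code)   # 200 expected
print(r.history)       # intermediate redirect responses, if any
print(r.url)           # final URL after redirects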

Python: how to fill out a web form and get the resulting page source

I am trying to write a python script that will scrape http://www.fakenewsai.com/ and tell me whether or not a news article is fake news. I want the script to input a given news article into the website's url input field and hit the submit button. Then, I want to scrape the website to determine whether the article is "fake" or "real" news, as displayed on the website.
I was successful in accomplishing this using selenium and ChromeDriver, but the script was very slow (>2 minutes) and did not run on Heroku (using flask). For reference, here is the code I used:
from selenium import webdriver
import time

def fakeNews(url):
    if url.__contains__("https://"):
        url = url[8:-1]
    if url.__contains__("http://"):
        url = url[7:-1]
    browser = webdriver.Chrome("static/chromedriver.exe")
    browser.get("http://www.fakenewsai.com")
    element = browser.find_element_by_id("url")
    element.send_keys(url)
    button = browser.find_element_by_id("submit")
    button.click()
    time.sleep(1)
    site = "" + browser.page_source
    result = ""
    if site[site.index("opacity: 1") - 10] == "e":
        result = "Fake News"
    else:
        result = "Real News"
    browser.quit()
    return result

print(fakeNews('https://www.nytimes.com/2019/11/02/opinion/sunday/instagram-social-media.html'))
I have attempted to replicate this code using other python libraries, such as mechanicalsoup, pyppeteer, and scrapy. However, as a beginner at python, I have not found much success. I was hoping someone could point me in the right direction with a solution.
For the stated purpose, in my opinion it would be much simpler to analyze the website, understand its functionality, and then automate the browser's behavior instead of the user's behavior.
Hit F12 in your browser while on the website, open the Network tab, paste a URL into the input box and hit submit; you will see that it sends an HTTP OPTIONS request and then a POST request to a URL, and the server returns a JSON response as the result.
So you can use Python's requests module (docs) to automate that very POST request, instead of writing complex code that simulates clicks and scrapes the result.
A very simple example you can build on is:
import json
import requests

def fake_news():
    url = 'https://us-central1-fake-news-ai.cloudfunctions.net/detect/'
    payload = {'url': 'https://www.nytimes.com/'}
    headers = {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate, br', 'Accept-Language': 'en-US,en;q=0.5',
               'Connection': 'keep-alive', 'Content-Length': '103', 'Content-type': 'application/json; charset=utf-8',
               'DNT': '1', 'Host': 'us-central1-fake-news-ai.cloudfunctions.net', 'Origin': 'http://www.fakenewsai.com',
               'Referer': 'http://www.fakenewsai.com/', 'TE': 'Trailers',
               'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0'}
    response_json = requests.post(url, data=json.dumps(payload), headers=headers).text
    response = json.loads(response_json)
    is_fake = int(response['fake'])
    if is_fake == 0:
        print("Not fake")
    elif is_fake == 1:
        print("Fake")
    else:
        print("Invalid response from server")

if __name__ == "__main__":
    fake_news()
PS: It would be fair to contact the owner of the website to discuss using his or her infrastructure for your project.
The main slowdown comes from starting a Chrome browser and loading the first URL.
Note that you are launching a browser for every request.
Instead, you can launch the browser once during an initialization step and only perform the automation parts per request.
This will greatly improve performance.
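A rough sketch of that idea, reusing the question's own element IDs (the waiting and result scraping are elided):
from selenium import webdriver

# Launch the browser once, at application startup, instead of once per request.
browser = webdriver.Chrome("static/chromedriver.exe")

def fakeNews(url):
    # Reuse the already-running browser for every lookup.
    browser.get("http://www.fakenewsai.com")
    browser.find_element_by_id("url").send_keys(url)
    browser.find_element_by_id("submit").click()
    # ... wait and scrape the result from browser.page_source as before ...
    return browser.page_source

# Call browser.quit() only once, when the application shuts down.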

How to bypass the DDoS attack check on a particular site in Python?

I am trying to scrape this site, but while getting the data it runs a DDoS check on me: it checks for about 5 seconds and then redirects to the same URL. The page opens fine in a normal browser, but when I request the same thing in Python it just returns the DDoS check-up page. Is there any way I can bypass that, or any workaround?
This is my code (thanks :)):
import requests
from urllib2 import build_opener
import time
import json
url = 'https://www.masterani.me/api/anime/63-naruto-shippuuden/detailed'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11','Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
'Accept-Encoding': 'none',
'Accept-Language': 'en-US,en;q=0.8',
'Connection': 'keep-alive'}
page = requests.get(url, headers = headers)
print page.text
Using a headless browser will work. Use PhantomJS with the Selenium webdriver to scrape such sites, or ones which use AJAX to load content.
I found these links useful.
https://www.guru99.com/selenium-python.html
https://vocuzi.in/blog/preventing-website-web-scrapers/
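For instance, a minimal sketch assuming an older Selenium release where the PhantomJS driver is still available (newer Selenium versions drop it in favour of headless Chrome/Firefox):
from selenium import webdriver
import time

# PhantomJS executes JavaScript, so the DDoS check page can complete itself.
driver = webdriver.PhantomJS()  # assumes the phantomjs binary is on PATH
driver.get('https://www.masterani.me/api/anime/63-naruto-shippuuden/detailed')
time.sleep(6)  # give the ~5 second check time to finish and redirect
print(driver.page_source)
driver.quit()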
Anti-DDoS solutions usually take various parameters into account when inspecting a request's validity. For example, your geographical location may be a huge factor: when trying to reproduce your issue I got a 200 response, meaning the anti-DDoS check decided to allow my code to access the site.
I would suggest using a VPN/proxy service such as this one, or, if this is a system destined for production, a paid service, as these are much more reliable. Note that some services are robust enough to block many proxy IPs as well.
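If you go the proxy route, requests accepts a proxies mapping; a minimal sketch (the proxy host and credentials below are placeholders for your provider's details):
import requests

# Placeholder proxy endpoint; substitute your VPN/proxy provider's details.
proxies = {
    'http': 'http://user:password@proxy.example.com:8080',
    'https': 'http://user:password@proxy.example.com:8080',
}
url = 'https://www.masterani.me/api/anime/63-naruto-shippuuden/detailed'
page = requests.get(url, proxies=proxies, timeout=30)
print(page.status_code)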

Unable to parse a part of a webpage that is visible when opened in a browser

I have a strange problem parsing the Herald Sun webpage to get its list of RSS feeds. When I look at the page in the browser, I can see the links with titles. However, when I use Python and Beautiful Soup to parse the page, the response does not even contain the section I would like to parse.
import urllib.request
import urllib.error

hdr = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9) AppleWebKit/537.71 (KHTML, like Gecko) Version/7.0 Safari/537.71',
       'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
       'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
       'Accept-Encoding': 'none',
       'Accept-Language': 'en-US,en;q=0.8',
       'Connection': 'keep-alive'}
req = urllib.request.Request("http://www.heraldsun.com.au/help/rss", headers=hdr)
try:
    page = urllib.request.urlopen(req)
except urllib.error.HTTPError as e:
    print(e.fp.read())
html_doc = page.read()
f = open("Temp/original.html", 'w')
f.write(html_doc.decode('utf-8'))
As you can check, the written file does not contain the results, so obviously Beautiful Soup has nothing to work with here.
I wonder how the webpage enables this protection and how to overcome it. Thanks.
For commercial use, read the terms of service first.
There is really not that much information the server knows about who is making the request:
the IP, the User-Agent, or cookies. Also, urllib2 will not grab information that is generated by JavaScript.
JavaScript or Not?
(1) You need to open the Chrome developer tools, disable the cache and JavaScript, and make sure that you can still see the information that you want. If you cannot see it there, you have to use a tool that supports JavaScript, like Selenium or PhantomJS.
However, in this case, your website does not look that sophisticated.
User-Agent? Cookie?
(2) Then the problem comes down to tuning the User-Agent or the cookies. As you have already tried, the user agent alone is not enough, so it is the cookie that will do the trick.
As you can see, the first page call actually returns "temporarily unavailable", and in the network panel you need to click the rss HTML request that comes back with a 200 return code. Just copy the user-agent and cookie from there and it will work.
Here is the code showing how to add the cookie using urllib2:
import urllib2, bs4, re

opener = urllib2.build_opener()
opener.addheaders = [("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.57 Safari/537.36")]
# I omitted the cookie here and you need to copy and paste your own
opener.addheaders.append(('Cookie', 'act-bg-i...eat_uuniq=1; criteo=; pl=true'))
soup = bs4.BeautifulSoup(opener.open("http://www.heraldsun.com.au/help/rss"))
div = soup.find('div', {"id": "content-2"}).find('div', {"class": "group-content"})
for a in div.find_all('a'):
    try:
        if 'feeds.news' in a['href']:
            print a
    except:
        pass
And here are the outputs:
Breaking News
Top Stories
World News
Victoria and National News
Sport News
...
The site could very likely be serving different content, depending on the User-Agent string in the headers. Websites will often do this for mobile browsers, for example.
Since you're not specifying one, urllib is going to use its default:
By default, the URLopener class sends a User-Agent header of urllib/VVV, where VVV is the urllib version number.
You could try spoofing a common User-Agent string by following the advice in this question; see What's My User Agent? to find the one your browser sends.
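A minimal sketch of that with urllib.request (the User-Agent value is just an example of a common browser string):
import urllib.request

# urllib's default request identifies itself as Python-urllib, which many sites
# treat as a bot; overriding it with a browser-like string often changes the response.
req = urllib.request.Request(
    "http://www.heraldsun.com.au/help/rss",
    headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                           "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0 Safari/537.36"},
)
with urllib.request.urlopen(req) as resp:
    print(resp.status)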
