My goal is to automate google reverse image search.
I would like to upload an image url and get all the website links that include the matching image.
So here is what I could produce so far:
import requests
import bs4
# Let's take a picture of Chicago
chicago = 'https://images.squarespace-cdn.com/content/v1/556e10f5e4b02ae09b8ce47d/1531155504475-KYOOS7EEGVDGMMUQQNX3/ke17ZwdGBToddI8pDm48kCf3-plT4th5YDY7kKLGSZN7gQa3H78H3Y0txjaiv_0fDoOvxcdMmMKkDsyUqMSsMWxHk725yiiHCCLfrh8O1z4YTzHvnKhyp6Da-NYroOW3ZGjoBKy3azqku80C789l0h8vX1l9k24HMAg-S2AFienIXE1YmmWqgE2PN2vVFAwNPldIHIfeNh3oAGoMooVv2g/Chi+edit-24.jpg'
# And let's take google image search uploader by url
googleimage = 'https://www.google.com/searchbyimage?&image_url='
# Here is our Chicago image url uploaded into google image search
url = googleimage+chicago
# And now let's request our Chicago google image search
response = requests.get(url)
soup = bs4.BeautifulSoup(response.text,'html.parser')
# Here is the output
print(soup.prettify())
My problem is that I did not expect this print(soup.prettify()) output.
I am not including the output in the post because it's too long.
If you type in your browser:
https://www.google.com/searchbyimage?&image_url=https://images.squarespace-cdn.com/content/v1/556e10f5e4b02ae09b8ce47d/1531155504475-KYOOS7EEGVDGMMUQQNX3/ke17ZwdGBToddI8pDm48kCf3-plT4th5YDY7kKLGSZN7gQa3H78H3Y0txjaiv_0fDoOvxcdMmMKkDsyUqMSsMWxHk725yiiHCCLfrh8O1z4YTzHvnKhyp6Da-NYroOW3ZGjoBKy3azqku80C789l0h8vX1l9k24HMAg-S2AFienIXE1YmmWqgE2PN2vVFAwNPldIHIfeNh3oAGoMooVv2g/Chi+edit-24.jpg
You will see that the html code is way different from our output with soup.
I was expecting the soup code to have the final results so I can parse the links I need. Instead I only got some weird functions that I don't really understand.
It seems that google image search is a three step process: first you upload your image, then something happens with weird functions, then you get your final results.
How can I get my final results just like in my browser? So I can parse the html code like usual.
Let me explain.
Use print(response.history)
and print(response.url)
If it's 200, you will end up on a URL such as https://www.google.com/search?tbs=sbi:
But if it's 302, you will end up on a URL such as https://www.google.com/webhp?tbs=sbi:
A 302 means that Google has detected you as a bot and redirected the request to webhp instead of the results page.
You can confirm this by opening your link in a browser and checking what appears in the address bar.
This means you need to send browser-like headers to be on the right track.
Use the following approach.
from bs4 import BeautifulSoup
import requests
headers = {
'Host': 'www.google.com',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:71.0) Gecko/20100101 Firefox/71.0',
'Accept': '*/*',
'Accept-Language': 'en-US,en;q=0.5',
'Accept-Encoding': 'gzip, deflate, br',
'Referer': 'https://www.google.com/',
'Origin': 'https://www.google.com',
'Connection': 'keep-alive',
'Content-Length': '0',
'TE': 'Trailers'
}
r = requests.get("https://www.google.com/searchbyimage?image_url=https://images.squarespace-cdn.com/content/v1/556e10f5e4b02ae09b8ce47d/1531155504475-KYOOS7EEGVDGMMUQQNX3/ke17ZwdGBToddI8pDm48kCf3-plT4th5YDY7kKLGSZN7gQa3H78H3Y0txjaiv_0fDoOvxcdMmMKkDsyUqMSsMWxHk725yiiHCCLfrh8O1z4YTzHvnKhyp6Da-NYroOW3ZGjoBKy3azqku80C789l0h8vX1l9k24HMAg-S2AFienIXE1YmmWqgE2PN2vVFAwNPldIHIfeNh3oAGoMooVv2g/Chi+edit-24.jpg&encoded_image=&image_content=&filename=&hl=en", headers=headers)
soup = BeautifulSoup(r.text, 'html.parser')
print(soup.prettify())
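To tell the two cases above apart in code, one option is to look at the path of the final URL. This is only a sketch: the helper names build_search_url and classify_final_url are made up for illustration, and the /search versus /webhp distinction is taken from the explanation above, not from any Google documentation. It also URL-encodes the image address so that characters such as + and & in it stay part of the image_url parameter.

```python
from urllib.parse import urlencode, urlparse

SEARCH_BY_IMAGE = 'https://www.google.com/searchbyimage'


def build_search_url(image_url):
    # urlencode percent-encodes the image address so it stays a single parameter
    return SEARCH_BY_IMAGE + '?' + urlencode({'image_url': image_url})


def classify_final_url(final_url):
    # Assumption from the answer above: /search means results, /webhp means blocked
    path = urlparse(final_url).path
    if path == '/search':
        return 'results'
    if path == '/webhp':
        return 'webhp'
    return 'other'


if __name__ == '__main__':
    print(build_search_url('https://example.com/a b.jpg'))
    print(classify_final_url('https://www.google.com/webhp?tbs=sbi:'))
```

With requests, you would call classify_final_url(r.url) after the request, since r.url holds the address reached once redirects have been followed.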
I am facing an issue with Google's cookie agreement page when scraping a Google redirect URL.
I am trying to scrape different pages from Google News URLs, but when I run this code:
req = requests.get(url,headers=headers)
with "headers" = {'User-Agent': 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_8; en-US) AppleWebKit/534.1 (KHTML, like Gecko) Chrome/6.0.422.0 Safari/534.1', 'Upgrade-Insecure-Requests': '1', 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', 'DNT': '1', 'Accept-Encoding': 'gzip, deflate', 'Accept-Language': 'it-IT'}
and for example URL = https://news.google.com/./articles/CAIiEMb3PYSjFFVbudiidQPL79QqGQgEKhAIACoHCAow-ImTCzDRqagDMKiIvgY?hl=it&gl=IT&ceid=IT%3Ait
req.content is the HTML of Google's cookie agreement page.
I have also tried converting the redirect link into a normal link, but the response gives me the redirect link to this
I have the same problem as in this question (How can I bypass a cookie agreement page while web scraping using Python?).
However, the solution proposed there works only for that specific site.
Note: the entire code worked until a few weeks ago.
I solved the problem by adding the line
'Cookie':'CONSENT=YES+cb.20210418-17-p0.it+FX+917; '
to the request header.
The page returned by the request is still a Google page, but it contains the link to the site from which the request originated.
So, once I got the page, I did some more scraping to extract that link and then made the request I actually wanted.
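A rough sketch of those two steps, using only the standard library so it runs without extra packages. The cookie value is the one quoted above. The helper names, and the idea that the wanted link is the first one pointing outside google.com, are assumptions for illustration; the real page layout may differ.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

CONSENT_COOKIE = 'CONSENT=YES+cb.20210418-17-p0.it+FX+917; '  # value quoted above


def add_consent_cookie(headers):
    # Return a copy of the headers with the consent cookie added
    merged = dict(headers)
    merged['Cookie'] = CONSENT_COOKIE
    return merged


class _LinkCollector(HTMLParser):
    # Collects the href of every <a> tag it sees
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href:
                self.links.append(href)


def first_external_link(html, own_domain='google.com'):
    # Hypothetical heuristic: first absolute link whose host is not on own_domain
    collector = _LinkCollector()
    collector.feed(html)
    for href in collector.links:
        host = urlparse(href).hostname or ''
        if host and host != own_domain and not host.endswith('.' + own_domain):
            return href
    return None
```

The merged dictionary would be passed as headers= to requests.get, and first_external_link(req.text) would then give the address for the follow-up request.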
I am trying to get a list of reviews from the G2 website using BeautifulSoup. However, for some reason, when I run the code below, it says that 'reviews' is 'NoneType'. I can't figure this out because it clearly shows the class name in the HTML from the website (see the picture below). I have used this exact syntax to webscrape from other sites and it has worked so I have no idea why it is returning NoneType. I tried to use 'find_all' and return the length of the list (number of reviews), but that also shows nonetype. I am super confused. Please help!
response = requests.get('https://www.g2.com/products/mailchimp/reviews?filters%5Bcomment_answer_values%5D=&order=most_recent&page=1')
text = BeautifulSoup(response.text, 'html.parser')
num_reviews = 500
reviews = text.find('div', attrs={'class': 'paper paper--white paper--box mb-2 position-relative border-bottom '})
print(reviews)
You need to pass headers with the HTTP request. The site is detecting that you're not a browser; if you print the variable text, you'll see that.
Parsed HTML you get
...
<h1>Pardon Our Interruption...</h1>
<p>
As you were browsing something about your browser made us think you were a bot. There are a few reasons this might happen:
</p>
<ul>
<li>You're a power user moving through this website with super-human speed.</li>
<li>You've disabled JavaScript and/or cookies in your web browser.</li>
<li>A third-party browser plugin, such as Ghostery or NoScript, is preventing JavaScript from running. Additional information is available in this support article.</li>
...
So passing headers is enough to mimic browser activity.
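Before parsing, it can help to check whether the response is this block page at all. A minimal sketch, assuming the block page always contains the heading shown above (the marker list and the function name looks_blocked are hypothetical):

```python
BLOCK_MARKERS = ('Pardon Our Interruption',)  # heading from the block page above


def looks_blocked(html):
    # True when the body contains any known block-page marker
    return any(marker in html for marker in BLOCK_MARKERS)


if __name__ == '__main__':
    print(looks_blocked('<h1>Pardon Our Interruption...</h1>'))
```

Calling looks_blocked(response.text) right after the request makes a None result from find easier to diagnose.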
To grab the headers
Code Example
import requests
from bs4 import BeautifulSoup
headers = {
'authority': 'www.g2.com',
'cache-control': 'max-age=0',
'upgrade-insecure-requests': '1',
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36',
'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'sec-fetch-site': 'none',
'sec-fetch-mode': 'navigate',
'sec-fetch-user': '?1',
'sec-fetch-dest': 'document',
'accept-language': 'en-US,en;q=0.9',
'cookie': '__cfduid=df6514ad701b86146978bf17180a5e6f01597144255; events_distinct_id=822bbff7-912d-4a5e-bd80-4364690b2e06; amplitude_session=1597144258387; _g2_session_id=424bfbe09b254b1a9484f50b70c3381c; reese84=3:BJ8QXTaIa+brQNrbReKzww==:n5v0tg/Q590u2q44+xAi7rnSO1i2Kn7Lp1Ar+2SCMJF5HiBJNqLVR3IPzPF0qIqgxpWjZ9veyhywY4JNSbBOtz5sJOwEecGJE9tT+NInof+vlP3hKTb6bqA3cvAf6cfDIrtEmhI0Dsjoe3ct3NtwvvcA9p8FXHPR7PAFP42nWqAAfDH88vj0hQwWlIjio/fT4g5iDsT1qZH3alC8ZbUhOURKNk9JUz2sBz+RjgkRyctO0VTGzjxmHCd2r40WJqWjVDwRmBl+/msW+/V0PW93vjFs45bMD63D5Q4JeRreBxkAN9ufIajaV0MmkYbxlFnwIZ3cEBHi/X76n+PvAobd5/UgCwgUIvt/P4pl7NEcDWR/ORaZ8gLPl4HbuQaRhEVd23Ez5OBnYFP1wjqLT/ECDkRzQq0Nn8U6qVbMO25Hp6U=:/JrPeXs0AKDQw5FlG3vKQX1dPIsF/TEXTLgQ+mktyAo=; ue-event-segment-983a43a0-1c10-4dfb-96d7-60049c0dcd62=W1siL3VzZXJzL2NvbnNlbnQvc2VsZWN0ZWQiLHsiY29uc2VudF90eXBlIjoi%0AY29va2llcyIsImdyYW50ZWQiOiJ0cnVlIn0sIjk4M2E0M2EwLTFjMTAtNGRm%0AYi05NmQ3LTYwMDQ5YzBkY2Q2MiIsIlVzZXIgQ29uc2VudCBTZWxlY3RlZCIs%0AWyJhbXBsaXR1ZGUiXV1d%0A',
'if-none-match': 'W/"3658e5098c91c183288fd70e6cfd9028"',
}
response = requests.get('https://www.g2.com/products/mailchimp/reviews', headers=headers)
text = BeautifulSoup(response.text, 'html.parser')
num_reviews = 500
reviews = text.select('div[class*="paper paper--white paper--box"]')
print(len(reviews))
Output
25
Explanation
Sometimes, for an HTTP request to succeed, it's necessary to pass headers, a user agent, cookies, or parameters. You can play about with this; I must admit I was lazy and just sent the entire set of headers. Essentially you're trying to mimic a browser request by using the requests package. Sometimes bot detection is a bit more nuanced than that.
Here I've inspected the page and gone to the network tools. There's a tab called Doc. I then copied that request by right-clicking it and choosing Copy as cURL (bash). As I said, I'm lazy, so I pasted that into curl.trillworks.com, which converts it into a nice Python format as well as the boilerplate for a request.
I've modified your script slightly, as it was quite a long attribute.
The CSS selector div[class*="..."] grabs any div whose class attribute contains the string you specify.
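To show what the substring test means, here is a small standard-library sketch that counts div tags the same way: by checking whether the raw class attribute contains a given string. It is not how BeautifulSoup implements select, and the class and function names are made up for the example.

```python
from html.parser import HTMLParser


class ClassSubstringCounter(HTMLParser):
    # Counts <div> tags whose raw class attribute contains `needle`
    def __init__(self, needle):
        super().__init__()
        self.needle = needle
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == 'div':
            class_value = dict(attrs).get('class') or ''
            if self.needle in class_value:
                self.count += 1


def count_divs(html, needle):
    counter = ClassSubstringCounter(needle)
    counter.feed(html)
    return counter.count


if __name__ == '__main__':
    sample = (
        '<div class="paper paper--white paper--box mb-2 border-bottom">a</div>'
        '<div class="paper paper--white paper--box">b</div>'
        '<div class="other">c</div>'
    )
    print(count_divs(sample, 'paper paper--white paper--box'))
```

A shorter substring matches more elements, which is why the long attribute with its trailing space is not needed.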
So I'm trying to scrape the open positions on this site, and when I use any type of requests library (currently trying requests-html), the response doesn't show everything that's in the HTML.
# Import libraries
import time
from bs4 import BeautifulSoup
from requests_html import HTMLSession
# Set the URL you want to webscrape from
url = 'https://germanamerican.csod.com/ux/ats/careersite/5/home?c=germanamerican'
session = HTMLSession()
# Connect to the URL
response = session.get(url)
response.html.render()
# Parse HTML and save to BeautifulSoup object
soup = BeautifulSoup(response.text, "html5lib")
b = soup.findAll('a')
Not sure where to go. Originally thought the problem was due to javascript rendering but this is not working.
The issue is that the initial GET doesn't return the data (which I assume is the job listings). The JS that does fetch it uses a POST with an authorization token in the header. You need to get this token and then make the POST to get the data.
This token appears to be dynamic, so getting it is a little wonky, but doable.
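import json
from bs4 import BeautifulSoup as bs
from requests_html import HTMLSession
# Page that embeds the token, and the search endpoint the page's JS calls
url0=r'https://germanamerican.csod.com/ux/ats/careersite/5/home?c=germanamerican'
url=r'https://germanamerican.csod.com/services/x/career-site/v1/search'
s=HTMLSession()
r=s.get(url0)
print(r.status_code)
r.html.render()
soup=bs(r.text,'html.parser')
scripts=soup.find_all('script')
# Find the inline script that assigns csod.context
for script in scripts:
    if 'csod.context=' in script.text:
        x=script
# Drop the assignment prefix and the semicolons so only the JSON remains
j=json.loads(x.text.replace('csod.context=','').replace(';',''))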
payload={
'careerSiteId': 5,
'cities': [],
'countryCodes': [],
'cultureId': 1,
'cultureName': "en-US",
'customFieldCheckboxKeys': [],
'customFieldDropdowns': [],
'customFieldRadios': [],
'pageNumber': 1,
'pageSize': 25,
'placeID': "",
'postingsWithinDays': None,
'radius': None,
'searchText': "",
'states': []
}
headers={
'accept': 'application/json; q=1.0, text/*; q=0.8, */*; q=0.1',
'accept-encoding': 'gzip, deflate, br',
'accept-language': 'en-US,en;q=0.9',
'authorization': 'Bearer '+j['token'],
'cache-control': 'no-cache',
'content-length': '272',
'content-type': 'application/json',
'csod-accept-language': 'en-US',
'origin': 'https://germanamerican.csod.com',
'referer': 'https://germanamerican.csod.com/ux/ats/careersite/5/home?c=germanamerican',
'sec-fetch-mode': 'cors',
'sec-fetch-site': 'same-origin',
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36',
'x-requested-with': 'XMLHttpRequest'
}
r=s.post(url,headers=headers,json=payload)
print(r.status_code)
print(r.json())
The r.json() that's printed out is a nicely formatted JSON version of the table of job listings.
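One fragile spot in the approach above is stripping every semicolon, which would corrupt the JSON if a value ever contained one. A possible alternative, sketched here with a made-up helper name, is to let the standard library's JSON decoder read exactly one object starting right after the csod.context= marker:

```python
import json


def extract_context(script_text, marker='csod.context='):
    # Locate the assignment, then decode one JSON value and ignore the rest
    start = script_text.index(marker) + len(marker)
    remainder = script_text[start:].lstrip()
    obj, _end = json.JSONDecoder().raw_decode(remainder)
    return obj


if __name__ == '__main__':
    demo = 'var a=1;csod.context={"token":"abc;def","n":1};other()'
    print(extract_context(demo)['token'])
```

In the script above this would replace the json.loads line, as j = extract_context(x.text). It raises ValueError if the marker is missing.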
I don't think it's possible to scrape that website with Requests.
I would suggest using Selenium or Scrapy.
Welcome to SO!
Unfortunately, you won't be able to scrape that page with requests (or requests_html or similar libraries) because you need a tool that handles dynamic, i.e. JavaScript-based, pages.
With Python, I would strongly suggest Selenium and its webdriver. Below is a piece of code that prints the desired output, i.e. all listed jobs (NB: it requires Selenium and the Firefox webdriver to be installed and on the PATH).
# Import libraries
from bs4 import BeautifulSoup
from selenium import webdriver
# Set the URL you want to webscrape from
url = 'https://germanamerican.csod.com/ux/ats/careersite/5/home?c=germanamerican'
browser = webdriver.Firefox() # initialize the webdriver. I use FF, might be Chromium or else
browser.get(url) # go to the desired page. You might want to wait a bit in case of slow connection
page = browser.page_source # this is the page source, now full with the listings that have been uploaded
soup = BeautifulSoup(page, "lxml")
jobs = soup.findAll('a', {'data-tag' : 'displayJobTitle'})
for j in jobs:
    print(j.text)
browser.quit()
I've been using BeautifulSoup to scrape Amazon for data on products.
The full program had been working fine, up until it gave me this error message:
price = soup.find(id="priceblock_ourprice").get_text()
AttributeError: 'NoneType' object has no attribute 'get_text'
When it does this, I find that every version and program I have that uses BeautifulSoup also gives the same failure, even those that are unchanged since they were last working. When one of them works again, all of them begin working until all programs fail again.
This includes new programs that I write or others that I try out.
It's had me rather confused, rewriting the syntax to try to find the problem. At one point I thought it was the headers, as changing them initially got it working again, but then it stopped shortly after.
def check_price(URL, headers):
    page = requests.get(URL, headers=headers)
    soup = BeautifulSoup(page.content, 'html.parser')
    price = soup.find(id="priceblock_ourprice").get_text()
    converted_price = price[:-3]# -3 removes the .99 pence value from product
    float_price = ''
    for c in converted_price:
        if c.isdigit():
            float_price = float_price + c
    #loop that removes the £$,. from product so the string can convert to float
    return float(float_price)
An example URL would be: https://www.amazon.co.uk/Sony-ILCE7M3B-Mirrorless-Compact-System/dp/B07B4L1PQ8/ref=sr_1_fkmr1_1?keywords=sony+camera+ilce-7m3+6000+alpha&qid=1574887164&sr=8-1-fkmr1
Thanks!
According to your question:
Be aware that Amazon does not allow automated access to its data. You can double-check this by inspecting the response via r.status_code; the response body may contain this message:
To discuss automated access to Amazon data please contact api-services-support@amazon.com
That's why your script might work sometimes and sometimes not.
Therefore you can use the Amazon API, or you can pass proxies to the GET request via the proxies argument.
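For the proxy route, requests expects proxies as a dictionary mapping a scheme to a proxy URL. The sketch below cycles through a list and yields one such dictionary per request; the function name and the proxy addresses are placeholders, not working proxies.

```python
from itertools import cycle


def proxy_dicts(proxy_urls):
    # Yield a requests-style proxies mapping for each proxy in turn, forever
    for proxy in cycle(proxy_urls):
        yield {'http': proxy, 'https': proxy}


if __name__ == '__main__':
    # Placeholder addresses, for illustration only
    rotation = proxy_dicts(['http://10.0.0.1:8080', 'http://10.0.0.2:8080'])
    print(next(rotation))
    print(next(rotation))
```

Each request would then use requests.get(url, headers=headers, proxies=next(rotation)).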
I've corrected some mistakes in your code and simplified it.
import requests
from bs4 import BeautifulSoup
headers = {
'Host': 'www.amazon.co.uk',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language': 'en-US,en;q=0.5',
'Accept-Encoding': 'gzip, deflate, br',
'Connection': 'keep-alive',
'Upgrade-Insecure-Requests': '1',
'TE': 'Trailers'
}
r = requests.get('https://www.amazon.co.uk/Sony-ILCE7M3B-Mirrorless-Compact-System/dp/B07B4L1PQ8/ref=sr_1_fkmr1_1?keywords=sony+camera+ilce-7m3+6000+alpha&qid=1574887164&sr=8-1-fkmr1', headers=headers)
soup = BeautifulSoup(r.text, 'html.parser')
for item in soup.findAll('span', attrs={'class': 'a-size-medium a-color-price priceBlockBuyingPriceString'}):
    item = str(item.text[1:]).split('.')[0]
    print(item)
Output:
1,753
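Since the price element is missing whenever the request is blocked, it may be worth separating "find the element" from "turn its text into a number", so that a missing element gives None instead of an AttributeError. A minimal sketch with a hypothetical helper name; unlike the original function, it keeps the pence instead of cutting them off:

```python
import re


def parse_price(text):
    # Return the price as a float, or None when there is no price text
    if not text:
        return None
    match = re.search(r'\d[\d,]*(?:\.\d+)?', text)
    if match is None:
        return None
    # Remove thousands separators before converting
    return float(match.group().replace(',', ''))


if __name__ == '__main__':
    print(parse_price('£1,753.99'))
    print(parse_price(None))
```

With BeautifulSoup this could be used as: element = soup.find(id="priceblock_ourprice"), then parse_price(element.get_text() if element else None).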
I am trying to write a python script that will scrape http://www.fakenewsai.com/ and tell me whether or not a news article is fake news. I want the script to input a given news article into the website's url input field and hit the submit button. Then, I want to scrape the website to determine whether the article is "fake" or "real" news, as displayed on the website.
I was successful in accomplishing this using selenium and ChromeDriver, but the script was very slow (>2 minutes) and did not run on Heroku (using flask). For reference, here is the code I used:
from selenium import webdriver
import time
def fakeNews(url):
    if url.__contains__("https://"):
        url = url[8:-1]
    if url.__contains__("http://"):
        url = url[7:-1]
    browser = webdriver.Chrome("static/chromedriver.exe")
    browser.get("http://www.fakenewsai.com")
    element = browser.find_element_by_id("url")
    element.send_keys(url)
    button = browser.find_element_by_id("submit")
    button.click()
    time.sleep(1)
    site = "" + browser.page_source
    result = ""
    if(site[site.index("opacity: 1")-10] == "e"):
        result = "Fake News"
    else:
        result = "Real News"
    browser.quit()
    return result
print(fakeNews('https://www.nytimes.com/2019/11/02/opinion/sunday/instagram-social-media.html'))
I have attempted to replicate this code using other python libraries, such as mechanicalsoup, pyppeteer, and scrapy. However, as a beginner at python, I have not found much success. I was hoping someone could point me in the right direction with a solution.
For the stated purpose, in my opinion it would be much simpler to analyze the website, understand its functionality, and then automate the browser's behavior instead of the user's behavior.
Hit F12 in your browser while on the website, open the Network tab, paste a URL into the input box, and hit submit. You will see that it sends an HTTP OPTIONS request and then a POST request to a URL. The server then returns a JSON response as a result.
So, you can use Python's requests module to automate that POST request instead of having very complex code that simulates clicks and scrapes the result.
A very simple example you can build on is:
import json
import requests
def fake_news():
    url = 'https://us-central1-fake-news-ai.cloudfunctions.net/detect/'
    payload = {'url': 'https://www.nytimes.com/'}
    headers = {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate, br', 'Accept-Language': 'en-US,en;q=0.5',
               'Connection': 'keep-alive', 'Content-Length': '103', 'Content-type': 'application/json; charset=utf-8',
               'DNT': '1', 'Host': 'us-central1-fake-news-ai.cloudfunctions.net', 'Origin': 'http://www.fakenewsai.com',
               'Referer': 'http://www.fakenewsai.com/', 'TE': 'Trailers',
               'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0'}
    response_json = requests.post(url, data=json.dumps(payload), headers=headers).text
    response = json.loads(response_json)
    is_fake = int(response['fake'])
    if is_fake == 0:
        print("Not fake")
    elif is_fake == 1:
        print("Fake")
    else:
        print("Invalid response from server")
if __name__ == "__main__":
    fake_news()
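To get back the same return values as the original Selenium function, the response handling could be pulled into a small function that returns a label instead of printing. This is a sketch that assumes the reply keeps the 'fake' field with 0 or 1 as used above; the function name and the fallback message are illustrative choices.

```python
import json


def interpret_response(response_text):
    # Map the service's JSON reply onto the labels used in the question
    try:
        is_fake = int(json.loads(response_text)['fake'])
    except (ValueError, KeyError, TypeError):
        # Not JSON, no 'fake' field, or a value that is not a number
        return 'Invalid response from server'
    if is_fake == 1:
        return 'Fake News'
    if is_fake == 0:
        return 'Real News'
    return 'Invalid response from server'


if __name__ == '__main__':
    print(interpret_response('{"fake": 1}'))
```

It takes the .text of the POST response, so a Flask view can return the label directly.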
PS: It would be fair to contact the owner of the website to discuss using his or her infrastructure for your project.
The main slowdown occurs when starting the Chrome browser and loading the first URL.
Note that you are launching a browser for each request.
You can launch a browser on the initialization step and only do the automation parts per request.
This will greatly increase the performance.
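One way to structure that, sketched with a made-up class name: keep a single holder object that creates the browser on first use and hands back the same instance afterwards. The factory is passed in, so the holder itself does not depend on Selenium.

```python
class BrowserHolder:
    # Creates the browser lazily and reuses it for later requests
    def __init__(self, factory):
        self._factory = factory
        self._browser = None

    def get(self):
        # Start the browser only the first time it is needed
        if self._browser is None:
            self._browser = self._factory()
        return self._browser

    def close(self):
        # Quit the browser if one was started, e.g. at application shutdown
        if self._browser is not None:
            self._browser.quit()
            self._browser = None
```

In the question's code the holder would be created once at startup, for example BrowserHolder(lambda: webdriver.Chrome("static/chromedriver.exe")), and fakeNews would call holder.get() instead of starting and quitting Chrome on every call.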