This is for a web scraping project I'm working on.
I need to send the response token of this reCAPTCHA v2 along with the form, but the POST isn't bringing back the data I need:
import re
import requests

headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36'
}
url = 'https://www2.detran.rn.gov.br/externo/consultarveiculo.asp'

session = requests.Session()
fazer_get = session.get(url, headers=headers)
cookie = fazer_get.cookies
html = fazer_get.text

# Extract the reCAPTCHA site key from the page
try:
    rgxCaptchaKey = re.search(r'<div\s*class="g-recaptcha"\s*data-sitekey="([^\"]*?)"></div>', html, re.IGNORECASE)
    captchaKey = rgxCaptchaKey.group(1)
except AttributeError:
    print('erro')

# captcha() is my external solver helper; KEY is the solver API key
resposta_captcha = captcha(captchaKey, url, KEY)

placa = 'pcj90'
renavam = '57940'
payload = {
    'oculto': 'AvancarC',
    'placa': placa,
    'renavam': renavam,
    'g-recaptcha-response': resposta_captcha['code'],
    'btnConsultaPlaca': ''
}
fazerPost = session.post(
    url, payload,
    headers=headers,
    cookies=cookie)
I tried sending the captcha response in the payload, but I still couldn't get to the page I want.
If the website you're trying to scrape is reCAPTCHA-protected, your best bet is to use a stealthy method of scraping. That means either Selenium (with at least selenium-stealth) or a third-party web scraper, such as WebScrapingAPI, where I'm an engineer.
The advantage of using a third-party service is that it usually comes packed with reCAPTCHA solving, IP rotation, and other features to prevent bot detection, so you can focus on handling the scraped data rather than on building the scraper.
To get a better view of both options, here are two implementation examples you can compare:
1. Python With Stealthy Selenium
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium_stealth import stealth
from bs4 import BeautifulSoup
URL = 'https://www2.detran.rn.gov.br/externo/consultarveiculo.asp'
options = Options()
options.add_argument("start-maximized")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
driver = webdriver.Chrome(options=options)
stealth(driver,
languages=["en-US", "en"],
vendor="Google Inc.",
platform="Win32",
webgl_vendor="Intel Inc.",
renderer="Intel Iris OpenGL Engine",
fix_hairline=True)
driver.get(URL)
html = driver.page_source
driver.quit()
You should also look into integrating a captcha solver (like 2captcha) with this code.
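For reference, the 2captcha integration would look roughly like this. This is only a sketch, not tested against a live key; the `in.php`/`res.php` endpoints and the `userrecaptcha` method are from 2captcha's public API, and `api_key`, `sitekey` and `page_url` would come from your own account and the scraped page:

```python
import time
import requests  # assumes requests is installed

API_IN = 'http://2captcha.com/in.php'    # task submission endpoint
API_RES = 'http://2captcha.com/res.php'  # result polling endpoint

def submit_params(api_key, sitekey, page_url):
    # Fields 2captcha expects for a reCAPTCHA v2 task
    return {
        'key': api_key,
        'method': 'userrecaptcha',
        'googlekey': sitekey,
        'pageurl': page_url,
        'json': 1,
    }

def solve_recaptcha(api_key, sitekey, page_url, timeout=120):
    # Submit the task, then poll every 5 seconds until a worker solves it
    task = requests.post(API_IN, data=submit_params(api_key, sitekey, page_url)).json()
    for _ in range(timeout // 5):
        time.sleep(5)
        res = requests.get(API_RES, params={
            'key': api_key, 'action': 'get', 'id': task['request'], 'json': 1,
        }).json()
        if res['status'] == 1:
            return res['request']  # the g-recaptcha-response token
    raise TimeoutError('captcha not solved in time')
```

The returned token is what you'd put in the `g-recaptcha-response` field of the form payload.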
2. Python With WebScrapingAPI
import requests
URL = 'https://www2.detran.rn.gov.br/externo/consultarveiculo.asp'
API_KEY = '<YOUR_API_KEY>'
SCRAPER_URL = 'https://api.webscrapingapi.com/v1'
params = {
"api_key":API_KEY,
"url": URL,
"render_js":"1",
"js_instructions":'''
[{
"action":"value",
"selector":"input#placa",
"timeout": 5000,
"value":"<YOUR_PLACA>"
},
{
"action":"value",
"selector":"input#renavam",
"timeout": 5000,
"value":"<YOUR_RENAVAM>"
},
{
"action":"submit",
"selector":"button#btnConsultaPlaca",
"timeout": 5000
}]
'''
}
res = requests.get(SCRAPER_URL, params=params)
print(res.text)
I want to scrape a hidden phone number from a website using BeautifulSoup.
https://haraj.com.sa/1194697687 — as you can see in this link, the phone number is hidden; it only shows when you click the "التواصل" (Contact) button.
Here is my code:
import requests
from bs4 import BeautifulSoup
url = "https://haraj.com.sa/1199808969"
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36.'}
r = requests.get(url,headers=headers)
soup = BeautifulSoup(r.content,features='lxml')
post = soup.find('span', {'class': 'contact'})
print(post)
and here is the output I got
<span class="contact"><button class="sc-bdvvaa AGAbw" type="button"><img src="https://v8-cdn.haraj.com.sa/logos/contact_logo.svg" style="margin-left:5px;filter:brightness(0) invert(1)"/>التواصل</button></span>
BeautifulSoup won't be enough for what you're trying to do - it's just an HTML parser. And Selenium is overkill. The page you're trying to scrape from uses JavaScript to dynamically and asynchronously populate the DOM with content when you press the button. If you make a request to that page in Python, and try to parse the HTML, you're only looking at the barebones template, which would normally get populated later on by the browser.
The data for the modal comes from a fetch/XHR HTTP POST request to a GraphQL API, the response of which is JSON. If you use your browser's developer tools to log your network traffic when you press the button, you can see the HTTP request URL, query-string parameters, POST payload, request headers, etc. You just need to mimic that request in Python - fortunately this API seems to be pretty lenient, so you won't have to provide all the same parameters that the browser provides:
def main():
    import requests

    url = "https://graphql.haraj.com.sa"
    params = {
        "queryName": "postContact",
        "token": "",
        "clientId": "",
        "version": ""
    }
    headers = {
        "user-agent": "Mozilla/5.0"
    }
    payload = {
        "query": "query postContact($postId: Int!) {postContact(postId: $postId){contactText}}",
        "variables": {
            "postId": 94697687
        }
    }

    response = requests.post(url, params=params, headers=headers, json=payload)
    response.raise_for_status()
    print(response.json())
    return 0

if __name__ == "__main__":
    import sys
    sys.exit(main())
Output:
{'data': {'postContact': {'contactText': '0562038953'}}}
I was able to do this using Selenium and ChromeDriver (https://chromedriver.chromium.org/downloads). Just be sure to change the path to wherever you extracted chromedriver, and install selenium via pip:
pip install selenium
main.py
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
import time
url = "https://haraj.com.sa/1199808969"
def main():
    print(get_value())

def get_value():
    chrome_options = Options()
    chrome_options.add_argument("--headless")
    driver = webdriver.Chrome(r"C:\Developement\chromedriver.exe", chrome_options=chrome_options)
    driver.get(url)
    driver.find_element(By.CLASS_NAME, "AGAbw").click()
    time.sleep(5)
    val = driver.find_element(By.XPATH, '//*[@id="modal"]/div/div/a[2]/div[2]').text
    driver.quit()
    return val

main()
Output:
[0829/155029.109:INFO:CONSOLE(1)] "HBM Loaded", source: https://v8-cdn.haraj.com.sa/main.bf98551ba68f8bd6bee4.js (1)
[0829/155030.571:INFO:CONSOLE(1)] "[object Object]", source: https://v8-cdn.haraj.com.sa/main.bf98551ba68f8bd6bee4.js (1)
[0829/155030.604:INFO:CONSOLE(1)] "[object Object]", source: https://v8-cdn.haraj.com.sa/main.bf98551ba68f8bd6bee4.js (1)
[0829/155031.143:INFO:CONSOLE(16)] "Yay! SW loaded 🎉", source: https://haraj.com.sa/sw.js (16)
0559559838
So I'm really new to web scraping and I want to build a bot that checks the Uber ride price from point A to point B over a period of time. I used the Selenium library to input the pickup location and the destination and now I want to scrape the resulting estimated price from the page.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
# Initialize webdriver object
firefox_webdriver_path = '/usr/local/bin/geckodriver'
webdriver = webdriver.Firefox(executable_path=firefox_webdriver_path)
webdriver.get('https://www.uber.com/global/en/price-estimate/')
time.sleep(3)
# Find the search box
elem = webdriver.find_element_by_name('pickup')
elem.send_keys('name/of/the/pickup/location')
time.sleep(1)
elem.send_keys(Keys.ENTER)
time.sleep(1)
elem2 = webdriver.find_element_by_name('destination')
elem2.send_keys('name/of/the/destination')
time.sleep(1)
elem2.send_keys(Keys.ENTER)
time.sleep(5)
elem3 = webdriver.find_element_by_class_name('bn rw bp nk ih cr vk')
print(elem3.text)
Unfortunately, there's an error:
selenium.common.exceptions.NoSuchElementException: Message: Unable to locate element: .bn rw bp nk ih cr vk
And I can't seem to figure out a solution. While inspecting the page, the class that holds the price is named "bn rw bp nk ih cr vk", and after some searching I found out that these class names might be generated by JavaScript rather than being plain HTML. (I would also point out that I'm not really familiar with either.)
Eventually, I thought I could use BeautifulSoup and requests modules and faced another error.
import requests
from bs4 import BeautifulSoup
import re
import json
response = requests.get('https://www.uber.com/global/en/price-estimate/')
print(response.status_code)
406
I also tried changing the User Agent in hope of resolving this HTTP error message, but it did not work. I have no idea how to approach this.
Not exactly what you need, but I recently made a similar application, and I will provide the part of the function that you need. The only thing you need is to get the latitude and longitude. I used the google_places provider for this, but I'm sure there are many free services for it.
import requests
import json
def get_ride_price(origin_latitude, origin_longitude, destination_latitude, destination_longitude):
    url = "https://www.uber.com/api/loadFEEstimates?localeCode=en"
    payload = json.dumps({
        "origin": {
            "latitude": origin_latitude,
            "longitude": origin_longitude
        },
        "destination": {
            "latitude": destination_latitude,
            "longitude": destination_longitude
        },
        "locale": "en"
    })
    headers = {
        'content-type': 'application/json',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
        'x-csrf-token': 'x'
    }
    response = requests.request("POST", url, headers=headers, data=payload)
    result = [[x['vehicleViewDisplayName'], x['fareString']] for x in response.json()['data']['prices']]
    return result
print(get_ride_price(51.5072178, -0.1275862, 51.4974948, -0.1356583))
OUTPUT:
[['Assist', '£13.84'], ['Access', '£13.84'], ['Green', '£13.86'], ['UberX', '£14.53'], ['Comfort', '£16.02'], ['UberXL', '£17.18'], ['Uber Pet', '£17.77'], ['Exec', '£20.88'], ['Lux', '£26.32']]
Here is a screenshot of the response I am trying to scrape. I am trying to scrape the response received for the restaurant file (document type), using Selenium and Python.
I tried different ways to scrape it but couldn't find one that works.
I tried using desired_capabilities to get the logs:
desired_capabilities["goog:loggingPrefs"] = {"performance": "ALL"}
The closest I can get to extracting the info about the restaurant response is the code below, which prints out the info about the restaurant file. However, I am still unable to get the response body of the document.
I am trying to scrape this website.
import time
import json
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys
from selenium.webdriver import DesiredCapabilities
# make chrome log requests
def get_actions(driver):
    actions = ActionChains(driver)
    return actions

def setup():
    desired_capabilities = DesiredCapabilities.CHROME
    desired_capabilities["goog:loggingPrefs"] = {"performance": "ALL"}
    driver = webdriver.Chrome(
        desired_capabilities=desired_capabilities, executable_path=r"C:\Users\visha\PycharmProjects\grabScrapper\chromedriver.exe"
    )
    address = 'Manila '
    driver.get('https://food.grab.com/ph/en/restaurants')
    time.sleep(2)  # Let the user actually see something!
    driver.find_element_by_class_name("textPlaceholder___1yEAK").click()
    actions = get_actions(driver)
    actions.send_keys(address)
    actions.perform()
    time.sleep(1)
    actions = get_actions(driver)
    actions.send_keys(Keys.ARROW_DOWN)
    actions.perform()
    time.sleep(1)
    actions = get_actions(driver)
    actions.send_keys(Keys.ENTER)
    actions.perform()
    time.sleep(5)
    driver.refresh()
    time.sleep(5)
    logs = driver.get_log("performance")
    print(json.loads(logs[0]["message"])["message"])
    return driver

if __name__ == '__main__':
    driver = setup()
Here is the output:
{'method': 'Network.requestWillBeSent', 'params': {'documentURL':
'https://food.grab.com/ph/en/restaurants', 'frameId': '511A291FBAE6A23A8A6CC0151F478245',
'hasUserGesture': False, 'initiator': {'type': 'other'}, 'loaderId':
'398F451EBDC9243DC9A05E18809AD020', 'request': {'headers': {'Upgrade-Insecure-Requests': '1',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like
Gecko) Chrome/93.0.4577.82 Safari/537.36', 'sec-ch-ua': '"Google Chrome";v="93", " Not;A
Brand";v="99", "Chromium";v="93"', 'sec-ch-ua-mobile': '?0', 'sec-ch-ua-platform':
'"Windows"'}, 'initialPriority': 'VeryHigh', 'isSameSite': True, 'method': 'GET',
'mixedContentType': 'none', 'referrerPolicy': 'strict-origin-when-cross-origin', 'url':
'https://food.grab.com/ph/en/restaurants'}, 'requestId': '398F451EBDC9243DC9A05E18809AD020',
'timestamp': 75113.949225, 'type': 'Document', 'wallTime': 1632732378.366389}}
Please help me to scrape the response data of the restaurant file using Python so that I can do some manipulation on my own.
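For what it's worth, the filtering step over those performance-log entries needs only the stdlib json module, since each Selenium log entry carries a JSON string. A sketch, using a made-up sample entry shaped like the output above:

```python
import json

# Sample log entries shaped like what driver.get_log("performance") returns;
# the requestId value is taken from the output shown above
sample_logs = [
    {"message": json.dumps({"message": {
        "method": "Network.responseReceived",
        "params": {"type": "Document",
                   "response": {"url": "https://food.grab.com/ph/en/restaurants"},
                   "requestId": "398F451EBDC9243DC9A05E18809AD020"}}})},
    {"message": json.dumps({"message": {
        "method": "Network.requestWillBeSent", "params": {}}})},
]

def document_responses(logs):
    # Keep only Network.responseReceived events for Document resources
    out = []
    for entry in logs:
        msg = json.loads(entry["message"])["message"]
        if (msg["method"] == "Network.responseReceived"
                and msg["params"].get("type") == "Document"):
            out.append(msg["params"]["requestId"])
    return out

print(document_responses(sample_logs))  # → ['398F451EBDC9243DC9A05E18809AD020']
```

Once you have a requestId, Selenium 4's `driver.execute_cdp_cmd("Network.getResponseBody", {"requestId": rid})` should return the response body, though I haven't verified that against this particular site.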
I'm trying to log in to Facebook using the requests module. Although it seems I've prepared the payload the right way, when I send it with a POST request I don't get the desired content in the response. I do get a 200 status code, though. For reference: if I get the right response, I should find my full name within it.
I initially tried like the following:
import requests
from bs4 import BeautifulSoup
from datetime import datetime
link = 'https://www.facebook.com/'
base_url = 'https://www.facebook.com{}'
time = int(datetime.now().timestamp())
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36',
'referer': 'https://www.facebook.com/',
}
with requests.Session() as s:
    r = s.get(link)
    soup = BeautifulSoup(r.text, "lxml")
    form_url = soup.select_one("form[data-testid='royal_login_form']")['action']
    post_url = base_url.format(form_url)
    payload = {i['name']: i.get('value', '') for i in soup.select('input[name]')}
    payload['email'] = 'YOUR_EMAIL'
    payload['encpass'] = f'#PWD_BROWSER:0:{time}:YOUR_PASSWORD'
    payload.pop('pass')
    res = s.post(post_url, data=payload, headers=headers)
    print(res.url)
    print(res.text)
This is another way I tried which didn't work out either:
import requests
from bs4 import BeautifulSoup
from datetime import datetime
login_url = 'https://www.facebook.com/login/device-based/regular/login/?login_attempt=1&lwv=101'
time = int(datetime.now().timestamp())
headers = {
'user-agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
'origin': 'https://www.facebook.com',
'referer': 'https://www.facebook.com/login/device-based/regular/login/?login_attempt=1&lwv=101'
}
with requests.Session() as s:
    r = s.get(login_url)
    soup = BeautifulSoup(r.text, "lxml")
    payload = {i['name']: i.get('value', '') for i in soup.select('input[name]')}
    payload['email'] = 'YOUR_EMAIL'
    payload['encpass'] = f'#PWD_BROWSER:0:{time}:YOUR_PASSWORD'
    payload['had_password_prefilled'] = 'true'
    payload['had_cp_prefilled'] = 'true'
    payload['prefill_source'] = 'browser_dropdown'
    payload['prefill_type'] = 'contact_point'
    payload['first_prefill_source'] = 'last_login'
    payload['first_prefill_type'] = 'contact_point'
    payload['prefill_contact_point'] = 'YOUR_EMAIL'
    payload.pop('pass')
    r = s.post(login_url, data=payload, headers=headers)
    print(r.status_code)
    print(r.url)
How can I log in to facebook using requests?
This might be a case of the XY problem.
I recommend trying Selenium for accessing Facebook programmatically.
More examples of logging in using Selenium:
https://www.askpython.com/python/examples/python-automate-facebook-login
https://www.guru99.com/facebook-login-using-python.html
If logging in is all that you require, then using selenium, you could do it as:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
URL = 'https://www.facebook.com/'
PATH = r'C:\Program Files (x86)\chromedriver.exe'
driver = webdriver.Chrome(PATH)
driver.get(URL)
email = driver.find_element_by_id('email')
email.send_keys('YourEmail')
password = driver.find_element_by_id('pass')
password.send_keys('YourPassword')
password.send_keys(Keys.RETURN)
I would recommend that you use the browser that you frequently use to login for this process.
(I've tried looking but all of the other answers seem to be using urllib2)
I've just started trying to use requests, but I'm still not very clear on how to send or request something additional from the page. For example, I'll have
import requests
r = requests.get('http://google.com')
but I have no idea how to now, for example, do a google search using the search bar presented. I've read the quickstart guide but I'm not very familiar with HTML POST and the like, so it hasn't been very helpful.
Is there a clean and elegant way to do what I am asking?
Request Overview
The Google search request is a standard HTTP GET command. It includes a collection of parameters relevant to your queries. These parameters are included in the request URL as name=value pairs separated by ampersand (&) characters. Parameters include data like the search query and a unique CSE ID (cx) that identifies the CSE that is making the HTTP request. The WebSearch or Image Search service returns XML results in response to your HTTP requests.
First, you must get your CSE ID (cx parameter) at Control Panel of Custom Search Engine
Then, See the official Google Developers site for Custom Search.
There are many examples like this:
http://www.google.com/search?
start=0
&num=10
&q=red+sox
&cr=countryCA
&lr=lang_fr
&client=google-csbe
&output=xml_no_dtd
&cx=00255077836266642015:u-scht7a-8i
And there are explained the list of parameters that you can use.
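The name=value query string shown above doesn't need to be written by hand; it can be assembled with the stdlib. A quick sketch, reusing the parameter values from the sample URL above:

```python
from urllib.parse import urlencode

# Same parameters as the sample URL above, assembled programmatically;
# urlencode joins them with & and escapes spaces/special characters
params = {
    'start': 0,
    'num': 10,
    'q': 'red sox',
    'cr': 'countryCA',
    'lr': 'lang_fr',
    'client': 'google-csbe',
    'output': 'xml_no_dtd',
    'cx': '00255077836266642015:u-scht7a-8i',
}
url = 'http://www.google.com/search?' + urlencode(params)
print(url)
```

This produces the same style of URL, with 'red sox' encoded as red+sox and the colon in the cx value percent-escaped.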
import requests
from bs4 import BeautifulSoup
headers_Get = {
'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:49.0) Gecko/20100101 Firefox/49.0',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language': 'en-US,en;q=0.5',
'Accept-Encoding': 'gzip, deflate',
'DNT': '1',
'Connection': 'keep-alive',
'Upgrade-Insecure-Requests': '1'
}
def google(q):
    s = requests.Session()
    q = '+'.join(q.split())
    url = 'https://www.google.com/search?q=' + q + '&ie=utf-8&oe=utf-8'
    r = s.get(url, headers=headers_Get)

    soup = BeautifulSoup(r.text, "html.parser")
    output = []
    for searchWrapper in soup.find_all('h3', {'class': 'r'}):  # this line may change in future based on google's web page structure
        url = searchWrapper.find('a')["href"]
        text = searchWrapper.find('a').text.strip()
        result = {'text': text, 'url': url}
        output.append(result)
    return output
This will return an array of Google results in {'text': text, 'url': url} format. The top result URL would be google('search query')[0]['url'].
input:
import requests
def googleSearch(query):
    with requests.session() as c:
        url = 'https://www.google.co.in'
        query = {'q': query}
        urllink = c.get(url, params=query)
        print(urllink.url)

googleSearch('Linkin Park')
output:
https://www.google.co.in/?q=Linkin+Park
The readable way to send a request with many query parameters would be to pass URL parameters as a dictionary:
params = {
'q': 'minecraft', # search query
'gl': 'us', # country where to search from
'hl': 'en', # language
}
requests.get('URL', params=params)
But in order to get the actual response (output/text/data) that you see in the browser, you need to send additional headers, most importantly the user-agent, which makes the request look like a visit from a real browser; without it, a script announces itself as a different kind of client.
The reason your request might be blocked is that the default requests user agent is python-requests, and websites recognize that. Check what your user agent is.
You can read more about it in the blog post I wrote about how to reduce the chance of being blocked while web scraping.
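As a quick check, requests exposes the default headers it sends, so you can see the python-requests user agent yourself without making any network call (assuming only that requests is installed):

```python
import requests

# requests' default User-Agent is "python-requests/<version>", which
# many sites use to identify and block scripted traffic
default_ua = requests.utils.default_headers()['User-Agent']
print(default_ua)
```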
Pass user-agent:
headers = {
'User-agent':
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36'
}
requests.get('URL', headers=headers)
Code and example in the online IDE:
from bs4 import BeautifulSoup
import requests, lxml
headers = {
'User-agent':
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36'
}
params = {
'q': 'minecraft',
'gl': 'us',
'hl': 'en',
}
html = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')
for result in soup.select('.tF2Cxc'):
    title = result.select_one('.DKV0Md').text
    link = result.select_one('.yuRUbf a')['href']
    print(title, link, sep='\n')
Alternatively, you can achieve the same thing by using Google Organic API from SerpApi. It's a paid API with a free plan.
The difference is that you don't have to create it from scratch and maintain it.
Code to integrate:
import os
from serpapi import GoogleSearch
params = {
"engine": "google",
"q": "tesla",
"hl": "en",
"gl": "us",
"api_key": os.getenv("API_KEY"),
}
search = GoogleSearch(params)
results = search.get_dict()
for result in results["organic_results"]:
    print(result['title'])
    print(result['link'])
Disclaimer, I work for SerpApi.
In this code, using bs4, you can get all the h3 tags and print their text:
# Import the beautifulsoup
# and request libraries of python.
import requests
import bs4
# Make two strings with default google search URL
# 'https://google.com/search?q=' and
# our customized search keyword.
# Concatenate them
text= "c++ linear search program"
url = 'https://google.com/search?q=' + text
# Fetch the URL data using requests.get(url),
# store it in a variable, request_result.
request_result=requests.get( url )
# Creating soup from the fetched request
soup = bs4.BeautifulSoup(request_result.text,"html.parser")
# Collect all h3 headings (the result titles)
headings = soup.find_all("h3")
for heading in headings:
    print(heading.get_text())
You can use the webbrowser module; I think it doesn't get easier than that:
import webbrowser
query = input('Enter your query: ')
webbrowser.open(f'https://google.com/search?q={query}')