Scrape a phone number inside a popup button using Python BeautifulSoup

I want to scrape a hidden phone number from this page using BeautifulSoup:
https://haraj.com.sa/1194697687
As you can see at that link, the phone number is hidden and only shows up when you click the "التواصل" (contact) button.
Here is my code:
import requests
from bs4 import BeautifulSoup

url = "https://haraj.com.sa/1199808969"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36'}
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.content, features='lxml')
post = soup.find('span', {'class': 'contact'})
print(post)
and here is the output I got:
<span class="contact"><button class="sc-bdvvaa AGAbw" type="button"><img src="https://v8-cdn.haraj.com.sa/logos/contact_logo.svg" style="margin-left:5px;filter:brightness(0) invert(1)"/>التواصل</button></span>

BeautifulSoup won't be enough for what you're trying to do - it's just an HTML parser. And Selenium is overkill. The page you're trying to scrape from uses JavaScript to dynamically and asynchronously populate the DOM with content when you press the button. If you make a request to that page in Python, and try to parse the HTML, you're only looking at the barebones template, which would normally get populated later on by the browser. The data for the modal comes from a fetch/XHR HTTP POST request to a GraphQL API, the response of which is JSON. If you use your browser's developer tools to log your network traffic when you press the button, you can see the HTTP request URL, query-string parameters, POST payload, request headers, etc. You just need to mimic that request in Python - fortunately this API seems to be pretty lenient, so you won't have to provide all the same parameters that the browser provides:
def main():

    import requests

    url = "https://graphql.haraj.com.sa"

    params = {
        "queryName": "postContact",
        "token": "",
        "clientId": "",
        "version": ""
    }

    headers = {
        "user-agent": "Mozilla/5.0"
    }

    payload = {
        "query": "query postContact($postId: Int!) {postContact(postId: $postId){contactText}}",
        "variables": {
            "postId": 94697687
        }
    }

    response = requests.post(url, params=params, headers=headers, json=payload)
    response.raise_for_status()

    print(response.json())
    return 0


if __name__ == "__main__":
    import sys
    sys.exit(main())
Output:
{'data': {'postContact': {'contactText': '0562038953'}}}

I was able to do this using Selenium and ChromeDriver (https://chromedriver.chromium.org/downloads). Just be sure to change the path to wherever you extracted chromedriver, and install Selenium via pip:
pip install selenium
main.py
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
import time

url = "https://haraj.com.sa/1199808969"


def main():
    print(get_value())


def get_value():
    chrome_options = Options()
    chrome_options.add_argument("--headless")
    driver = webdriver.Chrome(r"C:\Developement\chromedriver.exe", chrome_options=chrome_options)
    driver.get(url)
    driver.find_element(By.CLASS_NAME, "AGAbw").click()
    time.sleep(5)
    val = driver.find_element(By.XPATH, '//*[@id="modal"]/div/div/a[2]/div[2]').text
    driver.quit()
    return val


main()
Output:
[0829/155029.109:INFO:CONSOLE(1)] "HBM Loaded", source: https://v8-cdn.haraj.com.sa/main.bf98551ba68f8bd6bee4.js (1)
[0829/155030.571:INFO:CONSOLE(1)] "[object Object]", source: https://v8-cdn.haraj.com.sa/main.bf98551ba68f8bd6bee4.js (1)
[0829/155030.604:INFO:CONSOLE(1)] "[object Object]", source: https://v8-cdn.haraj.com.sa/main.bf98551ba68f8bd6bee4.js (1)
[0829/155031.143:INFO:CONSOLE(16)] "Yay! SW loaded 🎉", source: https://haraj.com.sa/sw.js (16)
0559559838
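As a side note (my addition, not part of the original answer), the fixed time.sleep(5) can be replaced with an explicit wait so the script continues as soon as the modal content appears. A minimal sketch, assuming the same driver and modal XPath as above:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for the contact element inside the modal instead of sleeping a fixed time
val = WebDriverWait(driver, 10).until(
    EC.visibility_of_element_located((By.XPATH, '//*[@id="modal"]/div/div/a[2]/div[2]'))
).text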

Related

Bypassing recaptcha v2 using python requests

This is a web scraping project I'm working on. I need to send the response of this reCAPTCHA v2, but it's not bringing back the data I need.
import re
import requests

headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36'
}

url = 'https://www2.detran.rn.gov.br/externo/consultarveiculo.asp'

session = requests.session()
fazer_get = session.get(url, headers=headers)
cookie = fazer_get.cookies
html = fazer_get.text

try:
    rgxCaptchaKey = re.search(r'<div\s*class="g-recaptcha"\s*data-\s*sitekey="([^\"]*?)"></div>', html, re.IGNORECASE)
    captchaKey = rgxCaptchaKey.group(1)
except:
    print('erro')

# captcha() is my external captcha-solving helper; KEY is the solver API key
resposta_captcha = captcha(captchaKey, url, KEY)

placa = 'pcj90'
renavam = '57940'

payload = {
    'oculto': 'AvancarC',
    'placa': placa,
    'renavam': renavam,
    'g-recaptcha-response': resposta_captcha['code'],
    'btnConsultaPlaca': ''
}

fazerPost = session.post(
    url, payload,
    headers=headers,
    cookies=cookie)
I tried sending the captcha response in the payload, but I couldn't get to the page I want.
If the website you're trying to scrape is reCAPTCHA-protected, your best bet is to use a stealthy method for scraping. That means either Selenium (with at least selenium-stealth) or a third-party web scraper, such as WebScrapingAPI, where I'm an engineer.
The advantage of using the third-party service is that it usually comes packed with reCAPTCHA solving, IP rotation and other features to prevent bot detection, so you can focus on handling the scraped data rather than building the scraper.
In order to have a better view on both options, here are two implementation examples you can compare:
1. Python With Stealthy Selenium
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium_stealth import stealth
from bs4 import BeautifulSoup

URL = 'https://www2.detran.rn.gov.br/externo/consultarveiculo.asp'

options = Options()
options.add_argument("start-maximized")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)

driver = webdriver.Chrome(options=options)
stealth(driver,
        languages=["en-US", "en"],
        vendor="Google Inc.",
        platform="Win32",
        webgl_vendor="Intel Inc.",
        renderer="Intel Iris OpenGL Engine",
        fix_hairline=True)

driver.get(URL)
html = driver.page_source
driver.quit()
You should also look into integrating a captcha solver (like 2captcha) with this code.
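For reference (my addition, not part of the original answer), integrating a solver usually means submitting the site key and page URL to the solver's HTTP API and polling for the token. The sketch below uses 2captcha's in.php/res.php endpoints; the parameter names are from memory of their docs, so verify them before relying on this:

import time
import requests

API_KEY = "<YOUR_2CAPTCHA_KEY>"   # hypothetical placeholder for your solver key
SITE_KEY = "<RECAPTCHA_SITE_KEY>" # the data-sitekey value scraped from the page
PAGE_URL = "https://www2.detran.rn.gov.br/externo/consultarveiculo.asp"

# Submit the captcha job to the solver
submit = requests.post("http://2captcha.com/in.php", data={
    "key": API_KEY,
    "method": "userrecaptcha",
    "googlekey": SITE_KEY,
    "pageurl": PAGE_URL,
    "json": 1,
}).json()
job_id = submit["request"]

# Poll until the token is ready; the token is what goes into g-recaptcha-response
while True:
    time.sleep(5)
    result = requests.get("http://2captcha.com/res.php", params={
        "key": API_KEY,
        "action": "get",
        "id": job_id,
        "json": 1,
    }).json()
    if result["request"] != "CAPCHA_NOT_READY":
        token = result["request"]
        break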
2. Python With WebScrapingAPI
import requests

URL = 'https://www2.detran.rn.gov.br/externo/consultarveiculo.asp'

API_KEY = '<YOUR_API_KEY>'
SCRAPER_URL = 'https://api.webscrapingapi.com/v1'

params = {
    "api_key": API_KEY,
    "url": URL,
    "render_js": "1",
    "js_instructions": '''
    [{
        "action": "value",
        "selector": "input#placa",
        "timeout": 5000,
        "value": "<YOUR_EMAIL_OR_USERNAME>"
    },
    {
        "action": "value",
        "selector": "input#renavam",
        "timeout": 5000,
        "value": "<YOUR_PASSWORD>"
    },
    {
        "action": "submit",
        "selector": "button#btnConsultaPlaca",
        "timeout": 5000
    }]
    '''
}

res = requests.get(SCRAPER_URL, params=params)
print(res.text)

Log into a website using Selenium, but continue working (while logged in) with requests

I am using Selenium and the Chrome web driver to log into my account on a website, but after the login, I want to use other libraries (such as requests) to interact with the website.
I am using Selenium to try to bypass reCAPTCHA v3, but I want to use the requests and BeautifulSoup libraries to scrape data from the URL that comes after the login page (the URL that the login page redirects to after logging in).
Here is the code I've written for logging in, and a little snippet at the bottom which I plan to use for scraping the website post-login.
import requests
import os
import time
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.action_chains import ActionChains

chrome_options = Options()
chrome_options.add_argument("--headless")

driver = webdriver.Chrome("chromedriver", options=chrome_options)
action = ActionChains(driver)

url_1 = "https://ais.usvisa-info.com/en-am/niv/users/sign_in"
url_2 = "https://ais.usvisa-info.com/en-am/niv/account/settings/update_email"

email = "email"
password = 'password'

Headers = {
    "user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36"
}


def login():
    driver.get(url_1)
    driver.find_element_by_id("user_email").send_keys(email)
    driver.find_element_by_id("user_password").send_keys(password)
    driver.find_elements_by_class_name("icheckbox")[0].click()
    driver.find_elements_by_name("commit")[0].click()
    time.sleep(1)
    print(driver.current_url)


login()

test = requests.get(url_2, headers=Headers)  # request the post-login page
What logging in is actually doing is modifying your cookies to add a key, which verifies that you are logged in. What we can do with this info is to take the cookie data and reuse it for the Python requests module. Let's start by extracting the cookies from the webdriver like so:
driver_cookies = driver.get_cookies()
Now that you have your cookies, you can inject them into future requests in the cookies parameter, like so:
test = requests.get(url, headers=Headers, cookies=driver_cookies)
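One practical note (my addition, not part of the original answer): driver.get_cookies() returns a list of cookie dicts, while the cookies parameter of requests expects a name-to-value mapping or a CookieJar, so a small conversion is usually needed. A minimal sketch, assuming the driver_cookies, Headers, and url_2 names from the code above:

# Convert Selenium's list of cookie dicts into a simple {name: value} mapping for requests
cookies_dict = {c['name']: c['value'] for c in driver_cookies}

session = requests.Session()
session.headers.update(Headers)
session.cookies.update(cookies_dict)

# Subsequent requests in this session are sent with the logged-in cookies
test = session.get(url_2)
print(test.status_code)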

How to get request headers in Selenium

https://www.sahibinden.com/en
If you open it in an incognito window and check the headers in Fiddler, these are the two main headers you get:
When I click the last one and check the request headers, this is what I get.
I want to get these headers in Python. Is there any way I can get them using Selenium? I'm a bit clueless here.
You can use Selenium Wire. It is a Selenium extension which has been developed for this exact purpose.
https://pypi.org/project/selenium-wire/
An example after pip install:
## Import webdriver from Selenium Wire instead of Selenium
from seleniumwire import webdriver
## Get the URL
driver = webdriver.Chrome("my/path/to/driver", options=options)
driver.get("https://my.test.url.com")
## Print request headers
for request in driver.requests:
print(request.url) # <--------------- Request url
print(request.headers) # <----------- Request headers
print(request.response.headers) # <-- Response headers
You can run a JS command like this:
var req = new XMLHttpRequest()
req.open('GET', document.location, false)
req.send(null)
return req.getAllResponseHeaders()
In Python:
driver.get("https://t.me/codeksiyon")
headers = driver.execute_script("var req = new XMLHttpRequest();req.open('GET', document.location, false);req.send(null);return req.getAllResponseHeaders()")
# type(headers) == str
headers = headers.splitlines()
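If you'd rather have a mapping than a list of lines, a small follow-up step (my addition, assuming the headers list produced above) can parse each "Name: value" line into a dict:

# Each line looks like "content-type: text/html; charset=utf-8"
header_dict = dict(line.split(": ", 1) for line in headers if ": " in line)
print(header_dict.get("content-type"))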
The bottom line is: no, you can't retrieve the request headers using Selenium.
Details
Adding WebDriver methods to read the HTTP status code and headers from an HTTP response had long been a demand from Selenium users. Implementing this feature was discussed at length in the issue WebDriver lacks HTTP response header and status code methods.
However, Jason Leyba (Selenium contributor) stated plainly in his comment:
We will not be adding this feature to the WebDriver API as it falls outside of our current scope (emulating user actions).
Ashley Leyba further added that attempting to make WebDriver the ideal web testing tool would hurt its overall quality, since driver.get(url) blocks until the browser has loaded the page and returns the response for the final loaded page. So in the case of a login redirect, the status code and headers will always be those of the final 200 response instead of the 302 you're looking for.
Finally, Simon M. Stewart (WebDriver's creator) concluded in his comment:
This feature isn't going to happen. The recommended approach is to either extend the HtmlUnitDriver to access the information you require or to make use of an external proxy that exposes this information such as the BrowserMob Proxy
It's not possible to get headers using Selenium alone.
However, you can use other libraries, such as requests, to fetch the page and inspect its headers.
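For example (my addition, not from the original answer), requests exposes both the response headers and the request headers it actually sent:

import requests

response = requests.get("https://www.sahibinden.com/en",
                        headers={"User-Agent": "Mozilla/5.0"})

print(response.status_code)
print(dict(response.headers))          # response headers returned by the server
print(dict(response.request.headers))  # request headers that requests actually sent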
Maybe you can use BrowserMob Proxy for this. Here is an example:
import settings
from browsermobproxy import Server
from selenium import webdriver
from selenium.webdriver import DesiredCapabilities

config = settings.Config

server = Server(config.BROWSERMOB_PATH)
server.start()
proxy = server.create_proxy()

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--proxy-server=%s' % proxy.proxy)
chrome_options.add_argument('--headless')

capabilities = DesiredCapabilities.CHROME.copy()
capabilities['acceptSslCerts'] = True
capabilities['acceptInsecureCerts'] = True

driver = webdriver.Chrome(options=chrome_options,
                          desired_capabilities=capabilities,
                          executable_path=config.CHROME_PATH)

proxy.new_har("sahibinden", options={'captureHeaders': True})
driver.get("https://www.sahibinden.com/en")

entries = proxy.har['log']["entries"]
for entry in entries:
    if 'request' in entry.keys():
        print(entry['request']['url'])
        print(entry['request']['headers'])
        print('\n')

proxy.close()
driver.quit()
Another variation runs a synchronous XMLHttpRequest in the browser, parses getAllResponseHeaders(), and returns the result to Python as a dict:
js_headers = '''
const _xhr = new XMLHttpRequest();
_xhr.open("HEAD", document.location, false);
_xhr.send(null);
const _headers = {};
_xhr.getAllResponseHeaders().trim().split(/[\\r\\n]+/).map((value) => value.split(/: /)).forEach((keyValue) => {
_headers[keyValue[0].trim()] = keyValue[1].trim();
});
return _headers;
'''
page_headers = driver.execute_script(js_headers)
type(page_headers) # -> dict
You can use selenium-wire (https://pypi.org/project/selenium-wire/), a drop-in replacement for webdriver that adds request/response inspection and manipulation, even for HTTPS, by using its own local SSL certificate.
from seleniumwire import webdriver
d = webdriver.Chrome() # make sure chrome/chromedriver is in path
d.get('https://en.wikipedia.org')
vars(d.requests[-1].headers)
will list the headers of the last request in the requests object list:
{'policy': Compat32(), '_headers': [('content-length', '1361'),
('content-type', 'application/json'), ('sec-fetch-site', 'none'),
('sec-fetch-mode', 'no-cors'), ('sec-fetch-dest', 'empty'),
('user-agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.102 Safari/537.36'),
('accept-encoding', 'gzip, deflate, br')],
'_unixfrom': None, '_payload': None, '_charset': None,
'preamble': None, 'epilogue': None, 'defects': [], '_default_type': 'text/plain'}
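If you only need the header names and values rather than the object internals shown above, the headers object is dict-like (my addition, based on the Message-style object in the output above), so it can be converted directly:

last_request = d.requests[-1]
print(dict(last_request.headers))               # request headers as a plain dict
if last_request.response:                       # response may be None if nothing was captured
    print(dict(last_request.response.headers))  # corresponding response headers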

Scraping JSON from AJAX calls

Background
Consider this URL:
base_url = "https://www.olx.bg/ad/sobstvenik-tristaen-kamenitsa-1-CID368-ID81i3H.html"
I want to make the ajax call for the telephone number:
ajax_url = "https://www.olx.bg/ajax/misc/contact/phone/7XarI/?pt=e3375d9a134f05bbef9e4ad4f2f6d2f3ad704a55f7955c8e3193a1acde6ca02197caf76ffb56977ce61976790a940332147d11808f5f8d9271015c318a9ae729"
Wanted results
If I press the button on the site in my Chrome browser, in the console I get the wanted result:
{"value":"088 *****"}
Debugging
If I open a new tab and paste the ajax_url, I always get empty values:
{"value":"000 000 000"}
If I try something like:
Bash:
wget $ajax_url
Python:
import requests
json_response= requests.get(ajax_url)
I would just receive the HTML of the site's page saying that there is an error.
Ideas
I have something more when I open the request with the browser. What more do I have? Maybe a cookie?
How do I get the wanted result with Bash/Python?
Edit
The status code of the HTML response is 200.
I have tried with curl and I get the same HTML problem.
Kind of a fix.
I have noticed that if I copy the cookie from the browser and make a request with all the headers, INCLUDING that cookie, I get the correct result:
# I think the most important header is the cookie
headers = DICT_WITH_HEADERS_FROM_BROWSER
json_response= requests.get(next_url,
headers=headers,
)
Final question
The only question left is how can I generate a cookie through a Python script?
First you should create a requests Session to store cookies.
Then send an HTTP GET request to the page that actually triggers the AJAX call. If the website creates any cookie, it is sent in the GET response and your session stores it.
Then you can easily use the session to call the AJAX API.
Important note 1:
The AJAX URL you are calling is sent by the original website as an HTTP POST request! You should not send a GET request to that URL.
Important note 2:
You also must extract phoneToken from the website's JS code, where it is stored in a variable like var phoneToken = 'here is the pt';
Sample code:
import re
import requests

my_session = requests.Session()

# call html website
base_url = "https://www.olx.bg/ad/sobstvenik-tristaen-kamenitsa-1-CID368-ID81i3H.html"
base_response = my_session.get(url=base_url)
assert base_response.status_code == 200

# extract phone token from base url response
phone_token = re.findall(r'phoneToken\s=\s\'(.+)\';', base_response.text)[0]

# call ajax api
ajax_path = "/ajax/misc/contact/phone/81i3H/?pt=" + phone_token
ajax_url = "https://www.olx.bg" + ajax_path
ajax_headers = {
    'accept': '*/*',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.9,fa;q=0.8',
    'sec-fetch-mode': 'cors',
    'sec-fetch-site': 'same-origin',
    'Referer': 'https://www.olx.bg/ad/sobstvenik-tristaen-kamenitsa-1-CID368-ID81i3H.html',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36'
}
ajax_response = my_session.post(url=ajax_url, headers=ajax_headers)
print(ajax_response.text)
When you run the code above, the result below is displayed:
{"value":"088 558 9937"}
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from bs4 import BeautifulSoup
import time

options = Options()
options.add_argument('--headless')

driver = webdriver.Firefox(options=options)
driver.get(
    'https://www.olx.bg/ad/sobstvenik-tristaen-kamenitsa-1-CID368-ID81i3H.html')

number = driver.find_element_by_xpath(
    "/html/body/div[3]/section/div[3]/div/div[1]/div[2]/div/ul[1]/li[2]/div/strong").click()
time.sleep(2)

source = driver.page_source
soup = BeautifulSoup(source, 'html.parser')

phone = soup.find("strong", {'class': 'xx-large'}).text
print(phone)
Output:
088 558 9937

Cache Access Denied. Authentication Required in requests module

I am trying to make a basic web crawler. My internet connection goes through a proxy, so I used the solution given here, but while running the code I am still getting the error below.
My code is:
#!/usr/bin/python3.4
import requests
from bs4 import BeautifulSoup
import urllib.request as req

proxies = {
    "http": r"http://usr:pass@202.141.80.22:3128",
    "https": r"http://usr:pass@202.141.80.22:3128",
}

url = input("Ask user for something")


def santabanta(max_pages, url):
    page = 1
    while (page <= max_pages):
        source_code = requests.get(url, proxies=proxies)
        plain_text = source_code.text
        print(plain_text)
        soup = BeautifulSoup(plain_text, "lxml")
        for link in soup.findAll('a'):
            href = link.get('href')
            print(href)
        page = page + 1


santabanta(1, url)
But while running it in a terminal on Ubuntu 14.04 I am getting the following error:
The following error was encountered while trying to retrieve the URL: http://www.santabanta.com/wallpapers/gauhar-khan/? Cache Access Denied. Sorry, you are not currently allowed to request http://www.santabanta.com/wallpapers/gauhar-khan/? from this cache until you have authenticated yourself.
The URL I posted is: http://www.santabanta.com/wallpapers/gauhar-khan/
Please help me.
1. Open the URL.
2. Hit F12 (for Chrome users).
3. Go to "Network" in the menu below.
4. Hit F5 to reload the page so that Chrome records all the data received from the server.
5. Open any of the received files and scroll down to "Request Headers".
6. Pass all of those headers to requests.get().
Here is an image to help you: http://i.stack.imgur.com/zUEBE.png
Make the header as follows:
headers = {
    'Accept': '*/*',
    'Accept-Encoding': 'gzip, deflate, sdch',
    'Accept-Language': 'en-US,en;q=0.8',
    'Cache-Control': 'max-age=0',
    'Connection': 'keep-alive',
    'Proxy-Authorization': 'Basic ZWRjZ3Vlc3Q6ZWRjZ3Vlc3Q=',
    'If-Modified-Since': 'Fri, 13 Nov 2015 17:47:23 GMT',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36'
}
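To tie it together (my addition, assuming the headers dict above and the proxies dict from the question), the request would then look something like:

import requests

url = "http://www.santabanta.com/wallpapers/gauhar-khan/"

# Pass the copied browser headers (including Proxy-Authorization) along with the proxies
response = requests.get(url, headers=headers, proxies=proxies)
print(response.status_code)
print(response.text[:500])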
There is another way to solve this problem: let your Python script use the proxy defined in your environment variables.
Open a terminal (CTRL + ALT + T) and run:
export http_proxy="http://usr:pass@proxy:port"
export https_proxy="https://usr:pass@proxy:port"
then remove the proxy lines from your code.
Here is the changed code:
#!/usr/bin/python3.4
import requests
from bs4 import BeautifulSoup
import urllib.request as req

url = input("Ask user for something")


def santabanta(max_pages, url):
    page = 1
    while (page <= max_pages):
        source_code = requests.get(url)
        plain_text = source_code.text
        print(plain_text)
        soup = BeautifulSoup(plain_text, "lxml")
        for link in soup.findAll('a'):
            href = link.get('href')
            print(href)
        page = page + 1


santabanta(1, url)
