Scraping more than rendered data with Beautiful Soup - Python

I'm scraping app names from the Google Play Store, and for each URL I pass as input I only get 60 apps (because the website renders only 60 apps until the user scrolls down). How does this work, and how can I scrape all the apps from a page using BeautifulSoup and/or Selenium?
Thank you.
Here is my code:
urls = []
urls.extend(["https://play.google.com/store/apps/category/NEWS_AND_MAGAZINES/collection/topselling_paid"])

for i in urls:
    response = get(i)
    html_soup = BeautifulSoup(response.text, 'html.parser')
    app_container = html_soup.find_all('div', class_="card no-rationale square-cover apps small")
    file = open("./InputFiles/applications.txt", "w+")
    for i in range(0, len(app_container)):
        #print(app_container[i].div['data-docid'])
        file.write(app_container[i].div['data-docid'] + "\n")
    file.close()

num_lines = sum(1 for line in open('./InputFiles/applications.txt'))
print("Applications : " + str(num_lines))

In this case you need to use Selenium. I tried it for you and got all the apps. I will try to explain; I hope it will be clear.
Selenium is more powerful here than a plain HTTP request because it drives a real browser. I used ChromeDriver, so if you haven't installed it yet, you can get it from
http://chromedriver.chromium.org/
from time import sleep
from selenium import webdriver

options = webdriver.ChromeOptions()
driver = webdriver.Chrome(chrome_options=options,
                          executable_path=r'This part is your Driver path')
driver.get('https://play.google.com/store/apps/category/NEWS_AND_MAGAZINES/collection/topselling_paid')
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")  ## Scroll to the bottom of the page using the driver
sleep(5)  ## Give a delay so the page can load after scrolling; otherwise the program only gets the initial 60 elements
x = driver.find_elements_by_css_selector("div[class='card-content id-track-click id-track-impression']")  ## Declare which class
for a in x:
    print(a.text)
driver.close()
OUTPUT :
1. Pocket Casts
Podcast Media LLC
₺24,99
2. Broadcastify Police Scanner Pro
RadioReference.com LLC
₺18,99
3. Relay for reddit (Pro)
DBrady
₺8,00
4. Sync for reddit (Pro)
Red Apps LTD
₺15,00
5. reddit is fun golden platinum (unofficial)
TalkLittle
₺9,99
... **UP TO 75**
Note:
Don't mind the prices; they are in my country's currency, so they will be different in yours.
UPDATE ACCORDING TO YOUR COMMENT:
The same data-docid is also present in a span tag. You can get it using get_attribute. Just add the code below to your project.
y = driver.find_elements_by_css_selector("span[class=preview-overlay-container]")
for b in y:
    print(b.get_attribute('data-docid'))
OUTPUT
au.com.shiftyjelly.pocketcasts
com.radioreference.broadcastifyPro
reddit.news
com.laurencedawson.reddit_sync.pro
com.andrewshu.android.redditdonation
com.finazzi.distquakenoads
com.twitpane.premium
org.fivefilters.kindleit
.... UP TO 75
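If the category keeps loading more items as you scroll, a possible extension of the snippet above (my own sketch, not part of the original run) is to scroll in a loop until the page height stops growing:

last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    sleep(3)  # give newly loaded cards time to render
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:  # no new content was loaded
        break
    last_height = new_height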

Google Play has recently changed the user interface and structure of links and the display of information. I recently wrote a Scrape Google Play Search Apps in Python blog where I described the whole process in detail with more data.
In order to access all apps, you need to scroll to the bottom of the page.
After that, you can extract the app names and write them to a file. The extraction selectors have also changed; the ones used below are the new ones.
Code and full example in online IDE:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup
import time

urls = []
urls.extend(["https://play.google.com/store/apps?device=phone&hl=en_GB&gl=US"])

service = Service(ChromeDriverManager().install())

options = webdriver.ChromeOptions()
options.add_argument("--headless")
options.add_argument("--lang=en")
options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36")
options.add_argument("--no-sandbox")
options.add_argument("--disable-dev-shm-usage")

driver = webdriver.Chrome(service=service, options=options)

for url in urls:
    driver.get(url)

    # scrolling page
    while True:
        try:
            driver.execute_script("document.querySelector('.snByac').click();")
            time.sleep(2)
            break
        except:
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            time.sleep(2)

    soup = BeautifulSoup(driver.page_source, "lxml")

driver.quit()

with open("applications.txt", "w+") as file:
    for result in soup.select(".Epkrse"):
        file.write(result.text + "\n")

num_lines = sum(1 for line in open("applications.txt"))
print("Applications : " + str(num_lines))
Output:
Applications : 329
Also, you can use the Google Play Apps Store API from SerpApi. It bypasses blocks from search engines, and you don't have to create a parser from scratch or maintain it.
Code example:
from serpapi import GoogleSearch
import os

params = {
    # https://docs.python.org/3/library/os.html#os.getenv
    'api_key': os.getenv('API_KEY'),  # your SerpApi API key
    'engine': 'google_play',          # SerpApi search engine
    'store': 'apps'                   # Google Play Apps section
}

data = []

while True:
    search = GoogleSearch(params)    # where data extraction happens on the SerpApi backend
    result_dict = search.get_dict()  # JSON -> Python dict

    if result_dict.get('organic_results'):
        for result in result_dict.get('organic_results'):
            for item in result['items']:
                data.append(item['title'])

        next_page_token = result_dict['serpapi_pagination']['next_page_token']
        params['next_page_token'] = next_page_token
    else:
        break

with open('applications.txt', 'w+') as file:
    for app in data:
        file.write(app + "\n")

num_lines = sum(1 for line in open('applications.txt'))
print('Applications : ' + str(num_lines))
The output will be the same.

Related

Web scraping AirBnB price data with Python

I have been trying to web scrape an AirBnB listing to obtain the price, without much luck. I have successfully been able to bring in the other areas of interest (home description, home location, reviews, etc.). Below is what I've tried unsuccessfully. I think the issue is that the "price" on the web page is a 'span class' as opposed to the others, which are 'div class', but I'm speculating.
The URL I'm using is: https://www.airbnb.com/rooms/52361296?category_tag=Tag%3A8173&adults=4&children=0&infants=0&check_in=2022-12-11&check_out=2022-12-18&federated_search_id=6174a078-a823-4fad-827a-7ca652b5e786&source_impression_id=p3_1645454076_foOVSAshSYvdbpbS
This can be placed as the input in the below code.
Any assistance would be greatly appreciated.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn
from bs4 import BeautifulSoup
import requests
from IPython.display import IFrame

input_string = input("""Enter URLs for AirBnB sites that you want webscraped AND separate by a ',' : """)

airbnb_list = []
try:
    airbnb_list = input_string.split(",")
    x = 0
    y = len(airbnb_list)
    while y >= x:
        print(x+1, '.) ', airbnb_list[x])
        x = x+1
        if y == x:
            break
    #print(airbnb_list[len(airbnb_list)])
except:
    print("""Please separate list by a ','""")

a = pd.DataFrame([{"Title":'', "Stars": '', "Size":'', "Check In":'', "Check Out":'', "Rules":'',
                   "Location":'', "Home Type":'', "House Desc":''}])

for x in range(len(airbnb_list)):
    url = airbnb_list[x]
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')

    stars = soup.find(class_='_c7v1se').get_text()
    desc = soup.find(class_='_12nksyy').get_text()
    size = soup.find(class_='_jro6t0').get_text()
    #checkIn = soup.find(class_='_1acx77b').get_text()
    checkIn = soup.find(class_='_12aeg4v').get_text()
    #checkOut = soup.find(class_='_14tl4ml5').get_text()
    checkOut = soup.find(class_='_12aeg4v').get_text()
    Rules = soup.find(class_='cihcm8w dir dir-ltr').get_text()
    #location = soup.find(class_='_9ns6hl').get_text()
    location = soup.find(class_='_152qbzi').get_text()
    HomeType = soup.find(class_='_b8stb0').get_text()
    title = soup.title.string

    print('Stars: ', stars)
    print('')
    #Home Type
    print('Home Type: ', HomeType)
    print('')
    #Space Description
    print('Description: ', desc)
    print('')
    print('Rental size: ', size)
    print('')
    #CheckIn
    print('Check In: ', checkIn)
    print('')
    #CheckOut
    print('Check Out: ', checkOut)
    print('')
    #House Rules
    print('House Rules: ', Rules)
    print('')
    #print(soup.find("button", {"id":"#Id name of the button"}))
    #Home Location
    print('Home location: ', location)
    #Dates available
    #print('Dates available: ', soup.find(class_='_1yhfti2').get_text())
    print('===================================================================================')

    df = pd.DataFrame([{"Title":title, "Stars": stars, "Size":size, "Check In":checkIn, "Check Out":checkOut, "Rules":Rules,
                        "Location":location, "Home Type":HomeType, "House Desc":desc}])
    a = a.append(df)

#Attempting to print the price tag on the website
print(soup.find_all('span', {'class': '_tyxjp1'}))
print(soup.find(class_='_tyxjp1').get_text())
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-10-2d9689dbc836> in <module>
1 #print(soup.find_all('span', {'class': '_tyxjp1'}))
----> 2 print(soup.find(class_='_tyxjp1').get_text())
AttributeError: 'NoneType' object has no attribute 'get_text'
I see you are using the requests module to scrape airbnb.
That module is extremely versatile and works on websites that have static content.
However, it has one major drawback: it doesn't render content created by javascript.
This is a problem, as most of the websites these days create additional html elements using javascript once the user lands on the web page.
The airbnb price block is created exactly like that - using javascript.
There are many ways to scrape that kind of content.
My favourite way is to use selenium.
It's basically a library that allows you to launch a real browser and communicate with it using your programming language of choice.
Here's how you can easily use selenium.
First, set it up. Notice the headless option which can be toggled on and off.
Toggle it off if you want to see how the browser loads the webpage
# setup selenium (I am using chrome here, so chrome has to be installed on your system)
chromedriver_autoinstaller.install()
options = Options()
# set this to False if you want to see how the chrome window loads airbnb - useful for debugging
options.headless = True
driver = webdriver.Chrome(options=options)
Then, navigate to the website
# navigate to airbnb
driver.get(url)
Next, wait until the price block loads.
It might appear near instantaneous to us, but depending on the speed of your internet connection it might take a few seconds
# wait until the price block loads
timeout = 10
expectation = EC.presence_of_element_located((By.CSS_SELECTOR, '._tyxjp1'))
price_element = WebDriverWait(driver, timeout).until(expectation)
And finally, print the price
# print the price
print(price_element.get_attribute('innerHTML'))
I added my code to your example so you could play around with it
import chromedriver_autoinstaller
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
import pandas as pd
from bs4 import BeautifulSoup
import requests
from selenium.webdriver.common.by import By

input_string = input("""Enter URLs for AirBnB sites that you want webscraped AND separate by a ',' : """)

airbnb_list = []
try:
    airbnb_list = input_string.split(",")
    x = 0
    y = len(airbnb_list)
    while y >= x:
        print(x+1, '.) ', airbnb_list[x])
        x = x+1
        if y == x:
            break
    #print(airbnb_list[len(airbnb_list)])
except:
    print("""Please separate list by a ','""")

a = pd.DataFrame([{"Title":'', "Stars": '', "Size":'', "Check In":'', "Check Out":'', "Rules":'',
                   "Location":'', "Home Type":'', "House Desc":''}])

# setup selenium (I am using chrome here, so chrome has to be installed on your system)
chromedriver_autoinstaller.install()
options = Options()
# set this to False if you want to see how the chrome window loads airbnb - useful for debugging
options.headless = True
driver = webdriver.Chrome(options=options)

for x in range(len(airbnb_list)):
    url = airbnb_list[x]
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')

    # navigate to airbnb
    driver.get(url)

    # wait until the price block loads
    timeout = 10
    expectation = EC.presence_of_element_located((By.CSS_SELECTOR, '._tyxjp1'))
    price_element = WebDriverWait(driver, timeout).until(expectation)

    # print the price
    print(price_element.get_attribute('innerHTML'))
Keep in mind that your IP might eventually get banned for scraping AirBnb.
To work around that it is always a good idea to use proxy IPs and rotate them.
Follow this rotating proxies tutorial to avoid getting blocked.
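As a rough illustration of that idea with the requests module (the proxy addresses below are placeholders, not working endpoints):

import random
import requests

# hypothetical proxy pool - replace with proxies you actually have access to
proxies = [
    'http://proxy1.example.com:8080',
    'http://proxy2.example.com:8080',
]

proxy = random.choice(proxies)  # pick a different proxy for each request
response = requests.get(url, proxies={'http': proxy, 'https': proxy}, timeout=10)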
Hope that helps!

Scraping HTML code using Selenium with Python

I was trying to scrape some image data from some stores. For example, I was looking at some images from Nordstrom (tested with 'https://www.nordstrom.com/browse/men/clothing/sweaters').
I had initially used requests.get() to get the code, but I noticed that I was getting some javascript -- and upon further research I found that this occurred because the content is dynamically loaded into the html using javascript.
To remedy this issue, following this post (Python requests.get(url) returning javascript code instead of the page html), I tried to use selenium to get the html code. However, I still ran into issues trying to access all the html: it was still returning a lot of javascript. Finally, I added in some time delay as I thought maybe it needed some time to load in all of the html, but this still failed. Is there a way to get all the html using selenium? I have attached the current code below:
from bs4 import BeautifulSoup
import requests
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import time

def create_browser(webdriver_path):
    #create a selenium object that mimics the browser
    browser_options = Options()
    #headless tag creates an invisible browser
    browser_options.add_argument("--headless")
    browser_options.add_argument('--no-sandbox')
    browser = webdriver.Chrome(webdriver_path, chrome_options=browser_options)
    print("Done Creating Browser")
    return browser

url = 'https://www.nordstrom.com/browse/men/clothing/sweaters'
browser = create_browser('path/to/chromedriver_win32/chromedriver.exe')
browser.implicitly_wait(10)
browser.get(url)

time.sleep(10)
html_source = browser.page_source
print(html_source)
Is there something that I am not doing properly to load in all of the html code?
browser.page_source always returns the initial HTML source, not the current DOM state. Try
time.sleep(10)
html_source = browser.find_element_by_tag_name('html').get_attribute('outerHTML')
I would recommend reading "Test-Driven Development with Python", you'll get an answer for your question and so many more. You can read it for free here: https://www.obeythetestinggoat.com/ (and then you can also buy it ;-) )
Regarding your question, you have to wait that the element you're looking for is effectively loaded. You may use time.sleep but you'll get strange behavior depending on the speed of your internet connection and browser.
A better solution is explained here in depth: https://www.obeythetestinggoat.com/book/chapter_organising_test_files.html#self.wait-for
You can use the proposed solution:
MAX_WAIT = 10  # maximum number of seconds to keep retrying

def wait_for(self, fn):
    start_time = time.time()
    while True:
        try:
            return fn()
        except (AssertionError, WebDriverException) as e:
            if time.time() - start_time > MAX_WAIT:
                raise e
            time.sleep(0.5)
fn is just a function finding the element in the page.
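As a rough usage sketch (assuming you adapt it to a plain function by dropping the self parameter; the 'article' selector and the extra imports are my additions, not from the book):

import time
from selenium.common.exceptions import WebDriverException

# wait until at least one product card is present before reading the page source
wait_for(lambda: browser.find_element_by_css_selector('article'))  # placeholder selector
html_source = browser.page_source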
Just add a user agent. Chrome's headless user agent contains "HeadlessChrome", and that is the problem.
from bs4 import BeautifulSoup
import requests
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

def create_browser(webdriver_path):
    #create a selenium object that mimics the browser
    browser_options = Options()
    browser_options.add_argument('--headless')
    browser_options.add_argument('--user-agent="Mozilla/5.0 (Windows NT 4.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2049.0 Safari/537.36"')
    browser = webdriver.Chrome(webdriver_path, options=browser_options)
    print("Done Creating Browser")
    return browser

url = 'https://www.nordstrom.com/browse/men/clothing/sweaters'
browser = create_browser('C:/bin/chromedriver.exe')
browser.implicitly_wait(10)
browser.get(url)

divs = browser.find_elements_by_tag_name('a')
for div in divs:
    print(div.text)
Output (displays all the links on the page):
Patagonia Better Sweater® Quarter Zip Pullover
(56)
Nordstrom Men's Shop Regular Fit Cashmere Quarter Zip Pullover (Regular & Tall)
(73)
Nordstrom Cashmere Crewneck Sweater
(51)
Cutter & Buck Lakemont Half Zip Sweater
(22)
Nordstrom Washable Merino Quarter Zip Sweater
(2)
ALLSAINTS Mode Slim Fit Merino Wool Sweater
Process finished with exit code -1

Web scraping in Python - extract a value from website

I'm trying to extract two values from this website:
bizportal.co.il
One value is the dollar rate on the right, and the other is the rise/drop in percentage on the left.
The problem is that after I get the dollar rate value, the number is rounded for some reason (you can see it in the terminal). I want to get the exact number as shown on the website.
Is there some friendly documentation for web scraping in Python?
P.S.: how can I get rid of the pop-up Python terminal window when running code in VS Code? I just want the output to appear in VS Code, in the interactive window.
from urllib.request import urlopen
from bs4 import BeautifulSoup

my_url = "https://www.bizportal.co.il/forex/quote/generalview/22212222"
uClient = urlopen(my_url)
page_html = uClient.read()
uClient.close()

page_soup = BeautifulSoup(page_html, "html.parser")
div_class = page_soup.findAll("div", {"class": "data-row"})
print(div_class)
#print(div_class[0].text)
#print(div_class[1].text)
The data is loaded dynamically via Ajax, but you can simulate this request with requests module:
import json
import requests
url = 'https://www.bizportal.co.il/forex/quote/generalview/22212222'
ajax_url = "https://www.bizportal.co.il/forex/quote/AjaxRequests/DailyDeals_Ajax?paperId={paperId}&take=20&skip=0&page=1&pageSize=20"
paper_id = url.rsplit('/')[-1]
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:76.0) Gecko/20100101 Firefox/76.0'}
data = requests.get(ajax_url.format(paperId=paper_id), headers=headers).json()
# uncomment this to print all data:
#print(json.dumps(data, indent=4))
# print first one
print(data['Data'][0]['rate'], data['Data'][0]['PrecentageRateChange'])
Prints:
3.4823 -0.76%
The problem is this element is being dynamically updated with Javascript. You will not be able to scrape the 'up to date' value with urllib or requests. When the page is loaded, it has a recent value populated (likely from a database) and then it is replaced with the real time number via Javascript.
In this case it would be better to use something like Selenium, to load the webpage - this allows the javascript to execute on the page, and then scrape the numbers.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import time
options = Options()
options.add_argument("--headless") # allows you to scrape page without opening the browser window
driver = webdriver.Chrome('./chromedriver', options=options)
driver.get("https://www.bizportal.co.il/forex/quote/generalview/22212222")
time.sleep(1) # put in to allow JS time to load, sometimes works without.
values = driver.find_elements_by_class_name('num')
price = values[0].get_attribute("innerHTML")
change = values[1].find_element_by_css_selector("span").get_attribute("innerHTML")
print(price, "\n", change)
Output:
╰─$ python selenium_scrape.py
3.483
-0.74%
You should familiarize yourself with Selenium: understand how to set it up and run it - this includes installing the browser (in this case I am using Chrome, but you can use others), knowing where to get the browser driver (Chromedriver in this case), and understanding how to parse the page. You can learn all about it here https://www.selenium.dev/documentation/en/

I am scraping a site using Selenium and BeautifulSoup. Need the total number of pages on the website, or another way to navigate the pages

I am using Selenium WebDriver and Beautiful Soup to scrape a website that has a variable number of pages. I am doing it crudely through XPath. The site shows five page links at a time, and after the count reaches five I press the next button and reset the XPath count to get the next 5 pages. For this I need the total number of pages on the website through the code, or a better way of navigating the different pages.
I think the page uses AngularJS for navigation. The code is the following:
import time
import requests
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.maximize_window()
spg_index = ' '

url = "https://www.bseindia.com/corporates/ann.html"
driver.get(url)
soup = BeautifulSoup(driver.page_source, 'html.parser')
html = soup.prettify()
with open('bseann.txt', 'w', encoding='utf-8') as f:
    f.write(html)
time.sleep(1)

i = 1  #index for page numbers navigated. kept at a maximum of 31 at present
k = 1  #goes up to 5, the maximum number of navigation pages shown at one time
while i < 31:
    next_pg = 9  #xpath number to pinpoint the "next" page
    snext_pg = str(next_pg)
    snext_pg = snext_pg.strip()
    if i > 5:
        next_pg = 10  #when we go to the next set of pages there is an additional option
    if (i == 6) or (i == 11) or (i == 16):  #resetting xpath index for each set of pages
        k = 2
    path = '/html/body/div[1]/div[5]/div[2]/div[1]/div[1]/ul/li['
    path = path + snext_pg + ']/a'
    next_page_btn_list = driver.find_elements_by_xpath(path)
    next_page_btn = next_page_btn_list[0]
    next_page_btn.click()  #click next page
    time.sleep(1)

    pg_index = k + 2
    spg_index = str(pg_index)
    spg_index = spg_index.strip()
    path = '/html/body/div[1]/div[5]/div[2]/div[1]/div[1]/ul/li['
    path = path + spg_index + ']/a'
    next_page_btn_list = driver.find_elements_by_xpath(path)
    next_page_btn = next_page_btn_list[0]
    next_page_btn.click()  #click specific pg no.
    time.sleep(1)

    soup = BeautifulSoup(driver.page_source, 'html.parser')
    html = soup.prettify()
    i = i + 1
    k = k + 1
    with open('bseann.txt', 'a', encoding='utf-8') as f:
        f.write(html)
No need to use Selenium here as you can access the info from the API. This pulled 247 announcements:
import requests
from pandas.io.json import json_normalize

url = 'https://api.bseindia.com/BseIndiaAPI/api/AnnGetData/w'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'}
payload = {
    'strCat': '-1',
    'strPrevDate': '20190423',
    'strScrip': '',
    'strSearch': 'P',
    'strToDate': '20190423',
    'strType': 'C'}

jsonData = requests.get(url, headers=headers, params=payload).json()
df = json_normalize(jsonData['Table'])
df['ATTACHMENTNAME'] = '=HYPERLINK("https://www.bseindia.com/xml-data/corpfiling/AttachLive/' + df['ATTACHMENTNAME'] + '")'
df.to_csv('C:/filename.csv', index=False)
Output:
...
GYSCOAL ALLOYS LTD. - 533275 - Announcement under Regulation 30 (LODR)-Code of Conduct under SEBI (PIT) Regulations, 2015
https://www.bseindia.com/xml-data/corpfiling/AttachLive/82f18673-de98-4a88-bbea-7d8499f25009.pdf
INDIAN SUCROSE LTD. - 500319 - Certificate Under Regulation 40(9) Of Listing Regulation For The Half Year Ended 31.03.2019
https://www.bseindia.com/xml-data/corpfiling/AttachLive/2539d209-50f6-4e56-a123-8562067d896e.pdf
Dhanvarsha Finvest Ltd - 540268 - Reply To Clarification Sought From The Company
https://www.bseindia.com/xml-data/corpfiling/AttachLive/f8d80466-af58-4336-b251-a9232db597cf.pdf
Prabhat Telecoms (India) Ltd - 540027 - Signing Of Framework Supply Agreement With METRO Cash & Carry India Private Limited
https://www.bseindia.com/xml-data/corpfiling/AttachLive/acfb1f72-efd3-4515-a583-2616d2942e78.pdf
...
A bit more information about your use case would have helped to answer your question. However, to extract the information about the total number of pages on the website, you can open the site, click on the item with the text Last, and extract the required data, using the following solution:
Code Block:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_argument("--disable-extensions")
# options.add_argument('disable-infobars')
driver = webdriver.Chrome(chrome_options=options, executable_path=r'C:\Utility\BrowserDrivers\chromedriver.exe')
driver.get("https://www.bseindia.com/corporates/ann.html")
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//a[text()='Disclaimer']//following::div[1]//li[@class='pagination-last ng-scope']/a[@class='ng-binding' and text()='Last']"))).click()
print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//a[text()='Disclaimer']//following::div[1]//li[@class='pagination-page ng-scope active']/a[@class='ng-binding']"))).get_attribute("innerHTML"))
Console Output:
17
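If you also need to visit every page rather than just count them, a possible follow-up sketch (starting again from the first page; the 'pagination-next' class is my assumption from the same Angular pagination widget and is not verified) is to click the Next link repeatedly:

total_pages = 17  # the value printed above
driver.get("https://www.bseindia.com/corporates/ann.html")  # start again from page 1
for _ in range(total_pages - 1):
    # parse driver.page_source with BeautifulSoup here before moving on
    next_link = WebDriverWait(driver, 20).until(EC.element_to_be_clickable(
        (By.XPATH, "//li[contains(@class, 'pagination-next')]/a")))  # assumed class name
    next_link.click()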

Python/Selenium "hover-and-click" not working on WebElement whose class changes at hover

I'm using the Selenium library in Python to scrape a website written in js. My strategy is to move through the website using Selenium and, at the right time, scrape with BeautifulSoup. This works just fine in simple tests, except when, as shown in the following picture,
I need to click on the "<" button.
The "class" of the button changes at hover, so I'm using ActionChains to move to the element and click on it (I'm also using sleep to give the browser enough time to load the page). Python is not throwing any exception, but the click is not working (i.e. the calendar is not moving backwards).
Below I provide the mentioned website and the code I wrote, with an example. Do you have any idea why this is happening and/or how I can overcome this issue? Thank you very very much.
Website = https://burocomercial.profeco.gob.mx/index.jsp
Code:
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
import time

driver = webdriver.Chrome(path_to_webdriver)
driver.get('https://burocomercial.profeco.gob.mx/index.jsp') #access website

# Search bar and search button
search_bar = driver.find_elements_by_xpath('//*[@id="txtbuscar"]')
search_button = driver.find_element_by_xpath('//*[@id="contenido"]/div[2]/div[2]/div[2]/div/div[2]/div/button')

# Perform search
search_bar[0].send_keys("inmobiliaria")
search_button.click()

# Select result
time.sleep(2)
xpath = '//*[@id="resultados"]/div[4]/table/tbody/tr[1]/td[5]/button'
driver.find_elements_by_xpath(xpath)[0].click()

# Open calendar
time.sleep(5)
driver.find_element_by_xpath('//*[@id="calI"]').click() #opens calendar
time.sleep(2)

# Hover-and-click on "<" (Here's the problem!!!)
cal_button = driver.find_element_by_xpath('//div[@id="ui-datepicker-div"]/div/a')
time.sleep(4)
ActionChains(driver).move_to_element(cal_button).perform() #hover
prev_button = driver.find_element_by_class_name('ui-datepicker-prev') #catch element whose class was changed by the hover
ActionChains(driver).click(prev_button).perform() #click
time.sleep(1)
print('clicked on it a second ago. No exception was raised, but the click was not performed')
time.sleep(1)
This is a different approach using requests. I think that Selenium should be the last option to use when doing web scraping. Usually, it is possible to retrieve the data from a webpage by emulating the requests made by the web application.
import requests
from bs4 import BeautifulSoup as BS
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.101 Safari/537.36'}
## Starts session
s = requests.Session()
s.headers = headers
url_base = 'https://burocomercial.profeco.gob.mx/'
ind = 'index.jsp'
resp0 = s.get(url_base+ind) ## First request, to get the 'name' parameter that is dynamic
soup0 = BS(resp0.text, 'lxml')
param_name = soup0.select_one('input[id="txtbuscar"]')['name']
action = 'BusGeneral' ### The action when submit the form
keyword = 'inmobiliaria' # Word to search
data_buscar = {param_name:keyword,'yy':'2017'} ### Data submitted
resp1 = s.post(url_base+action,data=data_buscar) ## second request: make the search
resp2 = s.get(url_base+ind) # Third request: retrieve the results
print(resp2.text)
queja = 'Detalle_Queja.jsp' ## Action when Quejas selected
data_queja = {'Lookup':'2','Val':'1','Bus':'2','FI':'28-Nov-2016','FF':'28-Feb-2017','UA':'0'} # Data for queja form
## Lookup is the number of the row in the table, FI is the initial date and FF, the final date, UA is Unidad Administrativa
## You can change these parameters to obtain different queries.
resp3 = s.post(url_base+queja,data=data_queja) # retrieve Quejas results
print(resp3.text)
With this I got:
'\r\n\r\n\r\n\r\n\r\n\r\n1|<h2>ABITARE PROMOTORA E INMOBILIARIA, SA DE CV</h2>|0|0|0|0.00|0.00|0|0.00|0.00|0.00|0.00|0 % |0 % ||2'
Which contains the data that is used in the webpage.
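If it helps, that pipe-delimited body can be split into fields with plain string handling; which position means what is my guess from the sample above, so treat it as a sketch:

raw = resp3.text.strip()
fields = raw.split('|')
company = BS(fields[1], 'lxml').get_text()  # strip the <h2> wrapper around the company name
print(company, fields[2:])                  # remaining positions look like counts and percentages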
Maybe this answer is not exactly what you are looking for, but I think it could be easier for you to use requests.
You don't need to hover the <, just click it.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome(path_to_webdriver)
driver.get('https://burocomercial.profeco.gob.mx/index.jsp') #access website

# set up wait
wait = WebDriverWait(driver, 10)

# Perform search
driver.find_element_by_id('txtbuscar').send_keys("inmobiliaria")
driver.find_element_by_css_selector('button[alt="buscar"]').click()

# Select result
xpath = '//*[@id="resultados"]/div[4]/table/tbody/tr[1]/td[5]/button'
wait.until(EC.element_to_be_clickable((By.XPATH, xpath))).click()

# Open calendar
wait.until(EC.element_to_be_clickable((By.ID, 'calI'))).click() #opens calendar
wait.until(EC.visibility_of_element_located((By.ID, 'ui-datepicker-div')))

# Click on "<"
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, 'a[title="Ant"]'))).click()
A few things:
If your XPath consists only of an ID, just use .find_element_by_id(). It's faster and easier to read.
If you are only using the first element in a collection, e.g. search_bar, just use .find_element_* instead of .find_elements_* and search_bar[0].
Don't use sleeps. Sleeps are a bad practice and result in unreliable tests. Instead use expected conditions, e.g. to wait for an element to be clickable, as in the short sketch below.
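To illustrate those points together, here is a minimal sketch reusing the search bar and button from this page (not a drop-in replacement for the whole script):

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

wait = WebDriverWait(driver, 10)

# locate by ID directly instead of wrapping the ID in an XPath
search_bar = driver.find_element_by_id('txtbuscar')  # single element, no [0] indexing needed
search_bar.send_keys('inmobiliaria')

# wait for the button to be clickable instead of sleeping a fixed time
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, 'button[alt="buscar"]'))).click()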
