I have a df like this (the real df has more than 100 rows):

Sucursal  url
3         https://www.google.com/maps/place/Moov/#-34.4718341,-58.5170394,17z/data=!3m1!4b1!4m5!3m4!1s0x95bcb1c7fd6203c3:0x37a2af54c5fe1a!8m2!3d-34.4718341!4d-58.5148507
11        https://www.google.com/maps/place/Moov/#-34.6496456,-58.6220989,17z/data=!3m1!4b1!4m5!3m4!1s0x95bcc7612176b9a3:0x2c6fcf2164a536b9!8m2!3d-34.6496453!4d-58.6199217
I'm using Selenium to get some reviews from Google, so I need to loop over the URLs in the "url" column.
This is the first part of the script:
for i in range(0, 2):  # I'm testing with a few elements to see if it works
    driver = webdriver.Chrome()
    url = str(df.iloc[[i]]['url']).split()[1]
    driver.get(url)
    time.sleep(5)
    driver.find_element_by_xpath('//*[@id="QA0Szd"]/div/div/div[1]/div[2]/div/div[1]/div/div/div[2]/div[1]/div[1]/div[2]/div/div[1]/div[2]/span[2]/span[1]/button').click()
    time.sleep(1)
and so on
The problem is that when the driver opens the page, the URL shown in the rendered page is: "https://www.google.com/maps/place/Moov/#-34.4718341,-58.5170394,17z"
The same thing happens with all the URL strings... I don't know why it isn't getting 100% of the string.
If I print it as a value:
print(df['url'].values)
the URL string is fine... but not when the Chrome driver opens it...
What am I doing wrong?
Thanks in advance!
I used a TSV file to generate the df based on your data:
import contextlib
import pandas
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
options = webdriver.ChromeOptions()
options.add_experimental_option("excludeSwitches", ["enable-automation", "enable-logging"])
# path to your chromedriver
service = Service(executable_path='your-path')
driver = webdriver.Chrome(service=service, options=options)
wait = WebDriverWait(driver, 10)
df = pandas.read_csv('grev.tsv', sep='\t')
# accept cookies
driver.get('https://www.google.com/maps')
with contextlib.suppress(TimeoutException):
    wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, 'form[action]:nth-child(2)'))).click()

# open links from df and get reviews
for url in df['url']:
    driver.get(url)
    # click on the "More" button for reviews
    driver.execute_script("for (let elem of document.querySelectorAll('jsl>button')) {elem.click()}")
    # get the first 3 reviews
    reviews = driver.find_elements(By.CSS_SELECTOR, 'div[data-review-id]:first-child')
    for review in reviews:
        print(review.find_element(By.CSS_SELECTOR, 'div[class="MyEned"]').text)

driver.close()
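For reference, the original truncation happens because str(df.iloc[[i]]['url']) converts a one-row Series to its printed representation, and pandas abbreviates long values with an ellipsis (per display.max_colwidth) before .split() ever runs. A minimal sketch of pulling the full string instead, assuming the same df:

# take the cell value directly instead of parsing the Series repr
url = df['url'].iloc[i]   # full, untruncated string
driver.get(url)

Iterating over the column directly, as in the answer above, avoids the problem entirely.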
I am trying to extract the number of YouTube comments and have tried several methods.
My Code:
from selenium import webdriver
import pandas as pd
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
import time
DRIVER_PATH = '<your chromedriver path>'  # placeholder; replace with your actual chromedriver path
wd = webdriver.Chrome(executable_path=DRIVER_PATH)
url = 'https://www.youtube.com/watch?v=5qzKTbnhyhc'
wd.get(url)
wait = WebDriverWait(wd, 100)
time.sleep(40)
v_title = wd.find_element_by_xpath('//*[@id="container"]/h1/yt-formatted-string').text
print("title Is ")
print(v_title)
comments_xpath = '//h2[@id="count"]/yt-formatted-string/span[1]'
v_comm_cnt = wait.until(EC.visibility_of_element_located((By.XPATH, comments_xpath)))
#wd.find_element_by_xpath(comments_xpath)
print(len(v_comm_cnt))
I get the following error:
selenium.common.exceptions.TimeoutException: Message:
I get the correct value for the title but not for comment_cnt. Can anyone please guide me on what is wrong with my code?
Please note that the comments count path - //h2[@id="count"]/yt-formatted-string/span[1] - points to the correct place if I search for the value in inspect element.
Updated answer
Well, it was tricky!
There are several issues here:
This page has some bad JavaScript on it that makes the Selenium driver.get() method wait until the timeout even though the page looks loaded. To overcome that I used the eager page load strategy (a sketch of the newer Selenium 4 way to set this follows the code and output below).
This page has several blocks of markup for the same areas, so sometimes one of them is used (visible) and sometimes the other. This makes working with element locators difficult. So here I wait for visibility of the title element in one of those blocks; if it is visible I extract the text from there, otherwise I wait for visibility of the second element (it appears immediately) and extract the text from there.
There are several ways to scroll the page. Not all of them worked here. I found one that works and doesn't scroll too far.
The code below is 100% working; I ran it several times.
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.chrome.service import Service
options = Options()
options.add_argument("--start-maximized")
caps = DesiredCapabilities().CHROME
caps["pageLoadStrategy"] = "eager"
s = Service(r'C:\webdrivers\chromedriver.exe')
driver = webdriver.Chrome(options=options, desired_capabilities=caps, service=s)
url = 'https://www.youtube.com/watch?v=5qzKTbnhyhc'
driver.get(url)
driver.maximize_window()
wait = WebDriverWait(driver, 10)
title_xpath = "//div[@class='style-scope ytd-video-primary-info-renderer']/h1"
alternative_title = "//*[@id='title']/h1"
v_title = ""
try:
    v_title = wait.until(EC.visibility_of_element_located((By.XPATH, title_xpath))).text
except:
    v_title = wait.until(EC.visibility_of_element_located((By.XPATH, alternative_title))).text
print("Title is " + v_title)
comments_xpath = "//div[@id='title']//*[@id='count']//span[1]"
driver.execute_script("window.scrollBy(0, arguments[0]);", 600)
try:
    v_comm_cnt = wait.until(EC.visibility_of_element_located((By.XPATH, comments_xpath)))
except:
    pass
v_comm_cnt = driver.find_element(By.XPATH, comments_xpath).text
print("Video has " + v_comm_cnt + " comments")
The output is:
Title is Music for when you are stressed 🍀 Chil lofi | Music to Relax, Drive, Study, Chill
Video has 834 comments
Process finished with exit code 0
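As a side note to point 1 above: in Selenium 4 the page load strategy can also be set directly on the Options object instead of going through DesiredCapabilities. A minimal sketch, assuming a current Selenium 4 install and chromedriver available on PATH:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
# "eager" returns from driver.get() once the DOM is ready,
# without waiting for all subresources to finish loading
options.page_load_strategy = "eager"

driver = webdriver.Chrome(options=options)
driver.get("https://www.youtube.com/watch?v=5qzKTbnhyhc")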
I am trying to scrape a Nasdaq webpage and have an issue locating elements:
My code:
from selenium import webdriver
import time
import pandas as pd
driver = webdriver.Chrome()  # driver setup was omitted in the question; assuming a plain Chrome instance
driver.get('http://www.nasdaqomxnordic.com/shares/microsite?Instrument=CSE32679&symbol=ALK%20B&name=ALK-Abell%C3%B3%20B')
time.sleep(5)
btn_overview = driver.find_element_by_xpath('//*[@id="tabarea"]/section/nav/ul/li[2]/a')
btn_overview.click()
time.sleep(5)
employees = driver.find_element_by_xpath('//*[@id="CompanyProfile"]/div[6]')
After the last call, I receive the following error:
NoSuchElementException: no such element: Unable to locate element: {"method":"xpath","selector":"//*[@id="CompanyProfile"]/div[6]"}
Normally the problem would be a wrong 'xpath', but I tried several locators, also by 'id'. I suspect it has something to do with the tabs (in my case navigating to "Overview"). Visually the webpage changes, but if, for example, I scrape the table, it still gets it from the first page:
table_test = pd.read_html(driver.page_source)[0]
What am I missing or doing wrong?
The overview page is inside an iframe:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
option = webdriver.ChromeOptions()
option.add_argument("start-maximized")
#chrome to stay open
option.add_experimental_option("detach", True)
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()),options=option)
driver.get('http://www.nasdaqomxnordic.com/shares/microsite?Instrument=CSE32679&symbol=ALK%20B&name=ALK-Abell%C3%B3%20B')
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, '//*[@id="tabarea"]/section/nav/ul/li[2]/a'))).click()
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, '//*[@id="cookieConsentOK"]'))).click()
WebDriverWait(driver, 20).until(EC.frame_to_be_available_and_switch_to_it((By.CSS_SELECTOR, "iframe#MorningstarIFrame")))
employees = WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.XPATH, '//*[@id="CompanyProfile"]/div[6]'))).text.split()[1]
print(employees)
Output:
2,537
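A follow-up note (not part of the original answer): after frame_to_be_available_and_switch_to_it the driver context stays inside the Morningstar iframe, so if you need to scrape anything else on the outer Nasdaq page afterwards, switch back first:

# return to the top-level document after reading values inside the iframe
driver.switch_to.default_content()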
(The answer above uses webdriver_manager, so there is no need to point Selenium at a local chromedriver binary.)
Are you sure you need Selenium?
import requests
from bs4 import BeautifulSoup
url = 'http://lt.morningstar.com/gj8uge2g9k/stockreport/default.aspx'
payload = {
    'SecurityToken': '0P0000A5LL]3]1]E0EXG$XCSE_3060'}
response = requests.get(url, params=payload)
soup = BeautifulSoup(response.text, 'html.parser')
employees = soup.find('h3', text='Employees').next_sibling.text
print(employees)
Output:
2,537
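A small caveat on the BeautifulSoup lookup (my note, not the answerer's): newer BeautifulSoup releases prefer the string= keyword; text= still works but is treated as a deprecated alias. The equivalent call would be:

# same lookup with the newer keyword argument
employees = soup.find('h3', string='Employees').next_sibling.text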
I am working on a script to gather information off Newegg to look at changes in graphics card prices over time. Currently, my script opens a Newegg search for RTX 3080s through Chromedriver and then clicks the link for Desktop Graphics Cards to narrow down my search.

The part I am struggling with is writing a for-loop that lets me iterate through all 8 search result pages. I know I could do this by simply changing the page number in the URL, but since this is an exercise I'm using to learn relative XPath better, I want to do it using the pagination buttons at the bottom of the page. I know that each button should contain inner text of "1, 2, 3, 4, etc.", but whenever I use text() = {item} in my for loop, it doesn't click the button. The script runs and doesn't raise any exceptions, but it doesn't do what I want it to.

Below I have attached the HTML for the page as well as my current script. Any suggestions or hints are appreciated.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException
import pandas as pd
import time
options = Options()
PATH = 'C://Program Files (x86)//chromedriver.exe'
driver = webdriver.Chrome(PATH)
url = 'https://www.newegg.com/p/pl?d=RTX+3080'
driver.maximize_window()
driver.get(url)
card_path = '/html/body/div[8]/div[3]/section/div/div/div[1]/div/dl[1]/dd/ul[2]/li/a'
desktop_graphics_cards = driver.find_element(By.XPATH, card_path)
desktop_graphics_cards.click()
time.sleep(5)
graphics_card = []
shipping_cost = []
price = []
total_cost = []
for item in range(9):
    try:
        #next_page_click = driver.find_element(By.XPATH("//button[text() = '{item + 1}']"))
        print(next_page_click)
        next_page_click.click()
    except:
        pass
The pagination buttons are outside of the initially visible area.
In order to click these elements you will have to scroll the page until the element appears.
Also, you need to click the next-page buttons starting from 2 up to 9 (inclusive), while you are trying to do this with numbers from 1 up to 9.
I think this should work better:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.common.action_chains import ActionChains
from selenium.common.exceptions import NoSuchElementException
import pandas as pd
import time
options = Options()
PATH = 'C://Program Files (x86)//chromedriver.exe'
driver = webdriver.Chrome(PATH)
url = 'https://www.newegg.com/p/pl?d=RTX+3080'
actions = ActionChains(driver)
driver.maximize_window()
driver.get(url)
card_path = '/html/body/div[8]/div[3]/section/div/div/div[1]/div/dl[1]/dd/ul[2]/li/a'
desktop_graphics_cards = driver.find_element(By.XPATH, card_path)
desktop_graphics_cards.click()
time.sleep(5)
graphics_card = []
shipping_cost = []
price = []
total_cost = []
for item in range(2, 10):
    try:
        next_page_click = driver.find_element(By.XPATH, f"//button[text() = '{item}']")
        actions.move_to_element(next_page_click).perform()
        time.sleep(2)
        #print(next_page_click) - printing a web element itself will not give you usable information
        next_page_click.click()
        #let the next page load, it takes some time
        time.sleep(5)
    except:
        pass
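If the ActionChains scroll ever proves flaky, an alternative I'd suggest (not part of the answer above) is to let the browser scroll the button into view with JavaScript before clicking; this fragment would replace the two scrolling lines inside the same loop:

next_page_click = driver.find_element(By.XPATH, f"//button[text() = '{item}']")
# scroll the pagination button to the middle of the viewport before clicking it
driver.execute_script("arguments[0].scrollIntoView({block: 'center'});", next_page_click)
next_page_click.click()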
I am returning HTML with this Python script, but it doesn't return the price history (see screenshot). Using a non-Selenium browser does return HTML with the prices (they are there even without expanding this section, and can be found with a simple regex); Chrome/Safari/Firefox all do, incognito as well.
from selenium import webdriver
import time
url = 'https://www.google.com/flights?hl=en#flt=SFO.JFK.2021-06-01*JFK.SFO.2021-06-07'
options = webdriver.ChromeOptions()
driver = webdriver.Chrome(options=options)
driver.get(url)
time.sleep(10)
html = driver.page_source
print(html)
driver.quit()
I can't quite pinpoint whether it's some setting in chromedriver. It should be possible, because there is a 3rd-party scraper that currently returns this data.
I tried this to no avail: Can a website detect when you are using Selenium with chromedriver?
Any thoughts appreciated.
After I added chrome_options.add_argument("--disable-blink-features=AutomationControlled"), I started to see this block. I'm not sure why it is not always loaded.
import time

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.chrome.options import Options
url = 'https://www.google.com/flights?hl=en#flt=SFO.JFK.2021-06-01*JFK.SFO.2021-06-07'
chrome_options = Options()
chrome_options.add_argument("start-maximized")
chrome_options.add_argument("--disable-blink-features=AutomationControlled")
driver = webdriver.Chrome(executable_path='/snap/bin/chromium.chromedriver', chrome_options=chrome_options)
driver.get(url)
# wait = WebDriverWait(driver, 20)
# wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, ".EA71Tc.q7Eewe")))
time.sleep(10)
history = driver.find_element_by_css_selector(".EA71Tc.q7Eewe").get_attribute("innerHTML")
print(history)
Here the full block is returned, including all tag names. As you can see, I tried explicit waits, but this block was not visible. Experiment with adding another explicit wait.
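For example, since the block appears to exist in the DOM without being visible, a presence-based wait on the same selector might be a better fit than the commented-out visibility wait; a minimal sketch, reusing the selector from the code above:

wait = WebDriverWait(driver, 20)
# wait only for the element to exist in the DOM, not for it to be displayed
history = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, ".EA71Tc.q7Eewe")))
print(history.get_attribute("innerHTML"))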
The piece of code below clicks the File menu on a page that contains an Excel worksheet.
from selenium import webdriver
driver = webdriver.PhantomJS()
driver.set_window_size(1120, 550)
driver.get(r"foo%20Data%20235.xlsx&DefaultItemOpen=3") # dummy link
driver.find_element_by_css_selector('#jewel-button-middle > span').click() # responsible for clicking the file menu
driver.quit()
And I don't know how to click the first option, i.e., the "Download a Snapshot" option, in the popup menu. I am not able to inspect the elements of the popup or dropdown menu. I want the xlsx file to be downloaded.
The idea is to load the page with PhantomJS, wait for the contents of the workbook to load, and collect all the parameters needed for the download file handler endpoint request, which we can then make with the requests package.
Full working solution:
import json
import requests
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
WORKBOOK_TYPE = "PublishedItemsSnapshot"
driver = webdriver.PhantomJS()
driver.maximize_window()
driver.get('http://www.cbe.org.eg/en/EconomicResearch/Publications/_layouts/xlviewer.aspx?id=/MonthlyStatisticaclBulletinDL/External%20Sector%20Data%20235.xlsx&DefaultItemOpen=1#')
wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_element_located((By.ID, "ctl00_PlaceHolderMain_m_excelWebRenderer_ewaCtl_rowHeadersDiv")))
# get workbook uri
hidden_input = wait.until(EC.presence_of_element_located((By.ID, "ctl00_PlaceHolderMain_m_excelWebRenderer_ewaCtl_m_workbookContextJson")))
workbook_uri = json.loads(hidden_input.get_attribute('value'))['EncryptedWorkbookUri']
# get session id
session_id = driver.find_element_by_id("ctl00_PlaceHolderMain_m_excelWebRenderer_ewaCtl_m_workbookId").get_attribute("value")
# get workbook filename
workbook_filename = driver.find_element_by_xpath("//h2[contains(@class, 's4-mini-header')]/span[contains(., '.xlsx')]").text
driver.close()
print("Downloading workbook '%s'..." % workbook_filename)
response = requests.get("http://www.cbe.org.eg/en/EconomicResearch/Publications/_layouts/XlFileHandler.aspx", params={
    'id': workbook_uri,
    'sessionId': session_id,
    'workbookFileName': workbook_filename,
    'workbookType': WORKBOOK_TYPE
})
with open(workbook_filename, 'wb') as f:
    for chunk in response.iter_content(chunk_size=1024):
        if chunk:  # filter out keep-alive new chunks
            f.write(chunk)
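One hedged refinement on the download step: iter_content only streams the body in chunks if the request was made with stream=True; otherwise the whole file is buffered in memory before the loop runs. A small adjustment, assuming the same endpoint and parameters:

# stream=True keeps the response body out of memory until iter_content reads it
response = requests.get(
    "http://www.cbe.org.eg/en/EconomicResearch/Publications/_layouts/XlFileHandler.aspx",
    params={
        'id': workbook_uri,
        'sessionId': session_id,
        'workbookFileName': workbook_filename,
        'workbookType': WORKBOOK_TYPE,
    },
    stream=True,
)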
It is easier to inspect such elements (self-closing dropdowns) using Firefox: open the developer tools and just hover over the element with the mouse cursor after selecting the inspect option from the Firebug toolbar (marked with a red square in the picture).
As for the question, the locator you are looking for is ('[id*="DownloadSnapshot"] > span')
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.PhantomJS()
driver.set_window_size(1120, 550)
driver.get(r"foo%20Data%20235.xlsx&DefaultItemOpen=3") # dummy link
wait = WebDriverWait(driver, 10)
wait.until(EC.invisibility_of_element_located((By.CSS_SELECTOR, '[id*="loadingTitleText"]')))
driver.find_element_by_css_selector('#jewel-button-middle > span').click() # responsible for clicking the file menu
download = wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, '[id*="DownloadSnapshot"] > span')))
driver.get_screenshot_as_file('fileName')
download.click()
I observed that until the Excel workbook is completely loaded, the File menu does not show any options, so I added a wait until the workbook is loaded.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from time import sleep
from selenium.webdriver.common.action_chains import ActionChains
browser = webdriver.PhantomJS()
browser.maximize_window()
browser.get('http://www.cbe.org.eg/en/EconomicResearch/Publications/_layouts/xlviewer.aspx?id=/MonthlyStatisticaclBulletinDL/External%20Sector%20Data%20235.xlsx&DefaultItemOpen=1#')
wait = WebDriverWait(browser, 10)
element = wait.until(EC.visibility_of_element_located((By.XPATH, "//td[@data-range='B59']")))
element = wait.until(EC.element_to_be_clickable((By.ID, 'jewel-button-middle')))
element.click()
eleDownload = wait.until(EC.element_to_be_clickable((By.XPATH,"//span[text()='Download a Snapshot']")))
eleDownload.click()
sleep(5)
browser.quit()
Find the element by id/tag, inspect the options in a loop, select the one you want, then perform the click.
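A minimal sketch of that idea, using the older Selenium API as in the rest of this thread; the option-container selector here is a hypothetical placeholder, not taken from the actual page:

from selenium import webdriver

driver = webdriver.PhantomJS()
driver.set_window_size(1120, 550)
driver.get(r"foo%20Data%20235.xlsx&DefaultItemOpen=3")  # dummy link, as in the question

# open the File menu first, then walk its options
driver.find_element_by_css_selector('#jewel-button-middle > span').click()

# '[id*="Menu"] span' is a hypothetical selector; adjust it to the real markup
for option in driver.find_elements_by_css_selector('[id*="Menu"] span'):
    if 'Download a Snapshot' in option.text:  # pick the option by its visible text
        option.click()
        break

driver.quit()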