Python selenium crawling

Here is my code:
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://tieba.baidu.com/f?kw=比特币&ie=utf-8&tab=good')
driver.find_elements_by_css_selector('a.j_th_tit')[0].click()
a = driver.find_elements_by_css_selector('div.d_post_content.j_d_post_content.clearfix')
for i in a:
    print(i.text)
Here is the HTML I'm struggling with. There are many text blocks on the page, but they all have the same class: d_post_content j_d_post_content clearfix.
<div id='post_content_52497574149' class='d_post_content j_d_post_content clearfix' style='display:;'> Here is the text that I need to get; it is written in Chinese and Stack Overflow may not permit me to write Chinese in the body </div>
I want to automatically access the website and get some text for my homework assignment. With the code above I can open the website and click the link, but I cannot get at the text I need. All of the text I need is in that class, so I tried to access the class to get it, but it didn't work. When I check the length of the list a, len(a) is zero. Could anyone help me?

This line brings you to a new tab:
driver.find_elements_by_css_selector('a.j_th_tit')[0].click()
So you need to switch to it first. After performing the click above, just add this line:
driver.switch_to.window(driver.window_handles[-1])

When you click the link in this statement:
driver.find_elements_by_css_selector('a.j_th_tit')[0].click()
A new tab is opened. But you are not switching to that tab.
I would recommend adding this statement:
driver.switch_to.window(driver.window_handles[-1])
before you call find_elements_by_css_selector again. That should solve your issue.
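Putting the two pieces together, here is a minimal sketch of the corrected flow, using the same pre-Selenium-4 locator methods as your original code:

from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://tieba.baidu.com/f?kw=比特币&ie=utf-8&tab=good')

# Clicking the thread title opens a new tab...
driver.find_elements_by_css_selector('a.j_th_tit')[0].click()

# ...so switch to the newest window handle before looking for the posts.
driver.switch_to.window(driver.window_handles[-1])

posts = driver.find_elements_by_css_selector('div.d_post_content.j_d_post_content.clearfix')
for post in posts:
    print(post.text)

After the switch, len(posts) should no longer be zero.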

Related

How to open and scrape multiple links with Selenium

I am new to scraping with Python and have encountered a weird issue.
I am attempting to scrape OCR'd newspaper articles from a list of URLs using Selenium -- the proxy settings on the data source make this easier than other options.
However, I receive tracebacks for the text data every time I run my code. Here is the code that I am using:
article_links = []
for link in driver.find_elements_by_xpath('/html/body/div[1]/main/section[1]/ul[2]/li[*]/div[2]/div[1]/h3/a'):
    links = link.get_attribute("href")
    article_links.append(links)

articles = []
for article in article_links:
    driver.switch_to.window(driver.window_handles[-1])
    driver.get(article)
    driver.find_element_by_css_selector("#js-doc-explorer-show-additional-views").click()
    time.sleep(1)
    for article_text in driver.find_elements_by_css_selector("#ocr-container > div.fulltext-ocr.js-page-ocr"):
        articles.append(article_text)
I come closest to solving the issue by using .click(), which opens a hidden panel for my data. However, with this code, the only row that ends up with data is the last row in the dataset. Without the .click(), all rows come back with nothing. Changing the sleep settings does not help either.
The Xpath for the text data is:
/html/body/div[2]/main/section/div[2]/div[2]/section[2]/div/div[4]/text()
Alternatively, is there a way to get each link's source code and parse it with beautifulsoup after the fact?
UPDATE: There has to be something wrong with the loops -- I can get either the first or last values, but nothing in between.
In more recent versions of Selenium, the method find_elements_by_xpath() is deprecated. Is that the issue you are facing? If it is, add from selenium.webdriver.common.by import By and change the call to find_elements(By.XPATH, ...). Similarly, find_elements_by_css_selector() is replaced by find_elements(By.CSS_SELECTOR, ...).
You don't specify if this is even the issue, but if it is, I hope this helps :-)
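For reference, a minimal sketch of that change, reusing the two selectors from your own code (assuming Selenium 4):

from selenium.webdriver.common.by import By

# Selenium 4 style equivalents of the deprecated helpers
links = driver.find_elements(By.XPATH, '/html/body/div[1]/main/section[1]/ul[2]/li[*]/div[2]/div[1]/h3/a')
panel = driver.find_element(By.CSS_SELECTOR, '#js-doc-explorer-show-additional-views')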
The solution is found by calling the relevant (unique) class and specifying that it must contain text.
news = []
for article in article_links:
    driver2.get(article)
    driver2.find_element(By.CSS_SELECTOR, "#js-doc-explorer-show-additional-views").click()
    article_text = driver2.find_element(By.XPATH, '//div[@class="fulltext-ocr js-page-ocr"][contains(text()," ")]')
    news.append([article_text.text])

Selenium parsing whole document instead of webelement

this problem is really driving me crazy! Here's my code:
list_divs = driver.find_elements_by_xpath("//div[@class='myclass']")
print(f'Number of divs found: {len(list_divs)}')  # Correct number displayed
for art in list_divs:
    mybtn = art.find_elements_by_xpath('//button')  # There are 2 buttons in each div
    print(f'Number of buttons found = {len(mybtn)}')  # Incorrect number (129 instead of 2)
    mybtn[2].click()  # Wrong button clicked!
The button clicked IS NOT in the art HTML but at the very beginning of the webpage! It seems like Selenium is parsing the whole document instead of the webelement art...
I've printed the outerHTML of the variable art and it's correct: only the div code, which contains the 2 buttons! So why is find_elements_by_xpath(), applied to the webelement art, parsing the whole HTML page instead of just the div?
Totally incomprehensible to me!
Because you are using mybtn = art.find_elements_by_xpath('//button'), and //button ignores your search context since it starts with //. Change it to:
mybtn = art.find_elements_by_xpath('.//button')
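For example, a sketch of the corrected loop, keeping the placeholder class name from your code:

list_divs = driver.find_elements_by_xpath("//div[@class='myclass']")
for art in list_divs:
    # The leading dot keeps the search scoped to the current div
    mybtn = art.find_elements_by_xpath('.//button')
    print(f'Number of buttons found = {len(mybtn)}')  # should now be 2
    mybtn[1].click()  # e.g. click the second button of this div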
I can't post any HTML code (the page is about 1,000 lines long).
So far, the only way I have found around this is to avoid searching within webelements and instead run the search over the entire webpage for each element I need:
list_divs = driver.find_elements(By.XPATH, "//div[@class='myclass']")
buttons = driver.find_elements(By.XPATH, "//div[@class='myclass']//button")
and then iterate through the lists to access the button I need for each div. It works perfectly like this. I still don't understand how an XPath applied to a given element can return something that is not inside that element's HTML...
I'll run more tests with other webpages to see if the problem comes from Selenium.
Thanks for the help!
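If you do stay with the two whole-page queries, one way to pair them up is to assume every div really contains exactly two buttons, so the flat button list lines up two per div (a sketch, not tested against the actual page):

list_divs = driver.find_elements(By.XPATH, "//div[@class='myclass']")
buttons = driver.find_elements(By.XPATH, "//div[@class='myclass']//button")
for i, div in enumerate(list_divs):
    first_btn = buttons[2 * i]
    second_btn = buttons[2 * i + 1]
    second_btn.click()  # e.g. work with the second button of this div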

Search results don't change URL - Web Scraping with Python and Selenium

I am trying to create a python script to scrape the public county records website. I ultimately want to be able to have a list of owner names and the script run through all the names and pull the most recent deed of trust information (lender name and date filed). For the code below, I just wrote the owner name as a string 'ANCHOR EQUITIES LTD'.
I have used Selenium to automate entering the owner name into the form boxes, but when the return key is pressed and my results are shown, the website URL does not change. I try to locate the specific text in the table using XPath, but the path does not exist when I look for it. I have concluded the path does not exist because the search happens on the first page, where no results are shown. BeautifulSoup4 wouldn't work in this situation because parsing the URL would only return the blank search form HTML.
See my code below:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
browser = webdriver.Chrome()
browser.get('http://deed.co.travis.tx.us/ords/f?p=105:5:0::NO:::#results')
ownerName = browser.find_element_by_id("P5_GRANTOR_FULLNAME")
ownerName.send_keys('ANCHOR EQUITIES LTD')
docType = browser.find_element_by_id("P5_DOCUMENT_TYPE")
docType.send_keys("deed of trust")
ownerName.send_keys(Keys.RETURN)
print(browser.page_source)
#lenderName = browser.find_element_by_xpath("//*[@id=\"report_results\"]/tbody[2]/tr/td/table/tbody/tr[25]/td[9]/text()")
I have commented out the variable that is giving me trouble. Please help!
If I am not explaining my problem correctly, please feel free to ask and I will clear up any questions.
I think you almost have it.
You match the element you are interested in using:
lenderNameElement = browser.find_element_by_xpath("//*[@id=\"report_results\"]/tbody[2]/tr/td/table/tbody/tr[25]/td[9]")
Next you access the text of that element:
lenderName = lenderNameElement.text
Or in a single step:
lenderName = browser.find_element_by_xpath("//*[@id=\"report_results\"]/tbody[2]/tr/td/table/tbody/tr[25]/td[9]").text
Have you tried the following XPath?
//table[contains(@summary,"Search Results")]/tbody/tr
I have checked and it works perfectly. With it, you just have to iterate over each tr.
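A hedged sketch of that iteration (the column index for the lender name is only a guess, taken from the td[9] in your own XPath):

rows = browser.find_elements_by_xpath('//table[contains(@summary, "Search Results")]/tbody/tr')
for row in rows:
    cells = row.find_elements_by_xpath('./td')
    if len(cells) >= 9:
        print(cells[8].text)  # 9th column, matching td[9] in the question's XPath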

Using Python and Selenium why am I unable to find link by link text?

I have a list webelement that has a bunch of links within it. The html looks like:
<li>
<span class="ss-icon"></span> Remove
<a href="/sessions/new"><span class="ss-icon"></span> Sign in to save items</a
...
When I try to do something like:
link = element.find_element_by_link_text('Sign in to save items')
I get an error that says:
NoSuchElementException: Message: Unable to locate element:
{"method":"link text","selector":"Sign in to save items"}
I have been able to find this link by instead doing a find_elements_by_tag_name('a') and then just using the link with the correct HREF, but I would like to understand why the first method fails.
It has happened to me before that the find_element_by_link_text method sometimes works and sometimes doesn't, even on the same page. I don't think it's a reliable way to locate elements; the most reliable way is to use find_element_by_id.
But in your case, as I visit the page, there is no id to help you. You can still try find_element_by_xpath in 2 ways:
1- Accessing title: find_element_by_xpath("//a[contains(@title, 'Sign in to save items')]")
2- Accessing text: find_element_by_xpath("//a[contains(text(), 'Sign in to save items')]")
Hope it helps.
The problem is, most likely, in the extra spaces before and/or after the link text. You can still approach it with a "partial link text" match:
element.find_element_by_partial_link_text('Sign in to save items')
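For example, with the markup from your question, either of these should be more forgiving of the surrounding whitespace (a rough sketch; element is the <li> wrapper from your snippet):

# partial link text only needs a substring of the rendered text
link = element.find_element_by_partial_link_text('Sign in to save items')
# or an XPath that normalizes the whitespace first
link = element.find_element_by_xpath('.//a[contains(normalize-space(.), "Sign in to save items")]')
link.click()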

Python Splinter clicking button CSS

I'm having trouble selecting a button in my Splinter script using the find_by_css method. The documentation is sparse at best, and I haven't found a lot of good articles out there with examples.
br.find_by_css('div#edit-field-download-files-und-0 a.button.launcher').first.click()
...where br is my browser instance.
I've tried a few different ways of writing it. I'm really not sure how I'm supposed to do it because the documentation doesn't give any hard examples of the syntax.
Here's a screenshot of the element.
Sorry the screenshot kind of sucks.
Does anyone have any experience with this?
The CSS selector looks alright; I'm just not sure where you got find_by_css as a method. How about this:
br.find_element_by_css_selector("div#edit-field-download-files-und-0 a.button.launcher").click()
Selenium provides the following methods to locate elements in a page:
find_element_by_id
find_element_by_name
find_element_by_xpath
find_element_by_link_text
find_element_by_partial_link_text
find_element_by_tag_name
find_element_by_class_name
find_element_by_css_selector
To find multiple elements (these methods will return a list):
find_elements_by_name
find_elements_by_xpath
find_elements_by_link_text
find_elements_by_partial_link_text
find_elements_by_tag_name
find_elements_by_class_name
find_elements_by_css_selector
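For instance, the single and plural forms side by side (a generic sketch; the selector is the one from your question):

button = driver.find_element_by_css_selector('div#edit-field-download-files-und-0 a.button.launcher')   # first match, or NoSuchElementException
buttons = driver.find_elements_by_css_selector('div#edit-field-download-files-und-0 a.button.launcher') # list, possibly empty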
I'm working on something similar where I'm trying to click stuff on a webpage. The documentation for find_by_css() is very poor and you need to type the css path to the element you want to click.
Say we want to go to the about tab on python.org
from splinter import Browser
from time import sleep

with Browser() as browser:  # <-- Create browser instance (firefox default driver)
    browser.visit('http://www.python.org')  # <-- Visits url string
    browser.find_by_css('#about > a').click()
    # ^-- Put css path here in quotes
    sleep(5)
If your connection is good you might not get the chance to see the about tab getting clicked but you should end up on the about page.
The hard part is figuring out the css path to an element. However once you have it, the find_by_css() method looks pretty easy
I like the W3Schools reference for CSS selection parameters: http://www.w3schools.com/cssref/css_selectors.asp
As for your code... I recommend breaking this down into a few steps, at least during debug. The call to br.find_by_css('css_string') returns a list of elements. So you can grab that list and check the count.
elems = br.find_by_css('div#edit-field-download-files-und-0 a.button.launcher')
if len(elems) == 1:
    elems.first.click()
If you don't check the length of the returned list and call '.first' on an empty list, you'll get an exception. If len > 1, you're probably getting things you don't expect.
Each id on a page is unique, and you can daisy-chain searches, so you can use a few different statements to make this happen:
id_elems = br.find_by_id('edit-field-download-files-und-0')
if id_elems:
    id_elem = id_elems.first
    a_elems = id_elem.find_by_tag("a")
    for e in a_elems:
        if e.has_class("button launcher"):
            print('Found it!')
            e.click()
This is, of course, just one of many ways to do this.
Lastly, Splinter is a wrapper around Selenium and other webdrivers. It's possible that, even after you find the element to click, the actual click won't do anything. If this happens, you can also try clicking on the wrapped Selenium object, available as e._element. So you could try e._element.click() if necessary.
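A minimal sketch of that fallback, reusing the selector from the question:

elems = br.find_by_css('div#edit-field-download-files-und-0 a.button.launcher')
if elems:
    e = elems.first
    # If e.click() appears to do nothing, try the wrapped Selenium WebElement:
    e._element.click()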
