I'm trying to scrape a page using Selenium and Python. The page has a JavaScript paginator; when I click the button I can see the content reload, but when I try to get the new table information I still get the same old table info. Selenium doesn't notice that the DOM has changed. I'm aware of the stale DOM issue; I'm just looking for the best way to solve this problem.
for link in source.find_all('div', {'class': 'company-row d-flex'}):
    print(link.a.text, link.small.text, link.find('div', {'class': 'col-2'}).text)

# Next button (I'll make an iterator)
driver.find_element_by_xpath('//a[@href="hrefcurrentpage=2"]').click()

# Tried this and it doesn't work
# time.sleep(5)

# Here the table changes, but I get the same old info
for link in source.find_all('div', {'class': 'company-row d-flex'}):
    print(link.a.text, link.small.text, link.find('div', {'class': 'col-2'}).text)
I think you are getting the same data after opening the next page, even with the delay, because you are still reading from the source object you captured before the click.
So you should re-read (re-parse) the page source after clicking the pagination control, possibly with some delay or an explicit wait.
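A minimal sketch of that idea, assuming driver is your WebDriver and that the paginator re-renders the rows so the old elements go stale (that wait condition is an assumption about the page):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

# Keep a handle on one of the current rows, click "next",
# then wait until that old row goes stale (i.e. the table was re-rendered).
old_row = driver.find_element(By.CSS_SELECTOR, 'div.company-row.d-flex')
driver.find_element(By.XPATH, '//a[@href="hrefcurrentpage=2"]').click()
WebDriverWait(driver, 10).until(EC.staleness_of(old_row))

# Re-parse the *current* page source instead of reusing the old soup object
source = BeautifulSoup(driver.page_source, 'html.parser')
for link in source.find_all('div', {'class': 'company-row d-flex'}):
    print(link.a.text, link.small.text, link.find('div', {'class': 'col-2'}).text)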
I'm trying to use pd.read_html() to read the page I'm currently scraping with Selenium.
The only problem is that the web page does not contain a table until you press a few buttons (via Selenium button clicks), and only then is the table displayed.
So when I pass in the HTML string as an argument:
pd.read_html('html_string')
it gives me an error.
Is there a way to read in the current page after the buttons have been clicked, rather than just passing in the original HTML string as an argument?
I've also looked at the documentation for this and could not find anything to help.
Thanks for reading/answering
I would try to pass the page source instead of an address, once the source has been updated:
url = ...
button_id = ...
driver.get(url)
button = driver.find_element(By.ID, button_id)  # needs: from selenium.webdriver.common.by import By
button.click()
...  # wait here until the table has rendered
data = pd.read_html(driver.page_source)  # returns a list of DataFrames
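A slightly fuller sketch of the same idea, with an explicit wait in place of the "..." step (the placeholder values and the wait condition are assumptions for illustration):

from io import StringIO

import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

url = '...'        # placeholder, as above
button_id = '...'  # placeholder id of the button that reveals the table

driver = webdriver.Chrome()
driver.get(url)
driver.find_element(By.ID, button_id).click()

# Wait until a <table> element is actually in the DOM before handing the HTML to pandas
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.TAG_NAME, 'table')))

# read_html returns a list of DataFrames, one per <table> found on the page;
# wrapping in StringIO avoids the literal-HTML deprecation warning in newer pandas
tables = pd.read_html(StringIO(driver.page_source))
df = tables[0]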
I'm trying to crawl data from a dynamic website using Selenium. It requires an account to log in, and I must click some links to get to the information page. After doing all these steps, I found that the page source has not changed and I cannot get the elements that exist on the new page. On the other hand, when I go directly to this page and log in, the source code I get is still the parent page's. Can you explain why this happens and how to tackle this problem?
How I perform the click action:
element = driver.find_element(By.CLASS_NAME, "class_name")
element2 = element.find_element(By.CSS_SELECTOR, "css_element")
element2.click()
How I get the source code:
page_source = driver.execute_script("return document.body.outerHTML")
with open('a.html', 'w') as f:
    f.write(page_source)
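One common way to deal with this, continuing from the snippet above, is to wait for something that only exists on the new page before reading the source, and to switch window handles in case the click opens a new tab (the selector below is a made-up placeholder, not something from the real site):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

element2.click()

# If the link opens a new tab/window, the driver keeps pointing at the old one,
# so switch to the newest handle first.
driver.switch_to.window(driver.window_handles[-1])

# Wait for an element that only exists on the *new* page; ".new-page-marker"
# is a placeholder selector for illustration.
WebDriverWait(driver, 15).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, '.new-page-marker'))
)

page_source = driver.execute_script("return document.body.outerHTML")
with open('a.html', 'w') as f:
    f.write(page_source)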
I am having some trouble scraping the url below:
http://102.37.123.153/Lists/eTenders/AllItems.aspx
I am using Python with Selenium, but there are many "onclick" JavaScript events to run to get to the lowest level of information. Does anyone know how to automate this?
Thanks
url = 'http://102.37.123.153/Lists/eTenders/AllItems.aspx'
chrome_options = Options()
chrome_options.add_argument("--headless")
browser = webdriver.Chrome('c:/Users/AB/Dropbox/ITProjects/Scraping/chromedriver.exe', options=chrome_options)
res = browser.get(url)
time.sleep(10)
source = browser.page_source
soup = BeautifulSoup(source, 'html.parser')
for link in soup.find_all('a'):
    if link.get('href') == 'javascript:':
        print(link)
You don't need Selenium for this website; you need patience. Let me explain how you'd approach it.
Click X
Y opens, click Y
Z opens, click Z.
And so on.
What happens here is that when you click X, an AJAX request is made to get Y; after you click Y, another AJAX request is made to get Z, and so on.
So you can simply simulate those requests: open the Network tab, see how the page crafts each request, make the same requests in your code, read the response, and based on it issue the next request; the cycle goes on until you get to the innermost level of the tree.
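As an illustration only of that request-chaining pattern (the endpoint, parameters, and response shape below are hypothetical placeholders; the real ones have to be copied from the Network tab):

import requests

session = requests.Session()
BASE = 'http://example.com/ajax/endpoint'  # placeholder, not the real endpoint

resp_x = session.get(BASE, params={'node': 'X'})      # equivalent to clicking X
node_y = resp_x.json()['next_id']                     # assumes a JSON response

resp_y = session.get(BASE, params={'node': node_y})   # equivalent to clicking Y
node_z = resp_y.json()['next_id']

resp_z = session.get(BASE, params={'node': node_z})   # innermost level of the tree
print(resp_z.text)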
This approach has no UI and is, technically speaking, less friendly and harder to implement, but it's more efficient. On the other hand, you can just select your clickable elements with Selenium, like
elem = driver.find_element_by_xpath('x')
elem.click()
And it will also work.
I'd also note that sometimes links don't trigger AJAX at all; they just unhide info that's already in the source code. To know what you'll receive in your response, right-click on the website and choose "View page source", and note that this is different from "Inspect element".
I am trying to automatically collect articles from a database which first requires me to login.
I have written the following code using selenium to open up the search results page, then wait and allow me to login. That works, and it can get the links to each item in the search results.
I then want to continue using Selenium to visit each of the links in the search results and collect the article text.
browser = webdriver.Firefox()
browser.get("LINK")
time.sleep(60)
lnks = browser.find_elements_by_tag_name("a")[20:40]
for lnk in lnks:
    link = lnk.get_attribute('href')
    print(link)
I can't get any further. How should I then make it visit these links in turn and get the text of the articles for each one?
When I tried adding driver.get(link) inside the for loop, I got selenium.common.exceptions.StaleElementReferenceException.
On the request of the database owner, I have removed the screenshots previously posted in this post, as well as information about the database. I would like to delete the post completely, but am unable to do so.
You should look up some bs4 (BeautifulSoup) tutorials, but here is a starter:
html_source_code = browser.execute_script("return document.body.innerHTML;")
soup = bs4.BeautifulSoup(html_source_code, 'lxml')
links = soup.find_all('what-ever-the-html-code-is')

for l in links:
    print(l['href'])
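As for the StaleElementReferenceException mentioned in the question: one common pattern, continuing from the question's code, is to copy all the href strings into a plain list before navigating anywhere, since strings (unlike WebElement objects) cannot go stale. A sketch; the article tag used below is a placeholder assumption:

lnks = browser.find_elements_by_tag_name("a")[20:40]

# Extract the href strings *before* navigating away; navigation invalidates
# the old WebElement objects, but plain strings survive.
urls = [lnk.get_attribute('href') for lnk in lnks if lnk.get_attribute('href')]

articles = []
for url in urls:
    browser.get(url)
    time.sleep(5)  # or better, an explicit WebDriverWait on the article element
    # 'article' is a placeholder tag; use whatever wraps the article text on the site
    body = browser.find_element_by_tag_name('article').text
    articles.append((url, body))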
I am trying to get some comments off the car blog Jalopnik. The comments don't come with the web page initially; instead they get retrieved with some JavaScript, and you only get the featured comments. I need all the comments, so I would click "All" (between "Featured" and "Start a New Discussion") to get them.
To automate this, I tried learning Selenium. I modified their script from PyPI, guessing that the code for clicking a link was link.click() and link = browser.find_element_by_xpath(...). It doesn't look like the "All" button (displaying all comments) was pressed.
Ultimately I'd like to download the HTML of that version to parse.
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
import time
browser = webdriver.Firefox() # Get local session of firefox
browser.get("http://jalopnik.com/5912009/prius-driver-beat-up-after-taking-out-two-bikers/") # Load page
time.sleep(0.2)
link = browser.find_element_by_xpath("//a[@class='tc cn_showall']")
link.click()
browser.save_screenshot('screenie.png')
browser.close()
Using Firefox with the Firebug plugin, I browsed to http://jalopnik.com/5912009/prius-driver-beat-up-after-taking-out-two-bikers.
I then opened the Firebug console and clicked on ALL; it obligingly showed a single AJAX call to http://jalopnik.com/index.php?op=threadlist&post_id=5912009&mode=all&page=0&repliesmode=hide&nouser=true&selected_thread=null
Opening that url in a new window gets me the comment feed you are seeking.
More generally, if you substitute the appropriate article ID into that URL, you should be able to automate the process without Selenium.
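A small sketch of that idea, assuming the endpoint still behaves as described above (the helper function name is mine):

import requests

def fetch_all_comments(post_id, page=0):
    # URL pattern copied from the AJAX call observed in Firebug
    url = ('http://jalopnik.com/index.php?op=threadlist&post_id=%d&mode=all'
           '&page=%d&repliesmode=hide&nouser=true&selected_thread=null' % (post_id, page))
    resp = requests.get(url)
    resp.raise_for_status()
    return resp.text  # HTML fragment containing the full comment feed

html = fetch_all_comments(5912009)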