http://www.vliz.be/vmdcdata/mangroves/aphia.php?p=browser&id=235056&expand=true#ct
(That's the information I am trying to scrape)
I want to scrape this detailed taxonomic tree so that I can manipulate it any way I like.
But there are a few problems in getting this tree data.
I can't fully expand the taxonomic tree: when some nodes expand, others collapse, as the instructions indicate.
So saving the full page as an HTML file cannot solve my problem.
I could repeat the process several times to get separate files and concatenate them, but that seems like an ugly way to do it.
I am tired of clicking; there are so many "plus" signs and I have to wait after each one.
Is there a way to solve this using Python?
Use Selenium: it will expand the tree by clicking on the "plus" signs, and you can grab the entire DOM with all the elements in it after it's done:
from selenium import webdriver
import time

browser = webdriver.Chrome()
browser.get('http://www.vliz.be/vmdcdata/mangroves/aphia.php?p=browser&id=235301&expand=true#ct')

while True:
    try:
        # Find the next unexpanded "plus" node and click it
        # ([1] skips the first match, as in the original answer).
        elem = browser.find_elements_by_xpath('.//*[@src="http://www.marinespecies.org/images/aphia/pnode.gif" or @src="http://www.marinespecies.org/images/aphia/plastnode.gif"]')[1]
        elem.click()
        time.sleep(2)
    except Exception:
        # No clickable "plus" nodes left (or the click failed): we're done.
        break

content = browser.page_source
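Once the loop finishes, content holds the fully expanded DOM, which you can parse offline. A minimal sketch with BeautifulSoup follows; which tags to pull out is an assumption, so inspect the real markup first:

from bs4 import BeautifulSoup

soup = BeautifulSoup(content, 'html.parser')
# Assumption: each taxon in the tree is rendered as a link.
for link in soup.find_all('a'):
    print(link.get_text(strip=True))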
https://squidindustries.co/checkout
checkout_cc_number = driver.find_element_by_id("number")
checkout_cc_number.send_keys(card_number)
When I try to input information into the card number field, I get an error saying the element could not be located. I tried using time.sleep and driver.implicitly_wait when I first got to the page, but both failed. Any ideas?
The element is in a frame (i.e. a webpage within a webpage). Selenium will look for elements in the page it has loaded and not within frames. That's the problem.
To solve this we just need a bit more code, which will tell Selenium to look in the frame.
The example you've given is several pages deep into a shopping cart, so I'm going to use a much more accessible example instead: the Mozilla guide to iframes.
Here is some code to open that page and then click the CSS button within the frame:
from selenium import webdriver
import time

browser = webdriver.Chrome()
browser.get(r"https://developer.mozilla.org/en-US/docs/Web/HTML/Element/iframe")
time.sleep(5)  # give the page (and the iframe inside it) time to load

# Switch into the frame, click the button inside it, then switch back.
browser.switch_to.frame(browser.find_element_by_class_name("interactive"))
css_button = browser.find_element_by_id("css")
css_button.click()
browser.switch_to.default_content()
There are two lines that are important. The first one is:
browser.switch_to.frame(browser.find_element_by_class_name("interactive"))
That finds the frame and then switches to it. Once we have done that, any code that looks for elements will be looking in the frame and not in the page that we navigated to. That is what you need to do to access the number element. In your example the class of the frame is card-fields-iframe, so use that instead of interactive.
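Applied to your checkout code (untested against the live site), that would look something like:

driver.switch_to.frame(driver.find_element_by_class_name("card-fields-iframe"))
checkout_cc_number = driver.find_element_by_id("number")
checkout_cc_number.send_keys(card_number)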
The second important line is:
browser.switch_to.default_content()
That reverts the previous line. So now Selenium will be looking for elements within the page that we navigated to. You'll want to do that after interacting with the frame, so that you can continue through the shopping cart.
Have you tried getting the input element via the DOM? What happens if you run document.getElementById('number') in the browser console?
I ran into the same issue. With checkouts, as you mentioned, all the iframe class names are the same, so what I did was get all the iframes with that class name as a list:
from selenium.webdriver.common.by import By

iframes = driver.find_elements(By.CLASS_NAME, "card-fields-iframe")
I then switched through the iframes referencing each one by its place in the list. Since there are only four fields in the checkout, the list is only 4 elements long, starting with [0].
driver.switch_to.frame(iframes[0])
number = driver.find_element(By.ID, "number")
if number.is_displayed():
    number.send_keys("4000300040005000")
driver.switch_to.default_content()
It's important to note that switching back to the default content, using driver.switch_to.default_content(), before switching to the next frame is the only way I was able to make this work. The is_displayed() method just checks whether the element is on the page or not.
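A hypothetical sketch of visiting all four frames in turn; what lives inside each frame is an assumption, so inspect the page for the real field IDs:

for frame in iframes:
    driver.switch_to.frame(frame)
    # ... locate and fill the field inside this frame ...
    driver.switch_to.default_content()  # switch back before the next frame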
I want to learn how to download the CSV files for the last ten years using Python. I think this would be helpful:
https://www.usgovernmentspending.com/compare_state_debt
My attempts involve requests and pandas.
This is a multipart problem, and I'm going to outline the steps I think you should use.
The first part is simply downloading the webpage. I would suggest using something like requests for that.
Once you have that, you can use Beautiful Soup to parse the webpage.
I took a look at the website, and there are a number of ways you could get the data. I think the best way is to extract all the text from the particular part of the page that holds it.
Once you do that, you will probably need to clean up the data. I suggest using pandas for that. A sketch of these steps follows.
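A minimal sketch of those steps, assuming the data sits in an HTML table; the selector is a guess, so inspect the page to find the right element:

import requests
import pandas as pd
from io import StringIO
from bs4 import BeautifulSoup

url = "https://www.usgovernmentspending.com/compare_state_debt"
resp = requests.get(url)
resp.raise_for_status()

soup = BeautifulSoup(resp.text, "html.parser")
table = soup.find("table")  # assumption: the data is in the first <table>
df = pd.read_html(StringIO(str(table)))[0]  # parse the table into a DataFrame
df.to_csv("state_debt.csv", index=False)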
People on here aren't going to solve the whole problem for you. That said, if you get stuck along the way and have a specific question, Stack Overflow can probably help at that point.
Issue resolved: I managed to solve it using Selenium, by doing the following:
from selenium import webdriver  # allows launching the browser

# Opening in incognito (optional)
driver_option = webdriver.ChromeOptions()
# driver_option.add_argument("--incognito")

chromedriver_path = ''  # write your own chromedriver path here!

# Creating a webdriver.
def create_webdriver():
    return webdriver.Chrome(executable_path=chromedriver_path, options=driver_option)

browser = create_webdriver()
URL = ""
browser.get(URL)

# Finding and clicking the download button.
elem1 = browser.find_element_by_link_text("download file")
elem1.click()
I put the previous code in a loop over all the years up to 2020, and I got all the files in CSV format.
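A sketch of that loop; how the per-year URL or link is built depends on the actual page, so treat this as a hypothetical outline:

for year in range(2011, 2021):
    browser.get(URL)  # assumption: substitute the year into the URL here
    elem = browser.find_element_by_link_text("download file")
    elem.click()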
I am trying to grab a bunch numbers that are presented in a table on a web page that I’ve accessed using python and Selenium running headless on a Raspberry Pi. The numbers are not in the page source, rather they are deeply embedded in complex html served by several URLs called by the main page (the numbers update every few seconds). I know I could parse the html to get the numbers I want, but the numbers are already sitting on the front page in perfect format all in one place. I can select and copy the numbers when I view the web page in Chrome on my PC.
How can I use python and get Selenium webdriver to get me those numbers? Can Selenium simply provide all the visible text on a page? How? (I've tried driver.page_source but the text returned does not contain the numbers). Or is there a way to essentially copy text and numbers from a table visible on the screen using python and Selenium? (I’ve looked into xdotool but didn’t find enough documentation to help). I’m just learning Selenium so any suggestions will be much appreciated!
Well, I figured out the answer to my question. It's embarrassingly easy. This line gets just what I need - all the text that is visible on the web page:
page_text = driver.find_element_by_tag_name('body').text
So, there are a few different reasons why you might not be able to get some info on a page:
The information hasn't loaded yet. You must wait some time for the information to be ready; see this thread for a better understanding. Sometimes page elements are added dynamically with JS and load very slowly. (See the sketch after this list.)
The information may consist of a different type of data than you expect. For example, you may be waiting for text with numbers, but the page serves a picture of the numbers. In that situation you must change your programming tactics and use other functions to get what you need.
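For the first case, an explicit wait is usually the fix. A minimal sketch, assuming the numbers live in a <table>; adjust the locator and URL to the actual page:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://example.com")  # placeholder URL

# Wait up to 10 seconds for the element to appear in the DOM.
wait = WebDriverWait(driver, 10)
table = wait.until(EC.presence_of_element_located((By.TAG_NAME, "table")))
print(table.text)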
I'm trying to scrape data from this website called Anhembi.
But when I try all the Selenium options for finding elements, I get nothing. Does anyone know why this happens?
I've already tried:
driver.find_element_by_xpath('//*[@class="agenda_result_laco_box"]')
And I made a for-loop over the results to click on every single one and get the info I need, which consists of the day, website, and name of each event. How can I do that?
Clearly, there is an iframe involved. You need to switch the focus of your web driver in order to interact with elements that are inside an iframe/frameset/frame.
You can try with this code :
driver.get("http://www.anhembi.com.br/agenda/")
driver.switch_to.frame(driver.find_element_by_css_selector("iframe[src='http://intranet.spturis.com.br/intranet/modulos/booking/anhembisite_busca.php']"))
all_data = driver.find_elements_by_css_selector("div.agenda_result_laco_box")
print(len(all_data))
for data in all_data:
print(data.text)
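If you need to interact with anything outside the iframe afterwards, switch back out first:

driver.switch_to.default_content()  # leave the iframe when done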
I am scraping a webpage using Selenium in Python. I am able to locate the elements using this code:
from selenium import webdriver
import codecs
driver = webdriver.Chrome()
driver.get("url")
results_table = driver.find_elements_by_xpath('//*[@id="content"]/table[1]/tbody/tr')
Each element in results_table is in turn a set of sub-elements, with the number of sub-elements varying from element to element. My goal is to output each element, as a list or as a delimited string, into an output file. My code so far is this:
results_file = codecs.open(path + "results.txt", "w", "cp1252")
for i, element in enumerate(results_table):
    element_fields = element.find_elements_by_xpath(".//*[text()][count(*)=0]")
    element_list = [field.text for field in element_fields]
    stuff_to_write = '#'.join(element_list) + "\r\n"
    results_file.write(stuff_to_write)
    # print(i)
results_file.close()
driver.quit()
This second part of the code takes about 2.5 minutes for a list of ~400 elements, each with about 10 sub-elements. I get the desired output, but it is too slow. What could I do to improve the performance?
Using python 3.6
Download the whole page in one shot, then use something like BeautifulSoup to process it. I haven't used Splinter or Selenium in a while, but in Splinter, .html will give you the page. I'm not sure what the syntax for that is in Selenium, but there should be a way to grab the whole page.
Selenium (and Splinter, which is layered on top of Selenium) are notoriously slow for randomly accessing web page content. It looks like .page_source gives the entire contents of the page in Selenium, which I found at stackoverflow.com/questions/35486374/…. If reading all the chunks on the page one at a time is killing your performance (and it probably is), reading the whole page once and processing it offline will be oodles faster.
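A sketch of that offline approach, reusing the XPath from the question; lxml evaluates the same XPath expressions, so this should be a drop-in for the slow loop (untested against the actual page):

from lxml import html

tree = html.fromstring(driver.page_source)  # one round-trip to the browser
rows = tree.xpath('//*[@id="content"]/table[1]/tbody/tr')
with open("results.txt", "w", encoding="cp1252") as f:
    for row in rows:
        fields = row.xpath('.//*[text()][count(*)=0]')
        f.write('#'.join(el.text_content() for el in fields) + "\r\n")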