I am trying to make a crawler/auto clicker in Python. My goal is to click every row header in a table to expand it and show the nested rows. I cannot seem to find the correct selector to use; I tried driver.find_elements_id and driver.find_elements_xpath, but neither is working.
Here is what I am using:
PATH = "C:/Users/Downloads/chromedriver.exe"
driver = webdriver.Chrome(PATH)
driver.get("https://int.soccerway.com/")
link = driver.find_elements_id("tr")
link.click()
and here is a snippet of the website
<tr class="group-head clickable " id="date_matches-1886" stage-value="14">
<th colspan="4"><h3><span class="flag_16 left_16 australia_16_left">Australia - Queensland NPL Youth League</span></h3></th>
<th class="competition-link"> <span>More…</span></th>
</tr>
The id should be date_matches-1886, not tr, in your example (tr is a tag name, not an id). You may be interested in find_element_by_tag_name, but I think it is better to use find_elements_by_class_name('clickable') in your case. Have a look at Locating Elements for more.
from selenium import webdriver
PATH = "C:/Users/Downloads/chromedriver.exe"
driver = webdriver.Chrome(PATH)
driver.get('https://int.soccerway.com/')
link = driver.find_elements_by_class_name('clickable')
link[0].click()
link[2].click()
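If the goal is to expand every row header rather than specific ones, a loop over all matches may work; a minimal sketch, assuming the page does not re-render the rows after each click (otherwise the references go stale and have to be re-fetched):
# Click every expandable group header in turn.
for row in driver.find_elements_by_class_name("clickable"):
    row.click()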
I use Python Selenium to do web scraping, and I would like to match links that both fall on a specific date (like [01-20]) and have a title with specific text in it (like 'public'). How can the code satisfy both conditions?
I tried the following, but no luck.
Thank you in advance!
HTML:
<td width="89%" height="26">
sth sth public
</td>
<td width="8%" align="center">[01-20]</td>
</tr>
code:
titles = driver.find_elements_by_css_selector("[title*='public']")
for title in titles:
    links = [title.get_attribute('href') for title in driver.find_elements_by_xpath("//td[text()='[01-20]']/preceding::td[1]/a")]
    urls = [links.get_attribute("href") for links in driver.find_elements_by_css_selector("[title*='public']")]
    for url in urls:
        print(url)
        driver.get(url)
        # do something
Use the and keyword and the contains() function in XPath:
'//td[text()="[01-20]"]/preceding::td[1]/a[contains(@title, "资本")]'
check this video for more info
EDIT: changed xpath to a working answer
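Put together, a minimal sketch of how that XPath might be used; the date string and the 'public' keyword are placeholders taken from the question, not from the real page:
# Placeholder date and title keyword from the question; adjust for the real page.
xpath = '//td[text()="[01-20]"]/preceding::td[1]/a[contains(@title, "public")]'
links = [a.get_attribute("href") for a in driver.find_elements_by_xpath(xpath)]
for url in links:
    print(url)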
I am trying to scrape https://data.bls.gov/cgi-bin/surveymost?bls and was able to figure out how to crawl through clicks to get to a table.
The selection I am practicing on: check the checkbox associated with "Employment Cost Index (ECI) Civilian (Unadjusted) - CIU1010000000000A" under Compensation, then select "Retrieve data".
Once those two steps are processed, a table shows. This is the table I am trying to scrape.
Below is the code that I have as of right now.
Note that you have to put your own path for your browser driver where I have put <browser driver>.
from bs4 import BeautifulSoup
from urllib.request import urlopen
import pandas as pd
import numpy as np
import requests
import lxml.html as lh
from selenium import webdriver
url = "https://data.bls.gov/cgi-bin/surveymost?bls"
ChromeSource = r"<browser driver>"
# Open up a Chrome browser and navigate to web page.
options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')
options.add_argument('--incognito')
options.add_argument('--headless') # will run without opening browser.
driver = webdriver.Chrome(ChromeSource, chrome_options=options)
driver.get(url)
driver.find_element_by_xpath("//input[#type='checkbox' and #value = 'CIU1010000000000A']").click()
driver.find_element_by_xpath("//input[#type='Submit' and #value = 'Retrieve data']").click()
i = 2

def myTEST(i):
    xpath = '//*[@id="col' + str(i) + '"]'
    TEST = driver.find_elements_by_xpath(xpath)
    num_page_items = len(TEST)
    for i in range(num_page_items):
        print(TEST[i].text)

myTEST(i)
# Clean up (close browser once completed task).
driver.close()
Right now this is only looking at the headers. I would like to get the table content as well.
If I make i = 0, it produces "Year". With i = 1, it produces "Period". But if I select i = 2, I get two elements that share the same col2 id: "Estimated Value" and "Standard Error".
I tried to think of a way to work around this and can't seem to get anything that I have researched to work.
In essence, it would be better to start at the point where I am done clicking and am at the table of interest, then look at the XPath of the header row and pull in the text of all of its <th> children.
<tr>
    <th id="col0"> Year </th>
    <th id="col1"> Period </th>
    <th id="col2">Estimated Value</th>
    <th id="col2">Standard Error</th>
</tr>
I am not sure how to do that. I also tried to loop through i, but two headers sharing one id obviously causes an issue.
Once I am able to get the header, I want to get the contents. I could use some insight on whether I am on the right path, overthinking it, or whether there is a simpler way to do all of this. I am learning, and this is my first attempt using the Selenium library for clicks. I just want to get it to work so I can try it again on a different table and make it as automated or reusable (with tweaking) as possible.
Actually, you don't need Selenium. You can just track the POST form data and send the same fields in your own POST request.
Then you can load the table using Pandas easily.
import requests
import pandas as pd

data = {
    "series_id": "CIU1010000000000A",
    "survey": "bls"
}

def main(url):
    r = requests.post(url, data=data)
    df = pd.read_html(r.content)[1]
    print(df)

main("https://data.bls.gov/cgi-bin/surveymost")
Explanation:
Open the site.
Select "Employment Cost Index (ECI) Civilian (Unadjusted) - CIU1010000000000A".
Now open your browser's Developer Tools and navigate to the Network Monitor section, e.g. press Ctrl + Shift + E (Command + Option + E on a Mac).
Click "Retrieve data", and you will see a POST request being made.
Navigate to the Params tab.
Now you can make the same POST request yourself. Since the table is present in the HTML source and is not loaded via JavaScript, you can parse it with bs4 or read it in a nice format using pandas.read_html().
Note: you can read the table as long as it's not loaded via JavaScript. Otherwise you can try to track the XHR request (check the previous answer), or use selenium or requests_html to render the JS, since requests is an HTTP library that can't render it for you.
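For completeness, a rough bs4 equivalent of the pandas call above; the table index mirrors the [1] used with read_html and may need adjusting:
import requests
from bs4 import BeautifulSoup

data = {"series_id": "CIU1010000000000A", "survey": "bls"}
r = requests.post("https://data.bls.gov/cgi-bin/surveymost", data=data)
soup = BeautifulSoup(r.content, "lxml")
table = soup.find_all("table")[1]  # index assumed to match read_html above
for row in table.find_all("tr"):
    print([cell.get_text(strip=True) for cell in row.find_all(["th", "td"])])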
I'm trying to create a small script to tell me if addresses need a certain type of shipping.
I have a list of addresses to input into a website, and it will return what type they are. Why is this returning None, even though when I inspect the element in Selenium it's there? Technically it has to be there to even pass the EC.presence_of_element_located check.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

browser = webdriver.Chrome()
browser.get('courier_website')
field = browser.find_element_by_id("txt-address-auto-complete")
field.send_keys("12 test Street")
WebDriverWait(browser, 10).until(EC.element_to_be_clickable((By.XPATH, "//li[@class='ui-menu-item']/a[contains(@id, 'ui-id-')]")))
browser.find_element_by_xpath("//li[@class='ui-menu-item']/a[contains(@id, 'ui-id-')]").click()
WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.XPATH, '//*[@id="delivery-details-addresstype" and text() != ""]')))
post = browser.find_element_by_xpath('//*[@id="delivery-details-addresstype"]').get_attribute('value')
print(post)
Output is "None"
HTML I'm trying to get the text out of
<table class="delivery-details">
<tbody><tr>
<th colspan="3" id="delivery-details-addresstype">Residential Delivery Zone Address</th>
</tr>
A <th> element has no value attribute, so get_attribute('value') returns None. Instead of browser.find_element_by_xpath("//required_path").get_attribute('value'), use:
browser.find_element_by_xpath("//required_path").get_attribute('innerHTML')
In some cases 'textContent' has worked for me:
browser.find_element_by_xpath("//required_path").get_attribute('textContent')
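Applied to the markup in the question, a small sketch of the difference; the value line is what currently prints None:
header = browser.find_element_by_xpath('//*[@id="delivery-details-addresstype"]')
print(header.get_attribute('value'))        # None: a <th> has no value attribute
print(header.get_attribute('innerHTML'))    # "Residential Delivery Zone Address"
print(header.get_attribute('textContent'))  # same text, even for hidden elements
print(header.text)                          # also works when the element is visible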
Given this code ("sleep" instances used to help display what's going on):
from splinter import Browser
import time

with Browser() as browser:
    # Visit URL
    url = "https://mdoe.state.mi.us/moecs/PublicCredentialSearch.aspx"
    browser.visit(url)
    browser.fill('ctl00$ContentPlaceHolder1$txtCredentialNumber', 'IF0000000262422')
    # Find and click the 'search' button
    button = browser.find_by_name('ctl00$ContentPlaceHolder1$btnSearch')
    # Interact with elements
    button.first.click()
    time.sleep(5)
    # Only click the link next to "Professional Teaching Certificate Renewal"
    certificate_link = browser.find_by_xpath("//td[. = 'Professional Teaching Certificate Renewal']/following-sibling::td/a")
    certificate_link.first.click()
    time.sleep(10)
I am now trying to get the values from the table that shows after this code runs. I am not well-versed in xpath commands, but based on the response to this question, I have tried these, to no avail:
name = browser.find_by_xpath("//td[. ='Name']/following-sibling::td/a")
name = browser.find_by_xpath("//td[. ='Name']/following-sibling::td/[1]")
name = browser.find_by_xpath("//td[. ='Name']/following-sibling::td/[2]")
I tried [2] because I do notice a colon (:) sibling character between "Name" and the cell containing the name. I just want the string value of the name itself (and all other values in the table).
I do notice a different structure (span is used within td instead of just td) in this case (I also tried td span[. ='Name']... but no dice):
Updated to show more detail
<tr>
<td>
<span class="MOECSBold">Name</span>
</td>
<td>:</td>
<td>
<span id="ContentPlaceHolder1_lblName" class="MOECSNormal">MICHAEL WILLIAM LANCE </span>
</td>
</tr>
This ended up working:
browser.find_by_xpath("//td[span='Name']/following-sibling::td")[1].value
I am trying to extract the information from a link on a page that is structured like this:
...
<td align="left" bgcolor="#FFFFFF">$725,000</td>
<td align="left" bgcolor="#FFFFFF"> Available</td>
*<td align="left" bgcolor="#FFFFFF">
<a href="/washington">
Washington Street Studios
<br>1410 Washington Street SW<br>Albany, Oregon, 97321
</a>
</td>*
<td align="center" bgcolor="#FFFFFF">15</td>
<td align="center" bgcolor="#FFFFFF">8.49%</td>
<td align="center" bgcolor="#FFFFFF">$48,333</td>
</tr>
I tried targeting elements with the attribute align="left" and iterating over them, but that didn't work out. If anybody could help me locate elements like <a href="/washington"> (there are multiple tags like these within the same page) with Selenium, I would appreciate it.
I would use lxml instead, if it is just to process HTML...
It would help if you were more specific, but you can try this if you are traversing the links in a webpage:
from lxml.html import parse
pdoc = parse(url_of_webpage)
doc = pdoc.getroot()
list_of_links = [i[2] for i in doc.iterlinks()]
list_of_links will look like ['/en/images/logo_com.gif', 'http://www.brand.com/', '/en/images/logo.gif']
doc.iterlinks() looks for all links (form, img, and a tags) and yields tuples of (element, attribute, url, position), so the line list_of_links = [i[2] for i in doc.iterlinks()] simply grabs the url from each tuple and returns them as a separate list.
Note that the retrieved urls are relative, i.e. you will see urls like
'/en/images/logo_com.gif'
instead of
'http://somedomain.com/en/images/logo_com.gif'
If you want the latter kind of url, add the line marked below:
from lxml.html import parse
pdoc = parse(url_of_webpage)
doc = pdoc.getroot()
doc.make_links_absolute() # add this line
list_of_links = [i[2] for i in doc.iterlinks()]
If you are processing the urls one by one, then simply modify the code to something like:
for i in doc.iterlinks():
    url = i[2]
    # some processing here with url...
Finally, if for some reason you need selenium to come in to fetch the webpage content, then simply add the following to the beginning:
from selenium import webdriver
from io import StringIO

browser = webdriver.Firefox()
browser.get(url)
doc = parse(StringIO(browser.page_source)).getroot()
From what you have provided so far, there is a table and you have the desired links in a specific column. There are no "data-oriented" attributes to rely on, but using the column index to locate the links looks good enough:
for row in driver.find_elements_by_css_selector("table#myid tr"):
    cells = row.find_elements_by_tag_name("td")
    print(cells[2].text)  # put the correct column index here
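If it's the href you need rather than the cell text, each cell can be queried for its anchor; a sketch using the same placeholder selector:
for row in driver.find_elements_by_css_selector("table#myid tr"):
    cells = row.find_elements_by_tag_name("td")
    if len(cells) > 2:  # header rows have <th> cells, not <td>, so they are skipped
        anchors = cells[2].find_elements_by_tag_name("a")
        if anchors:
            print(anchors[0].get_attribute("href"))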