I need to scrape this webpage: https://br.advfn.com/bolsa-de-valores/bovespa/aper3/balanco
My Selenium code crashes most of the time, and I noticed that the data I need is not loaded by scripts, so I don't need Selenium for this specific webpage. I tried this instead:
import time

import requests
from lxml import etree

def minerar_advfns(ticker):
    quote = 'https://br.advfn.com/bolsa-de-valores/bovespa/' + ticker + '/balanco'
    page = requests.get(quote)
    time.sleep(0.9)
    dom = etree.HTML(page.content)
    # save what requests received so I can inspect it
    f = open('www.htm', 'w')
    temp1 = str(page.content)
    f.write(temp1)
    f.close()
    lucro = dom.xpath('//*[@id="financials_table_2"]/table/tbody/tr[11]/td[2]/text()')[0]
    print(lucro)
I'm trying to scrape the data point '1.605' with this code. It stops at the penultimate line with 'IndexError: list index out of range'.
When I check the saved 'www.htm' file, the '1.605' value is there, exactly where my XPath points. I tried with and without '/text()' and '[0]'; if I remove the [0] it prints '[]'.
Your XPath has the tbody problem: browsers insert a tbody element into the DOM they render, but it is often absent from the raw HTML that requests downloads, so any path that hard-codes /tbody/ matches nothing there.
You can change //*[@id="financials_table_2"]/table/tbody/tr[11]/td[2]/text()
to //*[@id="financials_table_2"]/table//tr[11]/td[2]/text()
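For illustration, a minimal sketch of the function with only the XPath changed; the URL and the tr[11]/td[2] position are taken from the question, and whether that cell still holds the '1.605' figure is an assumption about the live page:

import requests
from lxml import etree

def minerar_advfns(ticker):
    quote = 'https://br.advfn.com/bolsa-de-valores/bovespa/' + ticker + '/balanco'
    page = requests.get(quote)
    dom = etree.HTML(page.content)
    # '//' bridges the optional tbody, so the row matches with or without it
    cells = dom.xpath('//*[@id="financials_table_2"]/table//tr[11]/td[2]/text()')
    return cells[0] if cells else None  # guard against the IndexError

print(minerar_advfns('aper3'))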
I am extracting some data from the URL https://blinkit.com/prn/catch-cumin-seedsjeera-whole/prid/56692, which has unstructured Product Details elements.
Using this code:
product_details = wd.find_elements(by=By.XPATH, value="//div[@class='ProductAttribute__ProductAttributesDescription-sc-dyoysr-2 lnLDYa']")
info_shelf_life = product_details[0].text.strip()
info_country_of_origin = product_details[1].text.strip()
As you can see, the Product Details elements are unstructured, and this index-based approach breaks when the index changes from URL to URL.
Hence I tried this approach, which throws a NoSuchWindowException:
info_shelf_life = wd.find_element(By.XPATH, value="//div[[contains(@class, 'ProductAttribute__ProductAttributesDescription-sc-dyoysr-2 lnLDYa') and contains(., 'Shelf Life')]/..")
print(info_shelf_life.text.strip())
How can I extract text inside div based on text inside span tags?
Your XPath is invalid (the brackets are unbalanced). You can try
info_shelf_life = wd.find_element(By.XPATH, '//p[span="Shelf Life"]/following-sibling::div').text
info_country_of_origin = wd.find_element(By.XPATH, '//p[span="Country of Origin"]/following-sibling::div').text
to get the required data.
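The same label-based lookup generalizes to a small helper, sketched here under the assumption (from the answer above) that each attribute is a p label with the name in a span, followed by a sibling div holding the value; product_attribute is a hypothetical name and wd is the question's webdriver instance:

from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException

def product_attribute(wd, label):
    # find the value div that follows the <p><span>label</span></p> element
    try:
        el = wd.find_element(By.XPATH, f'//p[span="{label}"]/following-sibling::div')
        return el.text.strip()
    except NoSuchElementException:
        return None  # attribute absent on this product page

info_shelf_life = product_attribute(wd, 'Shelf Life')
info_country_of_origin = product_attribute(wd, 'Country of Origin')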
I am trying to scrape data from naukri.com; here I am trying to scrape the location details for each recruiter visible on the page.
The code I am writing is:
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException

def sol7(url):
    driver = webdriver.Chrome(r'C:\Users\Rahul\Downloads\chromedriver_win32\chromedriver.exe')
    driver.maximize_window()
    driver.get(url)
    # getting link of recruiters
    recruiters_tag = driver.find_element_by_xpath("//ul[@class='midSec menu']/li[2]/a")
    driver.get(recruiters_tag.get_attribute('href'))
    # search and click for data scientist
    driver.find_element_by_xpath("//input[@class='sugInp']").send_keys('Data science')
    driver.find_element_by_xpath("//button[@id='qsbFormBtn']").click()
    highlight_table_tag = driver.find_elements_by_xpath("//p[@class='highlightable']")
    print(len(highlight_table_tag))
    for i in highlight_table_tag:
        try:
            print((i.find_element_by_xpath("//small[@class='ellipsis']")).text)
        except NoSuchElementException:
            print('-')
First I extract all the recruiter details into the list highlight_table_tag.
highlight_table_tag includes all the elements on the page, yet the loop only ever prints the 0th element of the list.
I want to scrape the location in such a way that, if the location tag is missing from an element of highlight_table_tag, '-' is printed instead.
Please help!
When XPath-ing child elements, put . in front of the expression; otherwise the search runs from the root node, which returns the same first match on every iteration.
print((i.find_element_by_xpath(".//small[@class='ellipsis']")).text)
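A sketch of the corrected loop, using the locators from the question (find_element_by_xpath is the older pre-Selenium-4 API the question already uses):

from selenium.common.exceptions import NoSuchElementException

for i in highlight_table_tag:
    try:
        # the leading '.' scopes the search to this recruiter card
        print(i.find_element_by_xpath(".//small[@class='ellipsis']").text)
    except NoSuchElementException:
        print('-')  # no location tag in this card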
In need of some troubleshooting for some code that does the following: 1) scrape links from a webpage, and 2) scrape the text for those links from the same page.
Had some success in extracting the links and writing them as a single column:
elements = driver.find_elements_by_xpath("//a[@href]")
with open('csvfile01.csv', "w", newline='') as output:
    writer = csv.writer(output)
    for element in elements:
        writer.writerow([element.get_attribute("href")])
Unfortunately, I was stuck when it came to:
1) getting the "text" for the links,
2) exporting it as a separate column, and
3) scraping a specific part of the webpage for links, e.g. in a table ("td") or a div section.
The code as it stands now:
from selenium import webdriver
import time
import csv

driver = webdriver.Chrome()
driver.get("https://en.wikipedia.org/wiki/Main_Page")
time.sleep(5)
columns = ['text', 'link']
e1 = driver.find_element_by_css_selector("a")
e2 = driver.find_elements_by_xpath("//a[@href]")
elements = zip(e1, e2)
time.sleep(5)
with open('csvfile01.csv', "w", newline='') as output:
    writer = csv.writer(output)
    for element in elements:
        writer.writerow(columns)
    writer.writerows(elements)
driver.quit()
Any suggestions would be much appreciated. Thanks!
As far as getting the text goes, you can do .text. Also, your CSS selector doesn't seem right, considering it is only "a". To get an XPath/CSS selector, just inspect the element, right-click it, choose Copy, and you get a list of things to copy. I don't use Selenium much, but when I did I noticed that in the XPath only one number would change (like in a table of proxies), so I just defined a counter and incremented it in a loop.
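Putting those pieces together, a sketch that writes the link text and the href as two CSV columns (the URL is the one from the question; the commented XPaths show how to scope the search to a table cell or a div):

from selenium import webdriver
import csv

driver = webdriver.Chrome()
driver.get("https://en.wikipedia.org/wiki/Main_Page")

# to restrict the links to part of the page, narrow the XPath, e.g.
# "//td//a[@href]" for table cells or "//div[@id='content']//a[@href]" for a div
elements = driver.find_elements_by_xpath("//a[@href]")

with open('csvfile01.csv', 'w', newline='') as output:
    writer = csv.writer(output)
    writer.writerow(['text', 'link'])  # header row, written once
    for element in elements:
        writer.writerow([element.text, element.get_attribute('href')])

driver.quit()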
I'm writing a program to iterate through elements on a webpage. I start the browser like so:
self.browser = webdriver.Chrome(executable_path="C:/Users/me/chromedriver.exe")
self.browser.get("https://www.google.com/maps/place/Foster+Street+Coffee/@36.0016436,-78.9018397,19z/data=!4m7!3m6!1s0x89ace473f05b7d39:0x42c63a92682d9ec3!8m2!3d36.0016427!4d-78.9012927!9m1!1b1")
This opens the site, in which I can find an element I'm interested in using:
reviews = self.browser.find_elements_by_class_name("section-review-line")
Now I have a list of elements for the class name "section-review-line", which seems to populate correctly. I'd like to iterate through this list and pick out subelements with a set of logic. To get the subelements, which I know exist under the class name "section-review-review-content", I try this:
for review in reviews:
    content = review.find_element_by_class_name("section-review-review-content")
This errors out with:
selenium.common.exceptions.NoSuchElementException: Message: no such element: Unable to locate element: {"method":"css selector","selector":".section-review-review-content"}
Ok, here is all the information you need from each review.
reviews = driver.find_elements_by_class_name("section-review-content")
for review in reviews:
    reviewer = review.find_element_by_class_name("section-review-title").text
    numOfReviews = review.find_element_by_xpath(".//div[@class='section-review-subtitle']//span[contains(.,'reviews')]").text.strip().replace('.', '')
    numberOfStars = review.find_element_by_class_name("section-review-stars").get_attribute('aria-label').strip()
    publishDate = review.find_element_by_class_name("section-review-publish-date").text
    content = review.find_element_by_class_name("section-review-review-content").text
Ah, figured it out: the page had an empty element up top, which caused the error. The large majority of the elements did not have this problem, and a try/except solved it like so:
from selenium.common.exceptions import NoSuchElementException

reviews = self.browser.find_elements_by_class_name("section-review-line")
for review in reviews:
    try:
        content = review.find_element_by_class_name("section-review-review-content")
        rtext = content.find_element_by_class_name("section-review-text").text
    except NoSuchElementException:
        continue  # skip the occasional empty review element
I am trying to get the UniProt ID from this webpage: ENSEMBL. But I am having trouble using XPath: right now I am getting an empty list, and I do not understand why.
My idea is to write a small function that takes an ENSEMBL ID and returns the UniProt ID.
import requests
from lxml import html

ens_code = 'ENST00000378404'
webpage = 'http://www.ensembl.org/id/' + ens_code
response = requests.get(webpage)
tree = html.fromstring(response.content)
path = '//*[@id="ensembl_panel_1"]/div[2]/div[3]/div[3]/div[2]/p/a'
uniprot_id = tree.xpath(path)
print(uniprot_id)
Any help would be appreciated :)
Update: it now only prints lists that exist, but it is still returning None in some cases.
def getUniprot(ensembl_code):
    ensembl_code = ensembl_code[:-1]
    webpage = 'http://www.ensembl.org/id/' + ensembl_code
    response = requests.get(webpage)
    tree = html.fromstring(response.content)
    path = '//div[@class="lhs" and text()="Uniprot"]/following-sibling::div/p/a/text()'
    uniprot_id = tree.xpath(path)
    if uniprot_id:
        print(uniprot_id)
        return uniprot_id
The reason you are getting an empty list is that you used the XPath Chrome supplies when you right-click and choose Copy XPath. It returns nothing because the tag is not in the source: the element is dynamically generated, so the HTML that requests returns does not contain it.
In [6]: response = requests.get(webpage)
In [7]: b"ensembl_panel_1" in response.content
Out[7]: False
You should always check the page source to see what you are actually getting back; what you see in the developer console is not necessarily what you get when you download the source.
You can also use a more specific XPath in case there are other http://www.uniprot.org/uniprot/ links on the page: search the divs for the class "lhs" and the text "Uniprot", then take the text of the first following anchor tag:
path = '//div[@class="lhs" and text()="Uniprot"]/following::a[1]/text()'
Which would give you:
['Q8TDY3']
You can also select the following-sibling div, where the anchor is inside its child p tag:
path = '//div[@class="lhs" and text()="Uniprot"]/following-sibling::div/p/a/text()'
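Putting it together, a sketch of the small lookup function the question asked for, built on the sibling-div XPath from this answer (get_uniprot is a hypothetical name; returning None on no match is an added choice, not part of the original):

import requests
from lxml import html

def get_uniprot(ensembl_code):
    response = requests.get('http://www.ensembl.org/id/' + ensembl_code)
    tree = html.fromstring(response.content)
    path = '//div[@class="lhs" and text()="Uniprot"]/following-sibling::div/p/a/text()'
    matches = tree.xpath(path)
    return matches[0] if matches else None  # None when the page has no Uniprot row

print(get_uniprot('ENST00000378404'))  # expected: Q8TDY3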