Using XPath to get a string from a webpage - Python

I am trying to get the UniProt ID from this webpage: ENSEMBL. But I am having trouble using XPath. Right now I am getting an empty list and I do not understand why.
My idea is to write a small function that takes an ENSEMBL ID and returns the UniProt ID.
import requests
from lxml import html
ens_code = 'ENST00000378404'
webpage = 'http://www.ensembl.org/id/'+ens_code
response = requests.get(webpage)
tree = html.fromstring(response.content)
path = '//*[@id="ensembl_panel_1"]/div[2]/div[3]/div[3]/div[2]/p/a'
uniprot_id = tree.xpath(path)
print uniprot_id
Any help would be appreciated :)
It now only prints the lists that exist, but it still returns None when nothing is found.
def getUniprot(ensembl_code):
    ensembl_code = ensembl_code[:-1]
    webpage = 'http://www.ensembl.org/id/'+ensembl_code
    response = requests.get(webpage)
    tree = html.fromstring(response.content)
    path = '//div[@class="lhs" and text()="Uniprot"]/following-sibling::div/p/a/text()'
    uniprot_id = tree.xpath(path)
    if uniprot_id:
        print uniprot_id
        return uniprot_id

You are getting an empty list because it looks like you used the XPath that Chrome supplied when you right-clicked and chose copy xpath. That XPath returns nothing because the tag is not in the source: it is dynamically generated, so what requests returns does not contain the element.
In [6]: response = requests.get(webpage)
In [7]: "ensembl_panel_1" in response.content
Out[7]: False
You should always check the page source to see what you are actually getting back; what you see in the developer console is not necessarily what you get when you download the source.
You can also use a more specific XPath in case there are other http://www.uniprot.org/uniprot/ links on the page: search the divs for one with the class "lhs" and the text Uniprot, then get the text from the first following anchor tag:
path = '//div[@class="lhs" and text()="Uniprot"]/following::a[1]/text()'
Which would give you:
['Q8TDY3']
You can also select the following sibling div, where the anchor is inside its child p tag:
path = '//div[@class="lhs" and text()="Uniprot"]/following-sibling::div/p/a/text()'
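Putting the pieces together, a minimal Python 3 sketch of the lookup function the question asked for (the name get_uniprot is just for illustration, and it assumes the Ensembl page still renders the "lhs" divs server-side):
import requests
from lxml import html

def get_uniprot(ensembl_code):
    # fetch the Ensembl summary page and read the Uniprot link text
    response = requests.get('http://www.ensembl.org/id/' + ensembl_code)
    tree = html.fromstring(response.content)
    path = '//div[@class="lhs" and text()="Uniprot"]/following-sibling::div/p/a/text()'
    return tree.xpath(path)  # e.g. ['Q8TDY3'], or [] if nothing matched

print(get_uniprot('ENST00000378404'))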

Related

find_all in bs4 returns one element when there are more in the web page

I am scraping a Newegg page, and I want to scrape the consumer ratings of the product. I am using this code:
import requests
from bs4 import BeautifulSoup as bs

page = requests.get('https://www.newegg.com/msi-geforce-rtx-3060-rtx-3060-ventus-2x-12g-oc/p/N82E16814137632?Description=gpu&cm_re=gpu-_-14-137-632-_-Product').text
soup = bs(page, 'lxml')
the_rating = soup.find_all(class_='rating rating-4')
print(the_rating)
And it returns only this one element even though I am using find_all:
[<i class="rating rating-4"></i>]
I get [] with your code; to see why, print the response status and URL:
r = requests.get('https://www.newegg.com/msi-geforce-rtx-3060-rtx-3060-ventus-2x-12g-oc/p/N82E16814137632?Description=gpu&cm_re=gpu-_-14-137-632-_-Product')
print(f'<{r.status_code} {r.reason}> from {r.url}')
# soup = bs(r.content , 'lxml')
output:
<200 OK> from https://www.newegg.com/areyouahuman?referer=/areyouahuman?referer=https%3A%2F%2Fwww.newegg.com%2Fmsi-geforce-rtx-3060-rtx-3060-ventus-2x-12g-oc%2Fp%2FN82E16814137632%3FDescription%3Dgpu%26cm_re%3Dgpu-_-14-137-632-_-Product&why=8&cm_re=gpu-_-14-137-632-_-Product&Description=gpu
It's been redirected to a CAPTCHA...
Anyway, even if you get past that (I couldn't, so to test I just pasted and parsed the response from my browser's network logs), all you can get from page is the source HTML, which does not contain any elements with class="rating rating-4". Using selenium and waiting for the page to finish loading yielded a bit more, but even then there weren't any exact matches.
[There were some matches when I inspected in browser, but only if I wasn't in incognito mode, which is likely why selenium didn't find them either.]
So, the site probably adds or removes some classes based on the source of the request. If you just want to get all elements with both the rating and rating-4 classes (which will include elements with class="rating is-large rating-4"), you can use .find_all with a lambda (or a separately defined function), or use .select with CSS selectors:
the_rating = soup.select('.rating.rating-4') # shorter than
# .find_all(lambda t: {'rating', 'rating-4'}.issubset(set(t.get('class', []))))
[Just make sure you have the full/correct HTML.]
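A quick self-contained check of the difference between the two matching styles (the HTML snippet is hypothetical, just to illustrate):
from bs4 import BeautifulSoup

snippet = '''<i class="rating rating-4"></i>
<i class="rating is-large rating-4"></i>'''
soup = BeautifulSoup(snippet, 'lxml')
print(soup.find_all(class_='rating rating-4'))  # matches only the exact class string
print(soup.select('.rating.rating-4'))          # matches both elements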

Python Selenium - How to extract element based on text inside span tag?

I am extracting some data from the URL https://blinkit.com/prn/catch-cumin-seedsjeera-whole/prid/56692, which has unstructured Product Details elements.
Using this code:
product_details = wd.find_elements(by=By.XPATH, value="//div[@class='ProductAttribute__ProductAttributesDescription-sc-dyoysr-2 lnLDYa']")
info_shelf_life = product_details[0].text.strip()
info_country_of_origin = product_details[1].text.strip()
As you can see, the Product Details elements are unstructured, and this approach is not suitable when the index changes from URL to URL.
Hence I tried this approach, which throws a NoSuchWindowException error.
info_shelf_life = wd.find_element(By.XPATH, value="//div[[contains(@class, 'ProductAttribute__ProductAttributesDescription-sc-dyoysr-2 lnLDYa') and contains(., 'Shelf Life')]/..")
print(info_shelf_life.text.strip())
How can I extract the text inside a div based on the text inside its span tag?
Your XPath is invalid. You can try
info_shelf_life = wd.find_element(By.XPATH, '//p[span="Shelf Life"]/following-sibling::div').text
info_country_of_origin = wd.find_element(By.XPATH, '//p[span="Country of Origin"]/following-sibling::div').text
to get the required data.
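For repeated lookups, the same pattern can be wrapped in a small helper; a sketch, where the attribute_value function name is made up and the page structure (a p whose span holds the label, with the value in the following sibling div) is the one assumed by the XPath above:
from selenium import webdriver
from selenium.webdriver.common.by import By

wd = webdriver.Chrome()
wd.get('https://blinkit.com/prn/catch-cumin-seedsjeera-whole/prid/56692')

def attribute_value(label):
    # find the <p> whose <span> holds the label, then read the sibling <div>
    return wd.find_element(By.XPATH, f'//p[span="{label}"]/following-sibling::div').text.strip()

print(attribute_value('Shelf Life'))
print(attribute_value('Country of Origin'))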

Alternative to pandas.read_html where the URL is not unique?

I want to access data from an HTML table in the section "ERGEBNIS" with Python 3.7.
The problem is that the results for each combination of the drop-down values are only shown after clicking submit. This does, however, not change the URL, so I have no idea how I can access the results table after updating the input values of the drop-downs.
Here is what I've done so far:
from selenium import webdriver
from selenium.webdriver.support.ui import Select
import time

browser = webdriver.Chrome()  # assuming Chrome; any WebDriver works
browser.get('https://daten.ktbl.de/feldarbeit/entry.html')
#Fix values of the drop down fields:
fertilizer = Select(browser.find_element_by_name("hgId"))
fertilizer.select_by_value("2")
fertilizer = Select(browser.find_element_by_name("gId"))
fertilizer.select_by_value("193")
fertilizer = Select(browser.find_element_by_name("avId"))
fertilizer.select_by_value("383")
fertilizer = Select(browser.find_element_by_name("hofID"))
fertilizer.select_by_value("2")
fertilizer = Select(browser.find_element_by_name("flaecheID"))
fertilizer.select_by_value("5")
fertilizer= Select(browser.find_element_by_name("mengeID"))
fertilizer.select_by_value("60")
# Submit changes to show the results of this particular combination of values
button = browser.find_element_by_xpath("//*[@type='submit']")
button.click()
Submitting the changes does, however, not change the URL, so I don't know how I can access the results table (here "ERGEBNIS").
Otherwise my approach would have been to use pd.read_html somehow like this:
...
url = browser.current_url
time.sleep(1)
df_list = pd.read_html(url, match = "Dieselbedarf")
But since the URL isn't unique for each result, this doesn't make sense. The same issue would arise with BeautifulSoup, or at least I don't understand how I can do it without a unique URL.
Any ideas how I can access the HTML table otherwise?
EDIT: The answer of @bink1time solved my problem of how to access the table without the URL, but via the raw HTML string:
html_source = browser.page_source
df_list = pd.read_html(html_source, match = "Dieselbedarf")
You can probably just get the html source:
html_source = browser.page_source
According to the docs:
https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.read_html.html
read_html takes a URL, a file-like object, or a raw string containing HTML.
In this case you pass the raw string.
html_source = browser.page_source
df_list = pd.read_html(html_source, match = "Dieselbedarf")
Just a note: you don't need to sleep.
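Putting it together with the Selenium code from the question, the end of the script might look like this sketch:
import pandas as pd

# after button.click() above, the rendered DOM contains the results table
html_source = browser.page_source
df_list = pd.read_html(html_source, match="Dieselbedarf")
print(df_list[0])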

Is there any way to get all the "inner html text" of a website and its corresponding coordinates using python selenium?

I'm able to get the div elements by using this code:
divs = driver.find_elements_by_xpath("//div")
and by looping through the divs and using the .text attribute I'm able to get the text as well.
code:
for i in divs:
    print(i.text)
but in my use-case I want the location as well as the size of the text.
Please help !!
My code:
for i in range(0, len(WEBSITES)):
    print(timestamp())  # timestamp
    print(i, WEBSITES[i])  # name of the website
    driver.get(WEBSITES[i])
    delay = 10
    time.sleep(delay)
    img = cv2.imread(os.getcwd() + '/' + str(i) + '.png')  # read the image to be inscribed
    print("getting div tags \n")
    divs = driver.find_elements_by_xpath("//div")  # find all the div tags
    # anchors = divs.find_elements_by_xpath("//*")  # find all the child tags in the divs
    for i in divs:
        print(i.text.location)
Whenever I try the .location or .size attribute I get a Unicode error.
Disclaimer: I have searched through all the posts, so this is not a duplicate question.
Can you try getting the coordinates of the div rather than the text? Like below:
for i in divs:
    print(i.location)
Edit
So if you want to get the coordinates of all the text in a page, get the text elements as below and read their coordinates:
textElements = driver.find_elements_by_xpath("//body//*[text()]")  # gets all text elements
for i in textElements:
    print(i.text)
    print(i.location)
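Since the question also asks for the size, both properties can be read from each element in one pass; a sketch using the same (older) Selenium API as above:
text_info = []
for el in driver.find_elements_by_xpath("//body//*[text()]"):
    text_info.append({
        'text': el.text,
        'location': el.location,  # e.g. {'x': 10, 'y': 250}
        'size': el.size,          # e.g. {'width': 120, 'height': 18}
    })
print(text_info[:5])  # first five entries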

Python beautiful soup web scraper doesn't return tag contents

I'm trying to scrape matches and their respective odds from a local bookie site, but for every site I try, my web scraper doesn't return anything; it just prints "Process finished with exit code 0".
Can someone help me crack open the containers and get out the contents?
I have tried all the sites below for almost a month but with no success. The problem seems to be with the exact div, class, or span element layout.
https://www.betlion.co.ug/
https://www.betpawa.ug/
https://www.premierbet.ug/
For example, I tried link 2 in the code shown:
import requests
from bs4 import BeautifulSoup
url = "https://www.betpawa.ug/"
response = requests.get(url, timeout=5)
content = BeautifulSoup(response.content, "html.parser")
for match in content.findAll("div", attrs={"class": "events-container prematch", "id": "Bp-Event-591531"}):
    print(match.text.strip())
I expect the program to return a list of matches, odds, and all the other components of the container. However, the program runs and just prints "Process finished with exit code 0" and nothing else.
It looks like the base site gets loaded in two phases:
1. Load some HTML structure for the page,
2. Use JavaScript to fill in the contents.
You can prove this to yourself by right-clicking on the page, choosing "view page source", and then searching for "events-container" (it is not there).
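The same check works from Python (a quick sketch, assuming the site is reachable):
import requests

page = requests.get("https://www.betpawa.ug/").text
print("events-container" in page)  # False - the element is added later by JavaScript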
So you'll need something more powerful than requests + bs4. I have heard of folks using Selenium to do this, but I'm not familiar with it.
You should consider using urllib.request instead of requests:
from urllib.request import Request, urlopen
- build your req:
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
- retrieve the document:
res = urlopen(req)
- parse it using bs4:
html = BeautifulSoup(res, 'html.parser')
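Put together, that suggestion reads like this sketch (note it still fetches only the static HTML, so it runs into the same JavaScript problem described in the other answers):
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

url = "https://www.betpawa.ug/"
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})  # browser-like User-Agent
res = urlopen(req)
html = BeautifulSoup(res, 'html.parser')
print(html.title)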
Like Chris Curvey described, the problem is that requests can't execute the JavaScript of the page. If you print your content variable, you can see that the page displays a message like: "JavaScript Required! To provide you with the best possible product, our website requires JavaScript to function..." With Selenium you control a full browser in the form of a WebDriver (for example the ChromeDriver binary for the Google Chrome browser):
from bs4 import BeautifulSoup
from selenium import webdriver
chrome_options = webdriver.ChromeOptions()
# chrome_options.add_argument('headless')
driver = webdriver.Chrome(chrome_options = chrome_options)
url = "https://www.betpawa.ug/"
driver.get(url)
page = driver.page_source
content = BeautifulSoup(page, 'html.parser')
for match in content.findAll("div", attrs={"class": "events-container"}):
    print(match.text.strip())
Update:
In the last line, the command print(match.text.strip()) simply extracts the text of each match div which has the class attribute "events-container".
If you want to extract more specific content, you can access each match via the match variable.
You need to know:
- which of the available information you want,
- how to identify this information inside the match div's structure,
- and in which data type you need this information.
To make it easy, run the program and open the Chrome developer tools with F12. In the top left corner you see the icon for "select an element ..."; if you click the icon and then click the desired element in the browser, the area under the icon shows the equivalent source. Analyse it carefully to get the information you need, for example:
- The title of the football match is the first h3 tag in the match div, and it is a string.
- The odds shown are span tags with the class event-odds, and each is a number (float/double).
Search for the functions you need in Google or in the reference of the package you use (BeautifulSoup4).
Let's try it quick and dirty, using the BeautifulSoup functions on the match variable so that we don't search the elements of the full site:
# (1) let's try to find the h3 tag
title_tags = match.findAll("h3")  # use on the match variable
if len(title_tags) > 0:  # at least one found?
    title = title_tags[0].getText()  # get the text of the first one
    print("Title: ", title)  # show it
else:
    print("no h3 tags found")
    exit()

# (2) let's try to get some odds as numbers, in the order in which they are displayed
odds_tags = match.findAll("span", attrs={"class": "event-odds"})
if len(odds_tags) > 2:  # at least three found?
    odds = []  # create a list
    for tag in odds_tags:  # loop over the odds_tags we found
        odd = tag.getText()  # get the text
        print("Odd: ", odd)
        # good, but it is a string; you can't compare it with a number in
        # Python and expect a sensible result. You have to clean it and convert it:
        clean_odd = odd.strip()  # remove empty spaces
        odd = float(clean_odd)  # convert it to float
        print("Odd as Number:", odd)
        odds.append(odd)  # collect the converted odd (the list was never filled in the original)
else:
    print("something went wrong with the odds")
    exit()

input("Press enter to try it on the next match!")
