I am trying to find a relative (not absolute) XPath that will allow me to import the first table after the text 'SPLIT TIMES'. This is my code:
from lxml import html
import requests
ResultsPage = requests.get('https://www.iaaf.org/competitions/iaaf-world-championships/iaaf-world-championships-london-2017-5151/results/men/10000-metres/final/result')
ResultsTree = html.fromstring(ResultsPage.content)
ResultsTable = ResultsTree.xpath('//*[text()[contains(normalize-space(), "SPLIT TIMES")]]')
print(ResultsTable)
I am trying to find the XPath that will home in on the 'SPLIT TIMES' table found here: https://www.iaaf.org/competitions/iaaf-world-championships/iaaf-world-championships-london-2017-5151/results/men/10000-metres/final/result
I would be grateful if the XPath could be as versatile as possible. For example, the requirement may change so that I need to find the first table after the text which reads '10,000 METRES MEN' (same URL as above). Or, I may need to find the first table after the text which reads 'MEDAL TABLE' (different URL): https://www.iaaf.org/competitions/iaaf-world-championships/iaaf-world-championships-london-2017-5151/medaltable
There is a problem with your code: the website you are trying to scrape uses protection that denies the request (the User-Agent header is missing, as pointed out in the other answer):
The request could not be satisfied. Request blocked. Generated by
cloudfront (CloudFront)
I was able to bypass this by using this library: cloudflare-scrape.
You can install it using pip:
pip install cfscrape
And here is the code with a working XPath for what you are trying to achieve. The trick was to use the "following" axis, as described in the specification: https://www.w3.org/TR/xpath/#axes
import cfscrape
from lxml import html
scraper = cfscrape.create_scraper()
page = scraper.get('https://www.iaaf.org/competitions/iaaf-world-championships/iaaf-world-championships-london-2017-5151/results/men/10000-metres/final/result')
tree = html.fromstring(page.content)
table = tree.xpath(".//h2[contains(text(), 'Split times')][1]/following::table[1]")
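Since the question asks for versatility, the same idea generalizes into a small helper. This is only a sketch: the exact heading strings ('Split times', 'Medal table') are assumptions about the markup, which the page's CSS apparently renders in upper case.
def first_table_after(tree, heading_text):
    # Match any element whose text contains the heading, then take the
    # first table on the following axis; return None if nothing matches.
    matches = tree.xpath(
        "//*[text()[contains(normalize-space(), '{}')]]"
        "/following::table[1]".format(heading_text))
    return matches[0] if matches else None

split_table = first_table_after(tree, 'Split times')
medal_table = first_table_after(tree, 'Medal table')  # on the medal-table page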
You can use the following axis in XPath, something like below.
relative_string = "Split times"
ResultsTable = ResultsTree.xpath("//*[text()[contains(normalize-space(), '" + relative_string + "')]]/following::table[1]")
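String concatenation breaks as soon as the search text contains a quote character. lxml also accepts XPath variables, referenced as $name and passed as keyword arguments, which sidesteps the quoting problem. A minimal sketch of the same query:
relative_string = "Split times"
ResultsTable = ResultsTree.xpath(
    "//*[text()[contains(normalize-space(), $t)]]/following::table[1]",
    t=relative_string)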
I am currently learning web scraping with Python. I'm reading Web Scraping with Python by Ryan Mitchell.
I am stuck at Crawling Sites Through Search. For example, the Reuters search given in the book works perfectly, but when I try to find it by myself, as I will have to in the future, I get this link.
Whilst the second link works for a human, I cannot figure out how to scrape it due to weird class names like class="media-story-card__body__3tRWy".
The first link gives me simple names, like class="search-result-content", that I can scrape.
I've encountered the same problem on other sites too. How would I go about scraping it or finding a link with normal names in the future?
Here's my code example:
from bs4 import BeautifulSoup
import requests
from rich.pretty import pprint
text = "hello"
url = f"https://www.reuters.com/site-search/?query={text}"
response = requests.get(url)
soup = BeautifulSoup(response.text, "lxml")
results = soup.select("div.media-story-card__body__3tRWy")
for result in results:
    pprint(result)
    pprint("###############")
You might resort to a prefix attribute value selector, like
div[class^="media-story-card__body__"]
This assumes that the class is the only one (or at least notationally the first). However, the idea can be extended to checking for a substring with the *= operator.
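For instance, plugging that selector into the question's own code (a sketch: the media-story-card__body__ prefix is taken from the question and may change again when the site is rebuilt, and the site may well block plain requests):
from bs4 import BeautifulSoup
import requests

text = "hello"
url = f"https://www.reuters.com/site-search/?query={text}"
response = requests.get(url)
soup = BeautifulSoup(response.text, "lxml")

# ^= matches the start of the whole class attribute string; if the target
# class might not come first, *= (substring match) is the safer variant.
results = soup.select('div[class^="media-story-card__body__"]')
for result in results:
    print(result.get_text(strip=True))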
I am a Python programmer. I want to extract all of the table data at the link below with the BeautifulSoup library.
This is the link: https://finance.yahoo.com/quote/GC%3DF/history?p=GC%3DF
You'll want to look into web scraping tutorials.
Here's one to get you started: https://realpython.com/python-web-scraping-practical-introduction/
This kind of thing can get a little complicated with complex markup, and the link in the question qualifies as slightly complex markup, but basically you want to find the container div with the classes "Pb(10px) Ovx(a) W(100%)", or the table container with a data-test attribute of "historical-prices", and drill down to the data you need from there.
HOWEVER, if you insist on using the BeautifulSoup library, here's a tutorial for that: https://realpython.com/beautiful-soup-web-scraper-python/
Scroll down to step 3: "Parse HTML Code With Beautiful Soup"
Install the library: python -m pip install beautifulsoup4
Then, use the following code to scrape the page:
import requests
from bs4 import BeautifulSoup
URL = "https://finance.yahoo.com/quote/GC%3DF/history?p=GC%3DF"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
Then find the table container with the data-test attribute of "historical-prices" that I mentioned earlier:
results = soup.find(attrs={"data-test" : "historical-prices"})
Thanks to this other StackOverflow post for this info on the attrs parameter: Extracting an attribute value with beautifulsoup
From there, you'll want to drill down. I'm not really sure how to do this step properly, as I've never done this in Python before, but there are multiple ways to go about it. My preferred way would be to use the find method or the findAll method on the initial result:
result_set = results.find("tbody", recursive=False).findAll("tr")
Alternatively, you may be able to use the deprecated findChildren method:
result_set = results.findChildren("tbody", recursive=False)
# findChildren returns a list, so drill into its first element for the rows
result_set2 = result_set[0].findChildren("tr", recursive=False)
You may require a results set loop for each drill-down. The page you mentioned doesn't make things easy, mind you. You'll have to drill down multiple times to find the proper tr elements. Of course, the above code is only example code, not properly tested.
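As a rough sketch of that drill-down (untested against the live page, which may since have changed or may block plain requests; the data-test attribute comes from the answer above):
table = soup.find(attrs={"data-test": "historical-prices"})
if table is not None:
    body = table.find("tbody")
    for tr in body.find_all("tr"):
        # Keep the visible cell text; skip spacer rows that have no <td>.
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:
            print(cells)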
Venturing into the world of Python. I've done the Codecademy course and trawled through Stack Overflow and YouTube, but I'm hitting an issue I can't solve.
I'm attempting to do a simple print of a table located on Wikipedia. Failing miserably at writing my own code, I decided to use a tutorial example and build off it. However, this isn't working and I haven't the foggiest idea why.
This is the code, with the appropriate link included. My end result is an empty list "[ ]". I'm using PyCharm 2017.2, beautifulsoup 4.6.0, requests 2.18.4 & Python 3.6.2. Any advice appreciated. For reference, the tutorial website is here.
import requests
from bs4 import BeautifulSoup
WIKI_URL = "https://en.wikipedia.org/wiki/List_of_volcanoes_by_elevation"
req = requests.get(WIKI_URL)
soup = BeautifulSoup(req.content, 'lxml')
table_classes = {"class": ["sortable", "plainrowheaders"]}
wikitables = soup.findAll("table", table_classes)
print(wikitables)
You can accomplish that using regular expressions.
You get the site content with requests.get(WIKI_URL).content.
Look at the source code of the site to see how Wikipedia presents tables in HTML.
Find a regular expression that can fit a whole table (it might be something like <table>(?P<table>.*?)</table>). What this does is capture anything between the <table> and </table> tokens. There is good documentation for regex in Python; take a look at re.findall().
Now you are left with table data. You can use regular expressions again to get data for each row, then regex on each row to get columns. re.findall() is the key again.
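A minimal sketch of that approach (regex on HTML is brittle: it assumes no nested tables, and re.DOTALL is needed so the dot matches across newlines):
import re
import requests

WIKI_URL = "https://en.wikipedia.org/wiki/List_of_volcanoes_by_elevation"
page = requests.get(WIKI_URL).text

# Grab everything between each <table ...> and </table> pair, non-greedily.
tables = re.findall(r"<table.*?>(.*?)</table>", page, re.DOTALL)

for table in tables:
    # Split the table into rows, then pull the cell text out of each row.
    for row in re.findall(r"<tr.*?>(.*?)</tr>", table, re.DOTALL):
        cells = re.findall(r"<t[dh].*?>(.*?)</t[dh]>", row, re.DOTALL)
        # Strip any remaining inline tags from the cell contents.
        print([re.sub(r"<[^>]*>", "", c).strip() for c in cells])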
I would like to get the @src value '/pol_il_DECK-SANTA-CRUZ-STAR-WARS-EMPIRE-STRIKES-BACK-POSTER-8-25-20135.jpg' from this webpage:
from lxml import html
import requests
URL = 'http://systemsklep.pl/pol_m_Kategorie_Deskorolka_Deski-281.html'
session = requests.session()
page = session.get(URL)
HTMLn = html.fromstring(page.content)
print(HTMLn.xpath('//html/body/div[1]/div/div/div[3]/div[19]/div/a[2]/div/div/img/@src')[0])
but I can't. No matter how I format the XPath, it doesn't work.
In the spirit of @pmuntima's answer, if you already know it's the 14th sourced image, but want to stay with lxml, then you can:
print(HTMLn.xpath('//img/@data-src')[14])
To get that particular image. It similarly reports:
/pol_il_DECK-SANTA-CRUZ-STAR-WARS-EMPIRE-STRIKES-BACK-POSTER-8-25-20135.jpg
If you want to do your indexing in XPath (possibly more efficient in very large result sets), then:
print(HTMLn.xpath('(//img/@data-src)[15]')[0])
Note the [15]: XPath positions are 1-based, so it corresponds to Python's 0-based [14].
It's a little bit uglier, given the need to parenthesize in the XPath, and then to index out the first element of the list that .xpath always returns.
Still, as discussed in the comments above, strictly numerical indexing is generally a fragile scraping pattern.
Update: So why is the XPath given by browser inspect tools not leading to the right element? Because the content seen by a browser, after a dynamic JavaScript-based update, is different from the content seen by your request. Your request is not running JS and does no such updates. Different content means a different address is needed, at least when the address is as static and fragile as a positional path.
Part of the updates here seems to be taking src URIs, which initially point to an "I'm loading!" gif, and replacing them with the "real" src values, which are found in the data-src attribute to begin with.
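You can see that swap from the program's side with a quick check (a sketch against the already-parsed HTMLn):
# Compare the placeholder src with the real data-src for the first few images.
for img in HTMLn.xpath('//img')[:5]:
    print(img.get('src'), '->', img.get('data-src'))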
So you need two changes:
a stronger way to address the content you want (a way that doesn't break when you move from browser inspect to program fetch) and
to fetch the URIs you want from data-src not src, because in your program fetch, the JS has not done its load-and-switch trick the way it did in the browser.
If you know text associated with the target image, that can be the trick. E.g.:
search_phrase = 'DECK SANTA CRUZ STAR WARS EMPIRE STRIKES BACK POSTER'
path = '//img[contains(@alt, "{}")]/@data-src'.format(search_phrase)
print(HTMLn.xpath(path)[0])
This works because the alt attribute contains the target text. You look for images that have the search phrase contained in their alt attributes, then fetch the corresponding data-src values.
I used a combination of the requests and Beautiful Soup libraries. They're both wonderful and I would recommend them for scraping and parsing/extracting HTML. If you have a complex scraping job, Scrapy is really good.
So for your specific example, I can do:
from bs4 import BeautifulSoup
import requests
URL = 'http://systemsklep.pl/pol_m_Kategorie_Deskorolka_Deski-281.html'
r = requests.get(URL)
soup = BeautifulSoup(r.text, "html.parser")
specific_element = soup.find_all('a', class_="product-icon")[14]
res = specific_element.find('img')["data-src"]
print(res)
It will print out
/pol_il_DECK-SANTA-CRUZ-STAR-WARS-EMPIRE-STRIKES-BACK-POSTER-8-25-20135.jpg
I'm trying to learn to scrape a webpage (http://www.expressobeans.com/public/detail.php/185246), however I don't know what I'm doing wrong. I think it has to do with identifying the XPath, but how do I get the correct path (if that is the issue)? I've tried Firebug in Firefox as well as the Developer Tools in Chrome.
I want to be able to scrape the Manufacturer value (D&L Screenprinting) as well as all the Edition Details.
python script:
from lxml import html
import requests
page = requests.get('http://www.expressobeans.com/public/detail.php/185246')
tree = html.fromstring(page.text)
buyers = tree.xpath('//*[@id="content"]/table/tbody/tr[2]/td/table/tbody/tr/td[1]/dl/dd[3]')
print(buyers)
returns:
[]
Remove tbody from the XPath. Browsers insert tbody elements when rendering, but they are often absent from the raw HTML that requests downloads, so paths copied from inspect tools break on them:
buyers = tree.xpath('//*[@id="content"]/table/tr[2]/td/table/tr/td[1]/dl/dd[3]')
I'd start by suggesting you look at the page HTML and try to find a node closer to the value you are looking for, and build your path from there to make it shorter and easier to follow.
On that page I can see that there is a dl with class "itemListingInfo", and under it all the information you are looking for.
Also, if you want the "D&L Screenprinting" text, you need to extract the text from the link.
Try with this modified version, it should be straightforward to add the other xpath expressions and get the other fields as well.
from lxml import html
import requests
page = requests.get('http://www.expressobeans.com/public/detail.php/185246')
tree = html.fromstring(page.text)
buyers = tree.xpath('//dl[@class="itemListingInfo"]/dd[2]/a/text()')
print(buyers)
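To pick up all of the Edition Details in one pass, you could pair the dt labels with the dd values under the same dl. This is a sketch: it assumes the details are laid out as alternating dt/dd pairs, which is the usual structure of a definition list but is not confirmed from the page source.
labels = tree.xpath('//dl[@class="itemListingInfo"]/dt')
values = tree.xpath('//dl[@class="itemListingInfo"]/dd')
for dt, dd in zip(labels, values):
    # text_content() flattens nested tags like the <a> around the manufacturer.
    print(dt.text_content().strip(), ':', dd.text_content().strip())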