I am a Python programmer. I want to extract all of the table data from the link below using the BeautifulSoup library.
This is the link: https://finance.yahoo.com/quote/GC%3DF/history?p=GC%3DF
You'll want to look into web scraping tutorials.
Here's one to get you started: https://realpython.com/python-web-scraping-practical-introduction/
This kind of thing can get a little complicated with complex markup, and I'd say the page linked in the question qualifies as slightly complex markup. Basically, you want to find the container div with the classes "Pb(10px) Ovx(a) W(100%)", or the table container whose data-test attribute is "historical-prices", and drill down to the data you need from there.
However, if you insist on using the BeautifulSoup library, here's a tutorial for that: https://realpython.com/beautiful-soup-web-scraper-python/
Scroll down to step 3: "Parse HTML Code With Beautiful Soup"
Install the library:
python -m pip install beautifulsoup4
Then, use the following code to scrape the page:
import requests
from bs4 import BeautifulSoup
URL = "https://finance.yahoo.com/quote/GC%3DF/history?p=GC%3DF"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
Then, find the table container with the data-test attribute of "historical-prices" that I mentioned earlier:
results = soup.find(attrs={"data-test" : "historical-prices"})
Thanks to this other Stack Overflow post for the info on the attrs parameter: Extracting an attribute value with beautifulsoup
From there, you'll want to drill down. I'm not sure of the best way to do this step, as I've never done it in Python before, but there are multiple ways to go about it. My preferred way would be to use the find method or the findAll method on the initial result:
result_set = results.find("tbody", recursive=False).findAll("tr")
Alternatively, you may be able to use the deprecated findChildren method. Note that it returns a list, so you need to index into it (or loop over it) before drilling down again:
result_set = results.findChildren("tbody", recursive=False)
result_set2 = result_set[0].findChildren("tr", recursive=False)
You may need to loop over a result set for each drill-down. The page you mentioned doesn't make things easy, mind you; you'll have to drill down multiple times to find the proper tr elements. Of course, the above code is only example code and hasn't been properly tested.
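To tie it together, here's a minimal, untested sketch of the full drill-down under the assumptions above (the browser-like User-Agent header is an addition of mine, since Yahoo tends to reject bare requests; the page may also render the table with JavaScript, in which case requests alone won't see it):

import requests
from bs4 import BeautifulSoup

URL = "https://finance.yahoo.com/quote/GC%3DF/history?p=GC%3DF"
# Some sites reject requests without a browser-like User-Agent.
page = requests.get(URL, headers={"User-Agent": "Mozilla/5.0"})
soup = BeautifulSoup(page.content, "html.parser")

table = soup.find(attrs={"data-test": "historical-prices"})
rows = []
if table is not None:
    for tr in table.find_all("tr"):
        # Collect the text of every header or data cell in the row.
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        if cells:
            rows.append(cells)

for row in rows:
    print(row)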
I decided to view a website's source code and chose a class, "expanded". I wanted to print out all of its contents with this code:
import requests
from bs4 import BeautifulSoup
page = requests.get("https://www.quora.com/How-can-I-write-a-bot-using-Python")
soup = BeautifulSoup(page.content, 'html.parser')
print soup.find_all(class_='expanded')
but it simply prints out:
[]
Please help me figure out what's wrong.
I already saw this thread and tried following what the answer said, but it did not help, since this error appears in the terminal:
bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library?
When searching for a class value, you should pass it in like this:
soup.find_all(attrs={"class":"expanded"})
That being said, I don't see anything in the source code of that site with a class called "expanded". The closest thing I could find was class='ui_qtext_expanded'. If that is what you are trying to find, you need to include the whole string.
soup.find_all(attrs={"class":"ui_qtext_expanded"})
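As for the bs4.FeatureNotFound error from the terminal: that message means the lxml parser suggested by the linked answer isn't installed in your environment. Either install it:

pip install lxml

or keep using the built-in parser, as in your original code:

soup = BeautifulSoup(page.content, 'html.parser')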
I am trying to find a relative (not absolute) XPath that will allow me to import the first table after the text 'SPLIT TIMES'. This is my code:
from lxml import html
import requests
ResultsPage = requests.get('https://www.iaaf.org/competitions/iaaf-world-championships/iaaf-world-championships-london-2017-5151/results/men/10000-metres/final/result')
ResultsTree = html.fromstring(ResultsPage.content)
ResultsTable = ResultsTree.xpath(("""//*[text()[contains(normalize-space(), "SPLIT TIMES")]]"""))
print ResultsTable
I am trying to find the XPath that will home in on the 'SPLIT TIMES' table found here: https://www.iaaf.org/competitions/iaaf-world-championships/iaaf-world-championships-london-2017-5151/results/men/10000-metres/final/result
I would be grateful if the XPath could be as versatile as possible. For example, the requirement may change so that I need to find the first table after the text '10,000 METRES MEN' (same URL as above), or the first table after the text 'MEDAL TABLE' (different URL): https://www.iaaf.org/competitions/iaaf-world-championships/iaaf-world-championships-london-2017-5151/medaltable
There is a problem with your code: the website you are trying to scrape uses protection that denies the request (the User-Agent is missing from the header, as pointed out in the other answer):
The request could not be satisfied. Request blocked. Generated by cloudfront (CloudFront)
I was able to bypass this by using this library: cloudflare-scrape.
You can install it using pip:
pip install cfscrape
And here is the code with a working XPath for what you are trying to achieve; the trick was to use the "following" axis, as described in the documentation: https://www.w3.org/TR/xpath/#axes.
import cfscrape
from lxml import html
scraper = cfscrape.create_scraper()
page = scraper.get('https://www.iaaf.org/competitions/iaaf-world-championships/iaaf-world-championships-london-2017-5151/results/men/10000-metres/final/result')
tree = html.fromstring(page.content)
table = tree.xpath(".//h2[contains(text(), 'Split times')][1]/following::table[1]")
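Note that xpath() returns a list of matching elements, so you would index into it before reading rows. A rough, untested sketch of pulling the cell text out with lxml:

if table:
    for row in table[0].xpath(".//tr"):
        # text_content() flattens each cell's text, including nested tags
        print([cell.text_content().strip() for cell in row.xpath("./th|./td")])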
You can use the following axis in XPath, something like below:
relative_string = "Split times"
ResultsTable = ResultsTree.xpath("//*[text()[contains(normalize-space(), '"+relative_string+"')]]/following::table")
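Since you wanted the XPath to be versatile, the same pattern should work for the other headings by swapping the search string and adding [1] to take only the first table after the match. An untested sketch (you would first fetch and parse the medal-table URL into ResultsTree the same way; also note the heading text in the HTML may differ in casing from what the page displays, as it did for 'Split times'):

relative_string = "Medal table"
MedalTable = ResultsTree.xpath("//*[text()[contains(normalize-space(), '" + relative_string + "')]]/following::table[1]")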
Venturing into the world of Python. I've done the Codecademy course and trawled through Stack Overflow and YouTube, but I'm hitting an issue I can't solve.
I'm attempting to do a simple print of a table located on Wikipedia. After failing miserably at writing my own code, I decided to use a tutorial example and build off it. However, it isn't working and I haven't the foggiest idea why.
This is the code, with the appropriate link included. My end result is an empty list "[]". I'm using PyCharm 2017.2, beautifulsoup 4.6.0, requests 2.18.4, and Python 3.6.2. Any advice appreciated. For reference, the tutorial website is here.
import requests
from bs4 import BeautifulSoup
WIKI_URL = "https://en.wikipedia.org/wiki/List_of_volcanoes_by_elevation"
req = requests.get(WIKI_URL)
soup = BeautifulSoup(req.content, 'lxml')
table_classes = {"class": ["sortable", "plainrowheaders"]}
wikitables = soup.findAll("table", table_classes)
print(wikitables)
You can accomplish that using regular expressions.
You get the site content with requests.get(WIKI_URL).content.
Look at the source code of the site to see how Wikipedia presents tables in HTML.
Find a regular expression that can match a whole table (something like <table>(?P<table>.*?)</table>, used with re.DOTALL so the dot matches newlines). What this does is capture anything between the <table> and </table> tokens. There is good documentation for regex with Python; take a look at re.findall().
Now you are left with table data. You can use regular expressions again to get data for each row, then regex on each row to get columns. re.findall() is the key again.
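A rough, untested sketch of that approach (regex is brittle for HTML, since real tags carry attributes and tables can nest, so treat this as illustrative only):

import re
import requests

WIKI_URL = "https://en.wikipedia.org/wiki/List_of_volcanoes_by_elevation"
# .text gives a decoded string to run the regex over
html = requests.get(WIKI_URL).text

# Match each <table ...>...</table> block, allowing attributes on the tag.
tables = re.findall(r"<table[^>]*>(?P<table>.*?)</table>", html, re.DOTALL)

for table in tables:
    # Split each table into rows, then each row into header/data cells.
    for row in re.findall(r"<tr[^>]*>(.*?)</tr>", table, re.DOTALL):
        cells = re.findall(r"<t[dh][^>]*>(.*?)</t[dh]>", row, re.DOTALL)
        print(cells)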
I'm trying to use BeautifulSoup to parse some HTML in Python. Specifically, I'm trying to create two arrays of soup objects: one for the dates of postings on a website, and one for the postings themselves. However, when I use findAll on the div class that matches the postings, only the initial tag is returned, not the text inside the tag. On the other hand, my code works just fine for the dates. What is going on?
# store all texts of posts
texts = soup.findAll("div", {"class":"quote"})
# store all dates of posts
dates = soup.findAll("div", {"class":"datetab"})
The first line above returns only
<div class="quote">
which is not what I want. The second line returns
<div class="datetab">Feb<span>2</span></div>
which IS what I want (pre-refining).
I have no idea what I'm doing wrong. Here is the website I'm trying to parse. This is for homework, and I'm really really desperate.
Which version of BeautifulSoup are you using? Version 3.1.0 performs significantly worse with real-world HTML (read: invalid HTML) than 3.0.8. This code works with 3.0.8:
import urllib2
from BeautifulSoup import BeautifulSoup
page = urllib2.urlopen("http://harvardfml.com/")
soup = BeautifulSoup(page)
for incident in soup.findAll('span', { "class" : "quote" }):
    print incident.contents
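If you're not sure which version is installed, BeautifulSoup 3 exposes it on the module, so a quick check is:

import BeautifulSoup
print BeautifulSoup.__version__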
That site is powered by Tumblr, and Tumblr has an API.
There's a Python port of the Tumblr API that you can use to read posts.
from tumblr import Api
api = Api('harvardfml.com')
freq = {}
posts = api.read()
for post in posts:
    # do something with each post here
    pass
As for your failing findAll: without the actual source code of your program, it is hard to see what is wrong.
I'm using Python and BeautifulSoup to parse HTML pages. Unfortunately, for some pages (> 400K), BeautifulSoup is truncating the HTML content.
I use the following code to get the set of "div"s:
from BeautifulSoup import BeautifulSoup, SoupStrainer

findSet = SoupStrainer('div')
set = BeautifulSoup(htmlSource, parseOnlyThese=findSet)
for it in set:
    print it
At a certain point, the output looks like:
correct string, correct string, incomplete/truncated string ("So, I")
although htmlSource contains the string "So, I am bored", and many others. I would also mention that when I prettify() the tree, I see the HTML source truncated.
Do you have an idea how can I fix this issue?
Thanks!
Try using lxml.html. It is a faster, better HTML parser, and it deals with broken HTML better than the latest BeautifulSoup. It works fine on your example page, parsing the entire page.
import lxml.html
doc = lxml.html.parse('http://voinici.ceata.org/~sana/test.html')
print len(doc.findall('//div'))
The code above returns 131 divs.
I found a solution to this problem using BeautifulSoup at beautifulsoup-where-are-you-putting-my-html, which I think is easier than lxml.
The only thing you need to do is install html5lib:
pip install html5lib
and add it as a parameter to BeautifulSoup:
soup = BeautifulSoup(html, 'html5lib')
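For reference, here is the original loop rewritten against bs4 with html5lib (a sketch; "page.html" is a stand-in for however you obtain htmlSource). One caveat: bs4's SoupStrainer is not honored by the html5lib parser (the whole document is parsed regardless), so the plain loop below skips it:

from bs4 import BeautifulSoup

with open("page.html") as f:  # stand-in for your htmlSource
    htmlSource = f.read()

# html5lib builds the tree the way a browser would, which avoids the
# truncation seen with stricter parsers on large or broken pages.
soup = BeautifulSoup(htmlSource, "html5lib")
for div in soup.find_all("div"):
    print(div)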