I would like to parse a web page to retrieve some information from it (my exact problem is to retrieve all the items in this list: http://www.computerhope.com/vdef.htm).
However, I can't figure out how to do it.
A lot of tutorials on the internet start with this (simplified):
html5lib.parse(urlopen("http://www.computerhope.com/vdef.htm"))
But after that, none of the tutorials explains how I can browse the document and get to the HTML part I am looking for.
Some other tutorials explain how to do it with CSSSelector, but again, they all start not from a web page but from a string (e.g. here: http://lxml.de/cssselect.html).
So I tried to create a tree with the web page using this :
fromstring(urlopen("http://www.computerhope.com/vdef.htm").read())
but I got this error :
lxml.etree.XMLSyntaxError: Specification mandate value for attribute itemscope, line 3, column 28. This error is due to an attribute that has no value (e.g. <input attribute></input>), but as I don't control the web page, I can't work around it.
So here are a few questions that could solve my problems :
How can I browse a tree?
Is there a way to make the parser less strict?
Thank you!
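On the second question: lxml's dedicated HTML parser (lxml.html.fromstring) is lenient where lxml.etree's XML parser is strict, so valueless attributes like itemscope parse fine. A minimal sketch (the snippet is illustrative, not the live page):

```python
from lxml.html import fromstring  # lenient HTML parser, unlike lxml.etree.fromstring

# Illustrative snippet with a valueless attribute like the one on the real page:
snippet = ('<ul itemscope>'
           '<li><a href="a.htm">ILOVEYOU</a></li>'
           '<li><a href="b.htm">MyDoom</a></li>'
           '</ul>')
tree = fromstring(snippet)

# Browse the tree with XPath once it is parsed:
for name in tree.xpath('//li/a/text()'):
    print(name)  # ILOVEYOU, then MyDoom
```

The same fromstring call works on the real page's HTML, so the original error disappears without changing the page.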
Try using Beautiful Soup; it has some excellent features and makes parsing in Python extremely easy.
Check out the documentation at https://www.crummy.com/software/BeautifulSoup/bs4/doc/
from bs4 import BeautifulSoup
import requests
page = requests.get('http://www.computerhope.com/vdef.htm')
soup = BeautifulSoup(page.text, 'html.parser')  # specify a parser to avoid a warning
tables = soup.find_all('table')
for link in tables[0].find_all('a'):
    print(link.text)
It prints out all the items in the list; I hope the OP will make adjustments accordingly.
Kind of a noob here. Not my first time web scraping, but this one gives me headaches:
Using lxml, I'm trying to scrape some data from a webpage... I managed to extract data from other websites, but I'm having trouble with this one.
I'm trying to get the value "44 kg CO2-eq/m2" on this website:
https://www.bs2.ch/energierechner/#/?d=%7B%22area%22%3A%22650%22,%22floors%22%3A%224%22,%22utilization%22%3A2,%22climate%22%3A%22SMA%22,%22year%22%3A4,%22distType%22%3A2,%22dhwType%22%3A1,%22heatType%22%3A%22air%22,%22pv%22%3A0,%22measures%22%3A%7B%22walls%22%3Afalse,%22windows%22%3Afalse,%22roof%22%3Afalse,%22floor%22%3Afalse,%22wrg%22%3Afalse%7D,%22prev%22%3A%7B%22walls%22%3Afalse,%22wallsYear%22%3A1,%22windows%22%3Afalse,%22windowsYear%22%3A1,%22roof%22%3Atrue,%22roofYear%22%3A1,%22floor%22%3Afalse,%22floorYear%22%3A1%7D,%22zipcode%22%3A%228055%22%7D&s=4&i=false
import lxml.etree
from lxml import html
import requests
# Request the page
page = requests.get('https://www.bs2.ch/energierechner/#/?d=%7B%22area%22%3A%22650%22,%22floors%22%3A%224%22,%22utilization%22%3A2,%22climate%22%3A%22SMA%22,%22year%22%3A4,%22distType%22%3A2,%22dhwType%22%3A1,%22heatType%22%3A%22air%22,%22pv%22%3A0,%22measures%22%3A%7B%22walls%22%3Afalse,%22windows%22%3Afalse,%22roof%22%3Afalse,%22floor%22%3Afalse,%22wrg%22%3Afalse%7D,%22prev%22%3A%7B%22walls%22%3Afalse,%22wallsYear%22%3A1,%22windows%22%3Afalse,%22windowsYear%22%3A1,%22roof%22%3Atrue,%22roofYear%22%3A1,%22floor%22%3Afalse,%22floorYear%22%3A1%7D,%22zipcode%22%3A%228055%22%7D&s=4&i=false')
tree = html.fromstring(page.content)
scraped_text = tree.xpath(
'//*[@id="bs2-main"]/div/div[2]/div/div[2]/div[4]/div/div[2]/div[3]/div[2]/div[2]/div/div[2]/div[1]')
print(scraped_text)
From the print statement, I just get a blank list [] as the returned value, not the value I am looking for.
I also tried the full XPath, although I know it is not optimal because it depends on possible changes to the site's structure.
scraped_text = tree.xpath(
'/html/body/div[1]/div/div[5]/main/div[3]/div/div[2]/div/div[2]/div[4]/div/div[2]/div[3]/div[2]/div[2]/div/div[2]/div[1]')
print(scraped_text)
From this XPath, I also get an empty list [] from the print statement.
I checked the XPath using "XPath Helper" on Chrome.
I also tried to use BeautifulSoup, but without any luck, as it doesn't handle XPaths.
I found a similar problem on Stack Overflow here: Empty List LXML XPATH.
So it appears that my XPath is probably wrongly defined.
I have tried for days to solve this; any help would be nice, thanks!
Edit: I also tried to get another XPath using ChroPath, but I got this feedback:
It might be child of svg/pseudo element/comment/iframe from different src. Currently ChroPath doesn't support for them.
I presume my XPath may be wrong.
You can't find the element because you're using requests, and requests doesn't execute JavaScript; this page is rendered by JavaScript. You must switch to Selenium WebDriver.
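A minimal Selenium sketch of that approach (assumes ChromeDriver is installed; it waits for the JavaScript to render the bs2-main container that the question's XPath starts from, then you would drill down with your own selector):

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Use the full energierechner URL (with the long #/?d=... fragment) from the question here.
url = 'https://www.bs2.ch/energierechner/'

driver = webdriver.Chrome()
driver.get(url)

# Wait until the JavaScript has rendered the container, instead of reading the page immediately.
container = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, 'bs2-main')))

# From here, drill down with a (preferably short) XPath or CSS selector to the kg CO2-eq/m2 value.
print(container.text)

driver.quit()
```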
I'm currently trying to parse out the different configurations on the following web page:
https://nvd.nist.gov/vuln/detail/CVE-2020-3661
Specifically, I'm looking for data within the 'div' tags with IDs matching 'config-div-\d+'.
I've tried running the following:
import re
import requests
from bs4 import BeautifulSoup

parsed = BeautifulSoup(requests.get('https://nvd.nist.gov/vuln/detail/CVE-2020-3661').content, 'html5lib')
configs = parsed.findAll('div', {'id': re.compile(r'config-div-\d+')})
However, configs always came back empty. I've tried all of the parsers listed on the BeautifulSoup documentation page and they all yield the same result.
So I tried finding only the parent node (div id="vulnCpeTree") and made sure I had the right one by manually drilling down from body.
parent = parsed.find('div', {'id' : 'vulnCpeTree'})
parent.findChildren()
[]
Is there a limit to the depth to which all of the parsers can parse? Is there a way to change this / work around this? Or did I miss something?
Thank you for your help!
Both @Sushanth and @Scratch'N'Purr provided great answers for this question. @Sushanth provided the method you're supposed to use when trying to stay up to date with vulnerabilities, and @Scratch'N'Purr provided an explanation of why I was experiencing the issues in the original question, if I really wanted to go down the route of web scraping.
Thank you both.
I'm attempting to scrape a website, and pull each sheriff's name and county. I'm using devtools in chrome to identify the HTML tag needed to locate that information.
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup

URL = 'https://oregonsheriffs.org/about-ossa/meet-your-sheriffs'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
sheriff_names = soup.find_all('a', class_ = 'eg-sheriff-list-skin-element-1')
sheriff_counties = soup.find_all(class_ = 'eg-sheriff-list-skin-element-2')
However, I'm finding that requests is not pulling the entire page's HTML, even though the tag is at the end. If I scan page.content, I find that Sheriff Harrold is the last sheriff included, and every sheriff from Curtis Landers onwards is missing (I tried pasting the full output of page.content, but it's too long).
My best guess from reading this answer is that the website has JavaScript that loads the remaining part of the page upon interacting with it, which would imply that I need something like Selenium to interact with the page to get the rest of it to load.
However, if you look at the website, it's very simple, so as a novice, part of me thinks there has to be a way to scrape this basic website without using a more complex tool like Selenium. That said, I recognize that the website is WordPress-generated, and WordPress can set up delayed JavaScript even on simple sites.
My questions are:
1) Do I really need to use Selenium to scrape a simple, WordPress-generated website like this? Or is there a way to get the full page to load with just requests? Is there any way to tell when a web page will require a web driver and when requests will be enough?
2) Thinking one step ahead: if I want to scale this project up, how would I be able to tell that requests has not returned the full website, without manually inspecting the results of every website?
Thanks!
Unfortunately, your initial instinct is almost certainly correct. If you look at the page source it seems that they have some sort of lazy loading going on, pulling content from an external source.
A quick look at the page source indicates that they're probably using the "Essential Grid" WordPress theme to do this. I think this supports preloading. If you look at the requests that are made you might be able to ascertain how it's loading this and pull directly from that source (perhaps a REST call, AJAX, etc).
In a generalized sense, I'm afraid that there really isn't any automated way to programmatically determine if a page has 'fully' loaded, as that behavior is defined in code and can be triggered by anything.
If you want to capture information from pages that load content as you scroll, though, I believe Selenium is the tool you'll have to use.
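For scroll-triggered lazy loading specifically, a common Selenium pattern is to keep scrolling to the bottom until the page height stops growing. A sketch (assumes ChromeDriver is installed; the URL is the one from the question):

```python
import time
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://oregonsheriffs.org/about-ossa/meet-your-sheriffs')

# Keep scrolling until the document height stops changing, i.e. nothing new loads.
last_height = driver.execute_script('return document.body.scrollHeight')
while True:
    driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
    time.sleep(2)  # give the lazy loader time to fetch more content
    new_height = driver.execute_script('return document.body.scrollHeight')
    if new_height == last_height:
        break
    last_height = new_height

html = driver.page_source  # now contains the fully loaded page, ready for BeautifulSoup
driver.quit()
```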
In my example code below I have navigated to Obama's first Instagram post. I am trying to point to the portion of the page that is his post and the comments beside it.
driver.get("https://www.instagram.com/p/B-Sj7CggmHt/")
element = driver.find_element_by_css_selector("div._97aPb")
I want this to work for the page of any post and of any Instagram user, but it seems that the xpath for the post alongside the comments changes. How can I find the post image + comments combined block regardless of which post it is? Would appreciate any help thank you.
I would also like to be able to individually point to the image and individually point to the comments. I have gone through multiple user profiles and multiple posts but both the xpaths and css selectors seem to change. Would also appreciate guidance on any reading or resources where I can learn how to properly point to different html elements.
You could try selecting based on the top level structure. Looking more closely, there is always an article tag, and then the photo is in the 4th div in, right under the header.
You can do this with BeautifulSoup with something like this:
from bs4 import BeautifulSoup

soup = BeautifulSoup(driver.page_source, 'html.parser')
article = soup.find('article')
divs_in_article = article.find_all('div')
divs_in_article[3] should have the data you are looking for. If BeautifulSoup grabs divs under that first header tag, you may have to get creative and skip that tag first. I would test it myself but I don't have ChromeDriver running right now.
Alternatively, you could try:
images = soup.find_all('img')
to grab all image tags in the page. This may work too.
BeautifulSoup has a lot of handy methods for navigating based on structure. Take a look at going back and forth, going sideways, going down and going up. You should be able to discern the structure using the developer tools in your browser and then come up with a way to select the collections you care about for comments.
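A small self-contained sketch of that sideways and upwards navigation (the HTML snippet is made up for illustration, not Instagram's real markup):

```python
from bs4 import BeautifulSoup

# Illustrative markup: an article with a header, then a photo div and a comments div.
html_doc = ('<article><header><h2>user</h2></header>'
            '<div><img src="photo.jpg"/></div>'
            '<div class="comments">nice post</div></article>')
soup = BeautifulSoup(html_doc, 'html.parser')

header = soup.find('header')
photo_div = header.find_next_sibling('div')      # going sideways from the header
print(photo_div.img['src'])                      # photo.jpg
print(photo_div.find_next_sibling('div').text)   # nice post
print(photo_div.find_parent('article').name)     # article -- going up
```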
I am having some trouble using XPath to retrieve text from an HTML page with the lxml library.
The page url is www.mangapanda.com/one-piece/1/1
I want to extract the selected chapter name text from the drop-down select tag. For now I just want the first option, so the XPath to find it is pretty easy:
.//*[@id='chapterMenu']/option[1]/text()
I verified the above using FirePath and it gives the correct data, but when I try to use lxml for the purpose, I get no data at all.
from lxml import html
import requests
r = requests.get("http://www.mangapanda.com/one-piece/1/1")
page = html.fromstring(r.text)
name = page.xpath(".//*[@id='chapterMenu']/option[1]/text()")
But nothing is stored in name. I even tried other XPaths like:
//div/select[@id='chapterMenu']/option[1]/text()
//select[@id='chapterMenu']/option[1]/text()
The above were also verified using FirePath. I am unable to figure out what the problem could be. I would appreciate some assistance with this.
But it is not that nothing works. An XPath that does work with lxml here is:
.//img[@id='img']/@src
Thank you.
I've had a look at the HTML source of that page, and the content of the element with the id chapterMenu is empty.
I think your problem is that it is filled using JavaScript, and JavaScript will not be evaluated automatically just by reading the HTML with lxml.html.
You might want to have a look at this:
Evaluate javascript on a local html file (without browser)
Maybe you're able to trick it though... In the end, the JavaScript also fetches the information using a GET request. In this case it requests: http://www.mangapanda.com/actions/selector/?id=103&which=191919
That is JSON and can easily be turned into a Python dict/list using the json library.
But you have to find out how to get the id and the which parameter if you want to automate this.
The id is part of the HTML: look for document['mangaid'] within one of the script tags. The which parameter apparently has to be 0; although I couldn't find it in any source, when it is 0 you will be redirected to the proper URL.
So there you go ;)
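Extracting the mangaid from the page source can be sketched with a regular expression (the script snippet below illustrates the pattern; it is not a verbatim copy of the page):

```python
import re

# Illustrative script content of the kind found in the page source:
page_source = "<script>document['mangaid'] = 103;</script>"

match = re.search(r"document\['mangaid'\]\s*=\s*(\d+)", page_source)
manga_id = int(match.group(1))
print(manga_id)  # 103
```

In practice you would feed r.text from the requests response into the same search, then build the actions/selector URL with that id.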
The source document of the page you are requesting is in a default namespace:
<html xmlns="http://www.w3.org/1999/xhtml">
even if FirePath does not tell you about this. The proper way to deal with namespaces is to declare them in your code, which means associating them with a prefix and then prefixing element names in XPath expressions.
name = page.xpath('//*[@id="chapterMenu"]/xhtml:option[1]/text()',
                  namespaces={'xhtml': 'http://www.w3.org/1999/xhtml'})
Then, the piece of the document the path expression above is concerned with is:
<select id="chapterMenu" name="chapterMenu"></select>
As you can see, there is no option element inside it. Please tell us what exactly you'd like to find.
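To see the namespace effect in isolation, here is a self-contained sketch with a tiny namespaced document (illustrative, not the real page):

```python
from lxml import etree

doc = etree.fromstring(
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    '<select id="chapterMenu"><option>Chapter 1</option></select>'
    '</body></html>')

# An unprefixed name matches nothing, because every element is in the xhtml namespace:
print(doc.xpath('//option'))  # []
# With the prefix declared, the same element is found:
print(doc.xpath('//xhtml:option/text()',
                namespaces={'xhtml': 'http://www.w3.org/1999/xhtml'}))  # ['Chapter 1']
```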