I'm currently trying to parse out the different configurations on the following web page:
https://nvd.nist.gov/vuln/detail/CVE-2020-3661
Specifically, I'm looking for data within the div tags whose IDs match the pattern config-div-\d+.
I've tried running the following:
import re
import requests
from bs4 import BeautifulSoup

parsed = BeautifulSoup(requests.get('https://nvd.nist.gov/vuln/detail/CVE-2020-3661').content, 'html5lib')
configs = parsed.find_all('div', {'id': re.compile(r'config-div-\d+')})
However, configs always came back empty. I've tried all of the parsers listed on the BeautifulSoup documentation page and they all yield the same result.
So I tried finding only the parent node (div id="vulnCpeTree") and made sure I had the right one by manually drilling down from body.
parent = parsed.find('div', {'id': 'vulnCpeTree'})
parent.findChildren()
[]
Is there a limit to the depth to which all of the parsers can parse? Is there a way to change this / work around this? Or did I miss something?
Thank you for your help!
Both @Sushanth and @Scratch'N'Purr provided great answers to this question. @Sushanth provided the method you're supposed to use when trying to stay up to date with vulnerabilities, and @Scratch'N'Purr provided an explanation as to why I was experiencing the issues in the original question, if I really wanted to go down the route of web scraping.
Thank you both.
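For anyone landing here later, here is a minimal sketch of the feed/API route @Sushanth suggested. The endpoint and JSON field names are assumptions based on NVD's public REST API, so check the current API documentation before relying on them:

import requests

# Query NVD's REST API for a single CVE instead of scraping the
# JavaScript-rendered detail page (endpoint and fields are assumptions).
resp = requests.get(
    'https://services.nvd.nist.gov/rest/json/cves/2.0',
    params={'cveId': 'CVE-2020-3661'},
)
resp.raise_for_status()

# Field names are assumptions; inspect the actual response to confirm.
for vuln in resp.json().get('vulnerabilities', []):
    for config in vuln.get('cve', {}).get('configurations', []):
        print(config)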
In my example code below I have navigated to Obama's first Instagram post. I am trying to point to the portion of the page that is his post and the comments beside it.
driver.get("https://www.instagram.com/p/B-Sj7CggmHt/")
element = driver.find_element_by_css_selector("div._97aPb")
I want this to work for the page of any post by any Instagram user, but it seems that the XPath for the post alongside the comments changes. How can I find the combined post image + comments block regardless of which post it is? I would appreciate any help, thank you.
I would also like to be able to point to the image and to the comments individually. I have gone through multiple user profiles and multiple posts, but both the XPaths and the CSS selectors seem to change. I would also appreciate guidance on any reading or resources where I can learn how to properly target different HTML elements.
You could try selecting based on the top level structure. Looking more closely, there is always an article tag, and then the photo is in the 4th div in, right under the header.
You can do this with BeautifulSoup using something like this (parsing the rendered page source from your Selenium driver):

from bs4 import BeautifulSoup

soup = BeautifulSoup(driver.page_source, 'html.parser')
article = soup.find('article')
divs_in_article = article.find_all('div')
divs_in_article[3] should have the data you are looking for. If BeautifulSoup grabs divs under that first header tag, you may have to get creative and skip that tag first. I would test it myself, but I don't have ChromeDriver running right now.
Alternatively, you could try:
images = soup.find_all('img')
to grab all image tags in the page. This may work too.
BeautifulSoup has a lot of handy methods for selecting tags based on structure. Take a look at the documentation sections on going down, going up, going sideways, and going back and forth. You should be able to discern the structure using the developer tools in your browser and then come up with a way to select the collections you care about for the comments.
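Putting the pieces together, here is a sketch that pulls the post image and the comments using only structural tags. Instagram's markup changes frequently, so every selector below is an assumption to verify in your browser's developer tools:

from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get("https://www.instagram.com/p/B-Sj7CggmHt/")

soup = BeautifulSoup(driver.page_source, 'html.parser')
article = soup.find('article')  # the post lives inside an <article> tag

if article is not None:
    # Assumption: the post photo is the first <img> inside the article.
    image = article.find('img')
    if image is not None:
        print(image.get('src'))

    # Assumption: comments are <li> items in a <ul> within the article.
    comments = article.find('ul')
    if comments is not None:
        for li in comments.find_all('li'):
            print(li.get_text(' ', strip=True))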
I'm making a system, mostly in Python with Scrapy, in which I can, basically, find information about a specific product. The thing is that the request URL is huge. I gathered that I should replace some parts of it with variables to reach the specific product I'd like to search for, but the URL has so many fields that I don't know, for sure, how to do it.
e.g.: "https://www.amazon.com.br/s?k=demi+lovato+365+dias+do+ano&adgrpid=86887777368&hvadid=392971063429&hvdev=c&hvlocphy=9047761&hvnetw=g&hvpos=1t1&hvqmt=e&hvrand=11390662277799676774&hvtargid=kwd-597187395757&hydadcr=5658_10696978&tag=hydrbrgk-20&ref=pd_sl_21pelgocuh_e%2Frobot.txt"
"demi+lovato+365+dias+do+ano" is the book title, but I can see a lot of information in the URL that I simply can't supply, and of course it changes from title to title. One solution I thought might work was to POST the title I'm looking for via the search bar and find it on the results page, but I don't know if that's the best approach since, in fact, this is the first time I'll be working with web scraping.
Does anyone have a tip for how I can do that? All I could find was how to scrape all products for price comparison, how to scrape specific information about all of those products, and things like that, but nothing about searching for specific products.
Thanks for any contributions; this is very important to me, and sorry for anything, I'm not a very active user and I'm not a native English speaker.
Feel free to give me any advice about user behavior; getting better is always something I aim for.
You should use the Rule class available in the Scrapy framework. This will help you define how to navigate the site and its sub-pages. Additionally, you can configure the link extractor to look at tags other than anchor tags, like span or div, when searching for link URLs. This way, the additional query params in the link will be populated by the Scrapy session as it emulates clicks on the hyperlinks. If you skip the additional query params in the URL, there is a high chance that you will be blocked.
How does scrapy use rules?
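As an illustration, here is a minimal CrawlSpider sketch. The /dp/ link pattern and the #productTitle selector are assumptions about Amazon's page structure, so adjust them after inspecting the real pages:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class ProductSpider(CrawlSpider):
    name = 'products'
    allowed_domains = ['www.amazon.com.br']
    start_urls = ['https://www.amazon.com.br/s?k=demi+lovato+365+dias+do+ano']

    # Follow links that look like product detail pages and hand each
    # response to parse_product.
    rules = (
        Rule(LinkExtractor(allow=r'/dp/'), callback='parse_product'),
    )

    def parse_product(self, response):
        # The selector below is illustrative; verify it in the browser.
        yield {
            'title': response.css('#productTitle::text').get(default='').strip(),
            'url': response.url,
        }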
You don't need to follow that long link at all; often the extra parameters are associated with your current session or settings/filters, and you can keep only what you need.
Here is what I meant:
You can generate the same result using these two URLs:
https://www.amazon.com.br/s?k=demi+lovato+365+dias+do+ano
https://www.amazon.com.br/s?k=demi+lovato+365+dias+do+ano&adgrpid=86887777368&hvadid=392971063429&hvdev=c&hvlocphy=9047761&hvnetw=g&hvpos=1t1&hvqmt=e&hvrand=11390662277799676774&hvtargid=kwd-597187395757&hydadcr=5658_10696978&tag=hydrbrgk-20&ref=pd_sl_21pelgocuh_e%2Frobot.txt
If both links generate the same results, then that's it. Otherwise you will definitely have to play with the different parameters; you can't predict a website's behavior without actually testing it. If having a lot of parameters is an issue, then try something like:
from urllib.parse import quote_plus

base_url = "https://www.amazon.com.br"
title = "demi lovato 365 dias do ano"  # the book title you are searching for
link = base_url + "/s?k=%s&adgrpid=%s&hvadid=%s" % (quote_plus(title), '86887777368', '392971063429')
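If it turns out you need to keep several parameters, building the query string from a dict with urlencode is easier to maintain. A small sketch using the values from the example URL:

from urllib.parse import urlencode

params = {
    'k': 'demi lovato 365 dias do ano',
    'adgrpid': '86887777368',
    'hvadid': '392971063429',
}
# urlencode quotes each value (spaces become '+') and joins with '&'.
link = "https://www.amazon.com.br/s?" + urlencode(params)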
I'm relatively new to Python and wanted to see if there is any means to scrape the Inspect Element section of the RateMyProfessors site. My goal is to obtain all the professor IDs, which are only located in that area.
When attempting to obtain the code, I tried:
import requests
r = requests.get('http://www.ratemyprofessors.com/search.jsp?queryBy=schoolId&schoolName=California+State+University%2C+Northridge&schoolID=163&queryoption=TEACHER')
print(r.text)
But unfortunately I only received the page source, which doesn't provide the ID information.
The IDs are located in the Inspect Element section, and I was wondering if there is a special link I'm just not seeing that would help me extract this data.
This is for a college project, if anyone was curious; any suggestions will help!
Thanks again!
UPDATE
Thank you for all the feedback, I really appreciate it, but I'm still not understanding how I can obtain the element information from the link to the page source.
Here I placed arrows indicating what I'm seeing: the link in my requests.get provides the code on the left, and my goal is to find a URL, or something similar, to be able to extract the information on the right.
I really want to understand what is going on and the proper way to approach this; if someone can explain the process of how this can be achieved, I would greatly appreciate it.
Once again, thank you everyone for contributing, I really appreciate it!
I did not test this, but you can use the BeautifulSoup library to parse the HTML, then find every div with class 'result-list' and run find_all over its 'li' tags. From each li you can take the id attribute, split it, and keep the last position. Something like this:
import requests
from bs4 import BeautifulSoup

r = requests.get('http://www.ratemyprofessors.com/search.jsp?queryBy=schoolId&schoolName=California+State+University%2C+Northridge&schoolID=163&queryoption=TEACHER')
soup = BeautifulSoup(r.content, 'html.parser')

for divtag in soup.find_all('div', {'class': 'result-list'}):
    for litag in divtag.find_all('li'):
        # Assumption: the li's id attribute ends with the professor id,
        # e.g. id="my-professor-12345" -> "12345"
        prof_id = litag.get('id', '').split('-')[-1]
        print(prof_id, litag.text)
I did not test my code, but that is the logic.
Just a heads up: it is against the Rate My Professors TOS to scrape data from their site. You may want to abandon this project.
I know there are many similar questions, but I've been through all of them and they couldn't help me. I'm trying to get information from a website, and I've used the same method on other websites with success. Here, however, it doesn't work. I would very much appreciate it if somebody could give me a few tips!
I want to get the max temperature for tomorrow from this website.
import requests
from lxml import html

page = requests.get('http://www.weeronline.nl/Europa/Nederland/Amsterdam/4058223')
tree = html.fromstring(page.content)
a = tree.xpath('//*[@id="app"]/div/div[2]/div[5]/div[2]/div[2]/div[6]/div/div/div/div/div/div/ul/div[2]/div/li[1]/div/span/text()')
print(a)
This returns an empty list, however. The same method worked fine on a few other websites I checked. I've tried applying this method to other parts of this website and this domain, all to no avail.
Thanks for any and all help!
Best regards
Notice that when you try to open that page you are asked whether you agree to allow cookies. (Something like that; I don't read Dutch.) You will need to use something like Selenium to click a button to OK that, so that you have access to the page that you really want. Then you can use the technique discussed at Web Scrape page with multiple sections to get the HTML for that page, and finally apply whatever XPath it takes to retrieve the content that you want.
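A minimal sketch of that flow. The XPath for the consent button is a guess: "Akkoord" is Dutch for "agree", but inspect the actual dialog to find the right selector:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from lxml import html

driver = webdriver.Chrome()
driver.get('http://www.weeronline.nl/Europa/Nederland/Amsterdam/4058223')

# Wait for the cookie-consent button and click it (XPath is an assumption).
WebDriverWait(driver, 10).until(
    EC.element_to_be_clickable((By.XPATH, '//button[contains(., "Akkoord")]'))
).click()

# The rendered page is now available for the usual lxml/XPath approach.
tree = html.fromstring(driver.page_source)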
I would like to parse a web page in order to retrieve some information from it (my exact problem is to retrieve all the items in this list: http://www.computerhope.com/vdef.htm).
However, I can't figure out how to do it.
A lot of tutorials on the internet start with this (simplified) :
html5lib.parse(urlopen("http://www.computerhope.com/vdef.htm"))
But after that, none of the tutorials explain how I can browse the document and get to the HTML part I am looking for.
Some other tutorials explain how to do it with CSSSelector, but again, all of those tutorials start not with a web page but with a string instead (e.g. here: http://lxml.de/cssselect.html).
So I tried to create a tree with the web page using this :
from urllib.request import urlopen
from lxml.etree import fromstring

fromstring(urlopen("http://www.computerhope.com/vdef.htm").read())
but I got this error :
lxml.etree.XMLSyntaxError: Specification mandate value for attribute itemscope, line 3, column 28
This error is due to an attribute that has no value (e.g. <input attribute></input>), but as I don't control the web page, I can't work around it.
So here are a few questions that could solve my problems :
How can I browse a tree?
Is there a way to make the parser less strict?
Thank you !
Try using Beautiful Soup; it has some excellent features and makes parsing in Python extremely easy.
Check out their documentation at https://www.crummy.com/software/BeautifulSoup/bs4/doc/
from bs4 import BeautifulSoup
import requests

page = requests.get('http://www.computerhope.com/vdef.htm')
soup = BeautifulSoup(page.text, 'html.parser')  # explicit parser avoids a warning
tables = soup.find_all('table')

# The list of virus definitions is in the first table on the page.
for link in tables[0].find_all('a'):
    print(link.text)
It prints out all the items in the list; I hope the OP will make adjustments accordingly.
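On the original question about making the parser less strict: lxml also ships a dedicated HTML parser, lxml.html, which is forgiving about real-world markup, unlike the strict XML parser behind lxml.etree.fromstring. A minimal sketch, assuming (as above) that the list lives in the first table on the page:

from urllib.request import urlopen
from lxml import html

# html.fromstring tolerates valueless attributes like itemscope,
# so it does not raise XMLSyntaxError on this page.
tree = html.fromstring(urlopen("http://www.computerhope.com/vdef.htm").read())
for text in tree.xpath('//table[1]//a/text()'):
    print(text)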