So I'm having a problem grabbing a page's HTML. For some reason, when I send a request to the site and then use html.fromstring(site.content), it grabs some pages' HTML, but for others it just prints out <Element html at 0x7f6359db3368>.
Is there a reason for this? Something I can do to fix it? Is it some type of security? Also, I don't want to use things like Beautiful Soup or Scrapy yet; I want to learn some more before I decide to get into those libraries.
Maybe this will help a little:
import requests
from lxml import html

a = requests.get('https://www.python.org/')
b = html.fromstring(a.content)
d = b.xpath('.//*[@id="documentation"]/a')  # XPath to the blue 'Documentation' link near the top of the page (note: @id, not #id)
print(d)  # prints [<Element a at 0x104f7f318>]
print(d[0].text)  # prints Documentation
You can usually find the XPath with the Chrome Developer Tools after viewing the HTML. I'd be happy to give more specific help if you wanted to post the website you're scraping and what you're looking for.
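As for the <Element html at 0x...> output: that's just the repr of the parsed element, not a failure. If you want readable output, ask the element for its text or serialize it back to HTML. A minimal sketch using the same page:

import requests
from lxml import html

a = requests.get('https://www.python.org/')
b = html.fromstring(a.content)
print(b.text_content()[:100])  # all the text inside <html>, first 100 chars
print(html.tostring(b)[:100])  # the element serialized back to (bytes of) HTML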
I am trying to parse the HTML page of a popular music streaming web app with BeautifulSoup, using the find_all function to look for a particular CSS class.
Workflow looks like:
r = requests.get('URL')
soup = BeautifulSoup(r.content, 'html.parser')
soup.find_all('tag', class_='class-name-here')
The output is an empty list, which tells me it's not finding the class I'm looking for.
Here is the kicker: when I open the developer tools / the HTML page source, I can traverse the tree and find the class I am looking for.
Any ideas why it's not being loaded, and how I can load it into my Python instance?
Thank you,
P.S. If any of my semantics/verbiage is incorrect, please feel free to edit. I am not a web dev, just an enthusiast. >_<
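One quick way to diagnose this ('URL' and 'class-name-here' are placeholders for your actual values): check whether the class name appears anywhere in the raw HTML that requests received. If it doesn't, the element is being added later by JavaScript, and no parser pointed at r.content will ever find it.

import requests

r = requests.get('URL')  # placeholder URL
# If this prints False, the class only exists in the browser's live DOM,
# i.e. it is injected by JavaScript after the page loads.
print('class-name-here' in r.text)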
I'm relatively new to Python and wanted to see if there is any means to scrape the Inspect Element section of the RateMyProfessors site. My goal is to obtain all the professor IDs, which are only located in that area.
When attempting to obtain the code, I tried:
import requests
r = requests.get('http://www.ratemyprofessors.com/search.jsp?queryBy=schoolId&schoolName=California+State+University%2C+Northridge&schoolID=163&queryoption=TEACHER')
print (r.text)
But unfortunately I only received the page source, which doesn't provide the ID information.
The IDs are located in the Inspect Element section, and I was wondering if there is a special link I'm just not seeing that would help me extract this data.
This is for a college project, in case anyone was curious; any suggestions will help!
Thanks again!
UPDATE
Thank you for all the feedback, I really appreciate it, but I'm still not understanding how I would be able to obtain the information in those elements from the link of the source code.
Here I placed arrows indicating what I'm seeing: the link in my requests.get provides the code on the left, and my goal is to find a URL, or something else, that lets me extract the information on the right.
I really want to understand what is going on and the proper way to approach this; if someone can explain the process of how this can be achieved, I would greatly appreciate it.
Once again, thank you everyone for contributing, I really appreciate it!
I did not test this, but you can use the BeautifulSoup library to parse the HTML: find all divs with the class 'result-list', then call find_all for the 'li' elements inside each. From there you can take each li's id attribute, split it, and keep the last piece. Something like this:
import requests
from bs4 import BeautifulSoup

r = requests.get('http://www.ratemyprofessors.com/search.jsp?queryBy=schoolId&schoolName=California+State+University%2C+Northridge&schoolID=163&queryoption=TEACHER')
soup = BeautifulSoup(r.content, 'html.parser')

for divtag in soup.find_all('div', {'class': 'result-list'}):
    for litag in divtag.find_all('li'):
        # e.g. an id like "my-professor-1234567" would yield "1234567" (the exact id format is an assumption)
        print(litag.get('id', '').split('-')[-1])
I did not test my code, but that is the logic.
Just a heads up: it is against Rate My Professors' TOS to scrape data from their site. You may want to abandon this project.
I would like to parse a web page in order to retrieve some information about it (my exact problem is to retrieve all the items in this list: http://www.computerhope.com/vdef.htm).
However, I can't figure out how to do it.
A lot of tutorials on the internet start with this (simplified) :
html5lib.parse(urlopen("http://www.computerhope.com/vdef.htm"))
But after that, none of the tutorials explain how I can browse the document and get to the HTML part I am looking for.
Some other tutorials explain how to do it with CSSSelector, but again, all of them start from a string rather than a web page (e.g. here: http://lxml.de/cssselect.html).
So I tried to create a tree with the web page using this :
fromstring(urlopen("http://www.computerhope.com/vdef.htm").read())
but I got this error :
lxml.etree.XMLSyntaxError: Specification mandate value for attribute itemscope, line 3, column 28
This error is due to an attribute that has no value (e.g. <input attribute></input>), but as I don't control the web page, I can't work around it.
So here are a few questions that could solve my problems :
How can I browse a tree?
Is there a way to make the parser less strict?
Thank you!
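For what it's worth, both questions have an lxml-native answer: lxml.html (as opposed to lxml.etree) uses a forgiving HTML parser that accepts value-less attributes like itemscope, and the resulting tree can be browsed with xpath() or cssselect(). A minimal sketch (the //table//a expression is an assumption about the page's layout, mirroring the table-based answer below):

from urllib.request import urlopen
from lxml import html

tree = html.parse(urlopen('http://www.computerhope.com/vdef.htm'))  # lenient HTML parser, no XMLSyntaxError
root = tree.getroot()
for link in root.xpath('//table//a'):  # browse the tree with an XPath expression
    print(link.text_content())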
Try using Beautiful Soup; it has some excellent features and makes parsing HTML in Python extremely easy.
Check out their documentation at https://www.crummy.com/software/BeautifulSoup/bs4/doc/
from bs4 import BeautifulSoup
import requests

page = requests.get('http://www.computerhope.com/vdef.htm')
soup = BeautifulSoup(page.text, 'html.parser')  # explicit parser avoids bs4's warning
tables = soup.findChildren('table')

for i in tables[0].findAll('a'):
    print(i.text)
It prints out all the items in the list. I hope the OP will make adjustments accordingly.
I have been trying to scrape Facebook comments using Beautiful Soup on the below website pages.
import BeautifulSoup
import urllib2

url = 'http://techcrunch.com/2012/05/15/facebook-lightbox/'
fd = urllib2.urlopen(url)
soup = BeautifulSoup.BeautifulSoup(fd)
# soup("div", ...) returns a list, which has no .find(); use soup.find() for a single tag
fb_comment = soup.find("div", {"class": "postText"})
print fb_comment
The output is a null set. However, I can clearly see the Facebook comment within those tags when I inspect the element on the TechCrunch site. (I am a little new to Python and was wondering if the approach is correct and where I am going wrong.)
As Christopher and Thiefmaster said: it is all because of JavaScript.
But if you really need that information, you can still retrieve it thanks to Selenium (http://seleniumhq.org), then use BeautifulSoup on its output.
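A minimal sketch of that combination (assuming Selenium and a browser driver are installed; the postText class comes from the question above):

from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Firefox()  # any installed browser driver works
driver.get('http://techcrunch.com/2012/05/15/facebook-lightbox/')
# driver.page_source is the DOM *after* JavaScript has run, unlike urllib2's response.
# Note: if the comments render inside an iframe, you may also need driver.switch_to.frame(...)
soup = BeautifulSoup(driver.page_source, 'html.parser')
print(soup.find('div', {'class': 'postText'}))
driver.quit()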
Facebook comments are loaded dynamically using AJAX. You can scrape the original page to retrieve this:
<fb:comments href="http://techcrunch.com/2012/05/15/facebook-lightbox/" num_posts="25" width="630"></fb:comments>
After that you need to send a request to some Facebook API that will give you the comments for the URL in that tag.
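Extracting that tag's href from the raw page is straightforward; a sketch using bs4 (the follow-up request to the Facebook API is omitted, since its exact form is not shown here):

import requests
from bs4 import BeautifulSoup

r = requests.get('http://techcrunch.com/2012/05/15/facebook-lightbox/')
soup = BeautifulSoup(r.text, 'html.parser')
tag = soup.find('fb:comments')  # the tag shown above
if tag is not None:
    print(tag['href'])  # the URL to hand to the Facebook comments API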
The parts of the page you are looking for are not included in the source file; you can see this for yourself by opening the page source in a browser.
You will need to use something like pywebkitgtk to have the JavaScript executed before passing the document to BeautifulSoup.
I am attempting to write a program that, as an example, will scrape the top price off of this web page:
http://www.kayak.com/#/flights/JFK-PAR/2012-06-01/2012-07-01/1adults
First, I am easily able to retrieve the HTML by doing the following:
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3; this is Python 2 era code
import mechanize

webpage = 'http://www.kayak.com/#/flights/JFK-PAR/2012-06-01/2012-07-01/1adults'
br = mechanize.Browser()
data = br.open(webpage).get_data()  # the raw HTML the server sends back
soup = BeautifulSoup(data)
print soup
However, the raw HTML does not contain the price. The browser does... its thing (clarification here might help me also)... and retrieves the price from elsewhere while it constructs the DOM tree.
I was led to believe that mechanize would act just like my browser and return the DOM tree, which I am also led to believe is what I see when I look at, for example, Chrome's Developer Tools view of the page. (If I'm incorrect about this, how do I go about getting whatever that price information is stored in?) Is there something I need to tell mechanize to do in order to see the DOM tree?
Once I can get the DOM tree into python, everything else I need to do should be a snap. Thanks!
Mechanize and BeautifulSoup are unbeatable tools for web scraping in Python.
But you need to understand what each is meant for:
Mechanize: mimics browser functionality on a web page, such as following links and submitting forms.
BeautifulSoup: an HTML parser that works well even when the HTML is not well-formed.
Your problem seems to be JavaScript: the price is populated via an AJAX call made with JavaScript, and mechanize does not execute JavaScript, so any content that results from it will remain invisible to mechanize.
Take a look at this: http://github.com/davisp/python-spidermonkey/tree/master
It wraps mechanize and BeautifulSoup with JS execution.
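Another route, which avoids a JavaScript engine entirely: open the browser's Developer Tools, watch the Network tab while the page loads, and call the underlying AJAX endpoint directly with requests. A hypothetical sketch; the URL and parameters below are placeholders, not Kayak's actual API:

import requests

# Placeholder endpoint: substitute the real XHR URL found in the Network tab
resp = requests.get(
    'http://www.example.com/api/flight-prices',   # hypothetical URL
    params={'route': 'JFK-PAR', 'adults': 1},     # hypothetical parameters
)
print(resp.json())  # such endpoints usually return JSON you can use directly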
Answering my own question because in the years since asking this I have learned a lot. Today I would use Selenium Webdriver to do this job. Selenium is exactly the tool I was looking for back in 2012 for this type of web scraping project.
https://www.seleniumhq.org/download/
http://chromedriver.chromium.org/
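A minimal WebDriver sketch for this kind of job (the .price selector is an assumption standing in for whatever selector the real page uses):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()  # chromedriver must be on your PATH
try:
    driver.get('http://www.kayak.com/#/flights/JFK-PAR/2012-06-01/2012-07-01/1adults')
    # Wait until the JavaScript-rendered element actually exists in the DOM
    price = WebDriverWait(driver, 30).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, '.price'))  # hypothetical selector
    )
    print(price.text)
finally:
    driver.quit()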