I'm relatively new to Python and wanted to see if there is any way to scrape the Inspect Element section of the RateMyProfessors site. My goal is to obtain all the professor IDs, which are only located in that area.
When attempting to obtain the code, I tried:
import requests
r = requests.get('http://www.ratemyprofessors.com/search.jsp?queryBy=schoolId&schoolName=California+State+University%2C+Northridge&schoolID=163&queryoption=TEACHER')
print (r.text)
But unfortunately I only received the page source, which doesn't include the ID information.
The IDs are located in the Inspect Element view, and I was wondering if there is a special link I'm just not seeing that would help me extract this data.
This is for a college project, in case anyone was curious; any suggestions will help!
Thanks again!
UPDATE
Thank you for all the feedback, I really appreciate it, but I'm still not understanding how I would be able to obtain the element information using the link that returns the page source.
Here I placed arrows indicating what I'm seeing: the link in my "requests.get" provides the code on the left, and my goal is to find a URL, or something similar, that lets me extract the information on the right.
I really want to understand what is going on and the proper way to approach this; if someone can explain the process of how this can be achieved, I would greatly appreciate it.
Once again, thank you everyone for contributing, I really appreciate it!
I did not test this, but you can use the BeautifulSoup library to parse the HTML, then find every div with class 'result-list' and call find_all('li') on it. From each li you can take its id attribute, split that string, and keep the last piece. Something like this:
import requests
from bs4 import BeautifulSoup

r = requests.get('http://www.ratemyprofessors.com/search.jsp?queryBy=schoolId&schoolName=California+State+University%2C+Northridge&schoolID=163&queryoption=TEACHER')
soup = BeautifulSoup(r.content, 'html.parser')

for divtag in soup.find_all('div', {'class': 'result-list'}):
    for litag in divtag.find_all('li'):
        # litag.get('id') holds the li's id attribute; split that string and
        # take the last piece to isolate the professor ID, as described above
        print(litag.text)
I did not test my code, but that is the logic.
Just a heads up: it is against Rate My Professors' TOS to scrape data from their site. You may want to abandon this project.
Related
In my example code below I have navigated to Obama's first Instagram post. I am trying to point to the portion of the page that is his post and the comments beside it.
driver.get("https://www.instagram.com/p/B-Sj7CggmHt/")
element = driver.find_element_by_css_selector("div._97aPb")
I want this to work for the page of any post and of any Instagram user, but it seems that the XPath for the post alongside the comments changes. How can I find the post image + comments combined block regardless of which post it is? Would appreciate any help, thank you.
I would also like to be able to individually point to the image and individually point to the comments. I have gone through multiple user profiles and multiple posts, but both the XPaths and CSS selectors seem to change. I would also appreciate guidance on any reading or resources where I can learn how to properly point to different HTML elements.
You could try selecting based on the top level structure. Looking more closely, there is always an article tag, and then the photo is in the 4th div in, right under the header.
You can do this with BeautifulSoup with something like this:
from bs4 import BeautifulSoup

soup = BeautifulSoup(driver.page_source, 'html.parser')
article = soup.find('article')
divs_in_article = article.find_all('div')
divs_in_article[3] should have the data you are looking for. If BeautifulSoup grabs divs under that first header tag, you may have to get creative and skip that tag first; a rough sketch of that is below. I would test it myself but I don't have ChromeDriver running right now.
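The sketch, assuming the Selenium driver from the question is already on the post page:
from bs4 import BeautifulSoup

# Continuing from the snippet above: parse the rendered page, then keep
# only the divs in the article that are not inside the header
soup = BeautifulSoup(driver.page_source, 'html.parser')
article = soup.find('article')
header = article.find('header')
divs_outside_header = [
    d for d in article.find_all('div')
    if header is None or header not in d.parents
]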
Alternatively, you could try:
images = soup.find_all('img')
to grab all image tags in the page. This may work too.
BeautifulSoup has a lot of handy methods to help you select tags based on structure. Take a look at going back and forth, going sideways, going down and going up. You should be able to discern the structure using the developer tools in your browser and then come up with a way to select the collections you care about for the comments.
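For instance, a small toy illustration of those navigation attributes (the HTML here is made up, not Instagram's real markup):
from bs4 import BeautifulSoup

html_doc = '<article><header><h2>title</h2></header><div><img src="a.jpg"/></div><div>comments</div></article>'
soup = BeautifulSoup(html_doc, 'html.parser')

img = soup.find('img')
print(img.parent.name)               # going up: 'div'
print(img.parent.next_sibling.text)  # going sideways: 'comments'
print(soup.article.header.h2.text)   # going down: 'title'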
I know there are many similar questions, but I've been through all of those and they couldn't help me. I'm trying to get information from a website, and I've used the same method on other websites with success. Here however, it doesn't work. I would very much appreciate if somebody could give me a few tips!
I want to get the max temperature for tomorrow from this website.
import re, requests, time
from lxml import html
page = requests.get('http://www.weeronline.nl/Europa/Nederland/Amsterdam/4058223')
tree = html.fromstring(page.content)
a = tree.xpath('//*[@id="app"]/div/div[2]/div[5]/div[2]/div[2]/div[6]/div/div/div/div/div/div/ul/div[2]/div/li[1]/div/span/text()')
print(a)
This returns an empty list, however. The same method on a few other websites I checked worked fine. I've tried applying this method on other parts of this website and this domain, all to no avail.
Thanks for any and all help!
Best regards
Notice that when you try to open that page you are asked whether you agree to allow cookies (or something like that; I don't speak Dutch). You will need to use something like Selenium to click the button that accepts it, so that you have access to the page you really want. Then you can use the technique discussed at Web Scrape page with multiple sections to get the HTML for that page, and finally apply whatever XPath it takes to retrieve the content that you want.
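A rough sketch of that flow; the consent-button locator is only a guess (Akkoord is a common Dutch label for agreeing) and the final XPath is a placeholder, so both need to be checked against the real page in your browser's developer tools:
import time
from lxml import html
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('http://www.weeronline.nl/Europa/Nederland/Amsterdam/4058223')

# Dismiss the cookie/consent dialog so the rest of the page loads
driver.find_element_by_xpath('//button[contains(., "Akkoord")]').click()
time.sleep(2)  # crude wait for the page to settle

# Hand the rendered HTML to lxml and apply whatever XPath you need
tree = html.fromstring(driver.page_source)
print(tree.xpath('//*[@id="app"]//text()'))  # replace with the XPath for tomorrow's max temperature

driver.quit()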
So I'm trying to scrape all the subcategories and pages under the category header of the Category page: "Category: Class-based programming languages" found at:
https://en.wikipedia.org/wiki/Category:Class-based_programming_languages
I've figured out a way to do this using URLs and the MediaWiki API (categorymembers). The way to do that would be:
base (all members): en.wikipedia.org/w/api.php?action=query&list=categorymembers&cmtitle=Category:Class-based%20programming%20languages&format=json&cmlimit=500
base (subcategories only): en.wikipedia.org/w/api.php?action=query&list=categorymembers&cmtitle=Category:Class-based%20programming%20languages&format=json&cmlimit=500&cmtype=subcat
However, I can't find a way to accomplish this using Python. Can anyone help me out here?
This is for independent study and I've spent a lot of time on this, but I just can't seem to figure it out. Also, the use of BeautifulSoup is prohibited. Thank you for all the help!
OK, so after doing more research and study, I was able to find an answer to my own question. Using the urllib.request and json libraries, I opened the Wikipedia API URL, loaded the JSON response, and simply printed the categories from it. Here's the code I used to get the subcategories:
import urllib.request, json

pages = urllib.request.urlopen("https://en.wikipedia.org/w/api.php?action=query&list=categorymembers&cmtitle=Category:Class-based%20programming%20languages&format=json&cmlimit=500&cmtype=subcat")
data = json.load(pages)
query = data['query']
category = query['categorymembers']
for x in category:
    print(x['title'])
And you can do the same thing for pages in the category; a small sketch of that is below. Thanks to Nemo for trying to help me out!
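The sketch, assuming the pages variant only needs cmtype=page in the query string:
import urllib.request, json

url = ("https://en.wikipedia.org/w/api.php?action=query&list=categorymembers"
       "&cmtitle=Category:Class-based%20programming%20languages"
       "&format=json&cmlimit=500&cmtype=page")
with urllib.request.urlopen(url) as pages:
    data = json.load(pages)

for member in data['query']['categorymembers']:
    print(member['title'])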
import requests
from lxml import html
wiki_page = requests.get('https://en.wikipedia.org/wiki/Category:Class-based_programming_languages')
tree = html.fromstring(wiki_page.content)
To build your intuition of how to use this, right click on, say, 'C++', click 'Inspect', and you'll see that the panel on the right has highlighted
<a class="CategoryTreeLabel CategoryTreeLabelNs14 CategoryTreeLabelCategory" href="/wiki/Category:C%2B%2B">C++</a>
Right click on this, and click 'copy xpath'. For C++ this will give you
//*[#id="mw-subcategories"]/div/ul[1]/li/div/div[1]/a
Similarly, under the pages, for 'ActionScript' we get
//*[#id="mw-pages"]/div/div/div[1]/ul/li[1]/a
So if you're looking for all the subcategory/page names, you could do, for example
pages = tree.xpath('//*[@id="mw-pages"]//a/text()')
subcategories = tree.xpath('//*[@id="mw-subcategories"]//a/text()')
For more information see here and here
So I'm having a problem grabbing a page's HTML. For some reason, when I send a request to a site and then use html.fromstring(site.content), it grabs some pages' HTML, but for others it just prints out <Element html at 0x7f6359db3368>.
Is there a reason for this? Something I can do to fix it? Is it some type of security? Also, I don't want to use things like Beautiful Soup or Scrapy yet; I want to learn some more before I decide to get into those libraries.
Maybe this will help a little:
import requests
from lxml import html
a = requests.get('https://www.python.org/')
b = html.fromstring(a.content)
d = b.xpath('.//*[@id="documentation"]/a') #XPath to the blue 'Documentation' link near the top of the page
print(d) #prints [<Element a at 0x104f7f318>]
print(d[0].text) #prints Documentation
You can usually find the XPath with the Chrome developer tools, after viewing the HTML. I'd be happy to give more specific help if you wanted to post the website you're scraping and what you're looking for.
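By the way, the <Element html at 0x...> output in the question isn't a failed request; it's just how lxml displays an element object when you print it. To see the markup behind it you can serialize the element, for example:
import requests
from lxml import html

a = requests.get('https://www.python.org/')
b = html.fromstring(a.content)

# Printing b only shows its repr; tostring() gives the actual HTML
print(html.tostring(b, pretty_print=True).decode('utf-8')[:500])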
I would like to parse a web page in order to retrieve some information from it (my exact problem is to retrieve all the items in this list: http://www.computerhope.com/vdef.htm).
However, I can't figure out how to do it.
A lot of tutorials on the internet start with this (simplified):
html5lib.parse(urlopen("http://www.computerhope.com/vdef.htm"))
But after that, none of the tutorials explain how I can browse the document and get to the HTML part I am looking for.
Some other tutorials explain how to do it with CSSSelector, but again, the tutorials don't start from a web page but from a string instead (e.g. here: http://lxml.de/cssselect.html).
So I tried to create a tree from the web page using this:
fromstring(urlopen("http://www.computerhope.com/vdef.htm").read())
but I got this error:
lxml.etree.XMLSyntaxError: Specification mandate value for attribute itemscope, line 3, column 28
This error is due to an attribute that has no value (e.g. <input attribute></input>), but as I don't control the web page, I can't work around it.
So here are a few questions that could solve my problem:
How can I browse the tree?
Is there a way to make the parser less strict ?
Thank you !
Try using Beautiful Soup; it has some excellent features and makes parsing HTML in Python extremely easy.
Check out their documentation at https://www.crummy.com/software/BeautifulSoup/bs4/doc/
from bs4 import BeautifulSoup
import requests
page = requests.get('http://www.computerhope.com/vdef.htm')
soup = BeautifulSoup(page.text, 'html.parser')
tables = soup.find_all('table')
for link in tables[0].find_all('a'):
    print(link.text)
It prints out all the items in the list; I hope the OP will make adjustments accordingly.
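Alternatively, to stay with lxml and address the "less strict parser" part of the question: lxml's dedicated HTML parser recovers from markup such as the valueless itemscope attribute, unlike the strict XML parser. A minimal sketch (it assumes the items sit inside tables, as the BeautifulSoup snippet above does):
from urllib.request import urlopen
from lxml import html  # html.fromstring uses a forgiving HTML parser

tree = html.fromstring(urlopen('http://www.computerhope.com/vdef.htm').read())

# Browse the tree with XPath, e.g. print the text of every link inside a table
for link in tree.xpath('//table//a'):
    print(link.text_content())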