Background information: I'm looking to pull data from ratemyprofessor.com - I have limited programming experience so I decided to see if something was pre-built to accomplish this task.
I came across this here: https://classic.scraperwiki.com/scrapers/ratemyprofessors/
Which is exactly what I'm looking for. ScraperWiki has closed down, but it provided a way to transfer everything to Morph.io, which I did here: https://morph.io/reddyfire/ratemyprofessors
My problem: It doesn't work. It should output a database containing the information I've identified as needed. I'm assuming it has something to do with the URL it's pulling from:
response = scraperwiki.scrape("http://www.ratemyprofessors.com/SelectTeacher.jsp?sid=%s&pageNo=%s" % (sid,str(i)))
But I have no idea if that's right. I'm feeling pretty defeated by this, but I want to keep working toward a solution.
What I need: I'm looking to get the **Name, Department, Total Ratings, Overall Quality, Easiness, and Hotness** ratings for each instructor at the colleges. Here's some sample output in the desired format:
{"953":("Stanford",32),"799":("Rice",17),"780":("Princeton",16)}
I put together a simplified scraper for you. Do note that it isn't Pythonic (i.e. not beautiful or fast), but as a starting point it works.
__author__ = 'Victor'

import urllib
import re

url = 'http://www.ratemyprofessors.com/ShowRatings.jsp?tid=306975'

def crawlURL(addedURL):
    url = addedURL
    html = urllib.urlopen(url).read()
    # grab every piece of text sitting between '">' and '</' in the page
    teacherData = re.findall(r'\">(.*?)</', html)
    output = ''
    addStuff = 0
    for x in xrange(len(teacherData)):
        if teacherData[x] == 'Submit a Correction':
            # the professor's name sits a few fields before this marker
            output = 'professor: '
            for y in xrange(4):
                output += teacherData[x - 8 + y] + ' '
            addStuff = 1
        elif teacherData[x] == 'Helpfulness' and addStuff == 1:
            output += ': Overall quality: ' + str(teacherData[x - 2]) + ': Average grade: ' + str(teacherData[x - 1]) + ': Helpfulness: ' + teacherData[x + 1]
        elif teacherData[x] == 'Easiness' and addStuff == 1:
            output += ': Easiness: ' + str(teacherData[x + 1])
            addStuff = 0
            break
    print output

crawlURL(url)
It renders this output:
Dr. Kimora John Jay College : Overall quality: 5.0: Average grade:
A: Helpfulness: 5.0 : Easiness: 4.6
There is plenty of room for improvement, but this is as close to pseudocode as I could get.
In this example the function prints the output; if you want to add it to a list instead, just add a "return output" at the end and call the function with "listName.append(crawlURL(url))".
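For example, once you've added the return, collecting a few results could look like this (a sketch; the second tid is just a made-up placeholder id):
teacherRatings = []
for tid in (306975, 306976):  # 306976 is a hypothetical id for illustration
    url = 'http://www.ratemyprofessors.com/ShowRatings.jsp?tid=%d' % tid
    teacherRatings.append(crawlURL(url))
print teacherRatings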
This is for Python 2.7
And yes, it doesn't get the exact data you requested. It just opens the door for you ;)
EDIT:
Here is an example of how to loop the requests:
def crawlURL(addedURL):
    ...
    return output

baseURL = 'http://www.ratemyprofessors.com/ShowRatings.jsp?tid=306'
for x in xrange(50):
    url = baseURL + str(x + 110)
    result = crawlURL(url)  # fetch once instead of calling crawlURL twice
    if result != '':
        print result
If you are iterating through all of their data, you should consider adding delays every now and then so you don't accidentally DDoS them.
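For instance, a minimal way to throttle the loop above (a sketch; the half-second pause is an arbitrary choice):
import time

baseURL = 'http://www.ratemyprofessors.com/ShowRatings.jsp?tid=306'
for x in xrange(50):
    result = crawlURL(baseURL + str(x + 110))
    if result != '':
        print result
    time.sleep(0.5)  # pause between requests so the site isn't hammered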
I just started learning Python today, so please take it easy. I have created a method that lets someone enter a number, and it will deliver the park name and weather for that location from MLB and Weather.gov. I have a couple of test cases hard-coded to make sure it works. I want the user to be able to input the venue number so that I can display the information for the proper location. I have searched around quite a bit, but I can't seem to find exactly what I'm looking for.

For example, in the following URL: https://statsapi.mlb.com/api/v1/venues/**3289**?hydrate=location, I want the user to pick the number that goes right there. Once I am able to do this, I should be able to take the latitude and longitude from the API call and use that in the weather API call. I'm just stuck right now. I am using the command line, so all I need is for someone to be able to input mlbweather(xxx) and hit return. I tried using params, but that just seems to append to the end of the URL with a ? and equals, so that doesn't work.
def mlbweather(venueNum):
    citi = 3289
    wrigley = 17
    if venueNum == citi:
        mlb_api = requests.get('https://statsapi.mlb.com/api/v1/venues/3289?hydrate=location')
        mlb_data = mlb_api.text
        parse_json = json.loads(mlb_data)
        venueWanted = parse_json['venues'][0]['name']
        print("Venue:" + " " + venueWanted)
        weather_api = requests.get('https://api.weather.gov/gridpoints/OKX/37,37/forecast')
        weather_data = weather_api.text
        parse_json = json.loads(weather_data)
        weatherWanted = parse_json['properties']['periods'][0]['detailedForecast']
        print("Current Weather: \n" + weatherWanted)
    elif venueNum == wrigley:
        mlb_api = requests.get('https://statsapi.mlb.com/api/v1/venues/17?hydrate=location')
        mlb_data = mlb_api.text
        parse_json = json.loads(mlb_data)
        venueWanted = parse_json['venues'][0]['name']
        print("Venue:" + " " + venueWanted)
        weather_api = requests.get('https://api.weather.gov/gridpoints/LOT/74,75/forecast')
        weather_data = weather_api.text
        parse_json = json.loads(weather_data)
        weatherWanted = parse_json['properties']['periods'][0]['detailedForecast']
        print("Current Weather: \n" + weatherWanted)
    else:
        print("Either you typed an invalid venue number or we don't have that info")
You're looking for simple string concatenation:
def mlbweather(venueNum):
    mlb_api = requests.get('https://statsapi.mlb.com/api/v1/venues/' + str(venueNum) + '?hydrate=location')
    mlb_data = mlb_api.text
    parse_json = json.loads(mlb_data)
    venueWanted = parse_json['venues'][0]['name']
    print("Venue:" + " " + venueWanted)
    weather_api = requests.get('https://api.weather.gov/gridpoints/OKX/37,37/forecast')
    weather_data = weather_api.text
    parse_json = json.loads(weather_data)
    weatherWanted = parse_json['properties']['periods'][0]['detailedForecast']
    print("Current Weather: \n" + weatherWanted)
mlbweather(3289)
Venue: Citi Field
Current Weather:
Patchy fog after 4am. Partly cloudy, with a low around 72. South wind 2 to 7 mph.
Alternatively, you can use f-strings:
mlb_api = requests.get(f'https://statsapi.mlb.com/api/v1/venues/{venueNum}?hydrate=location')
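Since you mentioned wanting to feed the venue's latitude and longitude into the weather call, here is a rough sketch of that chaining. The exact key names in the MLB response (like defaultCoordinates) are an assumption you should verify against the real JSON; the weather.gov side uses its documented /points/{lat},{lon} lookup, which returns the matching forecast URL:
import requests

def mlbweather(venueNum):
    # hydrate=location should include the venue's coordinates
    venue = requests.get(
        f'https://statsapi.mlb.com/api/v1/venues/{venueNum}?hydrate=location'
    ).json()['venues'][0]
    print('Venue: ' + venue['name'])

    # ASSUMPTION: coordinates live under location/defaultCoordinates --
    # print the raw JSON once to confirm the actual key names
    coords = venue['location']['defaultCoordinates']
    lat, lon = coords['latitude'], coords['longitude']

    # weather.gov resolves a lat/lon to the right gridpoint forecast URL
    point = requests.get(f'https://api.weather.gov/points/{lat},{lon}').json()
    forecast = requests.get(point['properties']['forecast']).json()
    print('Current Weather: \n' + forecast['properties']['periods'][0]['detailedForecast'])

mlbweather(3289)  # Citi Field, without hardcoding the gridpoint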
Hi everyone, this is my first time here, and I am a beginner in Python. I am in the middle of writing a program that produces a txt document containing information about a stock (Watchlist Info.txt), based on the input of another txt document containing the company names (Watchlist).
To achieve this, I have written 3 functions, of which 2, reuters_ticker() and stock_price(), are complete, as shown below:
import re

def reuters_ticker(desired_search):
    # from company name, execute Google search and return Reuters stock ticker
    try:
        from googlesearch import search
    except ImportError:
        print('No module named google found')
    query = desired_search + ' reuters'
    for j in search(query, tld="com.sg", num=1, stop=1, pause=2):
        result = j
    ticker = re.search(r'\w+\.\w+$', result)
    return ticker.group()
Stock Price:
import pandas as pd

def stock_price(company, doc=None):
    ticker = reuters_ticker(company)
    request = 'https://www.reuters.com/companies/' + ticker
    raw_main = pd.read_html(request)
    data1 = raw_main[0]
    data1.set_index(0, inplace=True)
    data1 = data1.transpose()
    data2 = raw_main[1]
    data2.set_index(0, inplace=True)
    data2 = data2.transpose()
    stock_info = pd.concat([data1, data2], axis=1)
    if doc == None:
        print(company + '\n')
        print('Previous Close: ' + str(stock_info['Previous Close'][1]))
        print('Forward PE: ' + str(stock_info['Forward P/E'][1]))
        print('Div Yield(%): ' + str(stock_info['Dividend (Yield %)'][1]))
    else:
        from datetime import date
        with open(doc, 'a') as output:
            output.write(date.today().strftime('%d/%m/%y') + '\t' + str(stock_info['Previous Close'][1]) + '\t' + str(stock_info['Forward P/E'][1]) + '\t' + '\t' + str(stock_info['Dividend (Yield %)'][1]) + '\n')
        output.close()
The 3rd function, watchlist_report(), is where I am running into problems writing the information in the desired format.
def watchlist_report(watchlist):
    with open(watchlist, 'r') as companies, open('Watchlist Info.txt', 'a') as output:
        searches = companies.read()
        x = searches.split('\n')
        for i in x:
            output.write(i + ':\n')
            stock_price(i, doc='Watchlist Info.txt')
            output.write('\n')
When I run watchlist_report('Watchlist.txt'), where Watchlist.txt contains 'Apple' and 'Facebook' each on new lines, my output is this:
26/04/20 275.03 22.26 1.12
26/04/20 185.13 24.72 --
Apple:
Facebook:
Instead of what I want and would expect based on the code I have written in watchlist_report():
Apple:
26/04/20 275.03 22.26 1.12
Facebook:
26/04/20 185.13 24.72 --
Therefore, my questions are:
1) Why is my output formatted this way?
2) Which part of my code do I have to change to make the written output in my desired format?
Any other suggestions about how I can clean my code and any libraries I can use to make my code nicer are also appreciated!
You are handling two different file handles: the file handle opened inside your stock_price gets closed earlier, so its data is written first, before the outer watchlist_report file handle gets closed, flushed, and written.
Instead of creating a new open(..) in your function, pass the current file handle:
def watchlist_report(watchlist):
    with open(watchlist, 'r') as companies, open('Watchlist Info.txt', 'a') as output:
        searches = companies.read()
        x = searches.split('\n')
        for i in x:
            output.write(i + ':\n')
            stock_price(i, doc=output)  # pass the file handle
            output.write('\n')
Inside def stock_price(company, doc=None):, use the provided file handle:
def stock_price(company, output=None):  # changed name here
    # [snip] - removed unrelated code for this answer for brevity's sake
    if output is None:  # check for None using IS
        print( ... )  # print whatever you like here
    else:
        from datetime import date
        output.write( .... )  # write whatever you want it to write
        # output.close()  # do not close, the outer function does this
Do not close the file handle in the inner function; the context handling with(..) of the outer function does that for you.
The main takeaway for file handling is that things you write(..) to your file are not necessarily placed there immediately. The file handler chooses when to actually persist data to your disk; the latest it does that is when it goes out of scope (of the context handler) or when its internal buffer reaches some threshold so it "thinks" it is now prudent to alter the data on your disk. See How often does python flush to a file? for more info.
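To see the buffering in action, here is a tiny standalone repro (a sketch; demo.txt is just a scratch file):
# two append-mode handles on the same file, like watchlist_report/stock_price
open('demo.txt', 'w').close()  # start from an empty file

with open('demo.txt', 'a') as outer:
    outer.write('header\n')  # buffered in outer, not on disk yet
    with open('demo.txt', 'a') as inner:
        inner.write('detail\n')  # flushed to disk when inner closes
    # an outer.flush() here would force 'header' out in the intended order

with open('demo.txt') as f:
    print(f.read())  # usually prints 'detail' before 'header'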
I really don't know what's goin' on. I look up (in my search bar) "Hi" and it says 2 results found, but if I look up ANY OTHER WORD it just gives me a 500 error. Anyone?
#! /usr/bin/python
import cgi

data = cgi.FieldStorage()
searchHome = data.getvalue('search')
result = "There was 0 results for search " + searchHome
total = 0
total = int(total)
with open('index.py') as f:
    for line in f:
        finded = line.find(searchHome)
        if finded != -1 and finded != 0:
            total += 1
total = str(total)
result = "There was " + total + " result(s) for search term: " + searchHome
print 'Content-type: text/html\n\r\n\r'
print result
Thanks for your help, Reefive / Sky Coleman. EDIT: Just fixed the code :D
More edits:
OK, I cannot find the error. I went through all the cPanel files and more, looked at the help, etc., and it just gives me a 500 error when I look up anything other than "Hi".
I'm new here to Stack Overflow, but I have found a LOT of answers on this site. I'm also a programming newbie, so I figured I'd join and finally become part of this community, starting with a question about a problem that's been plaguing me for hours.
I log in to a website and scrape a big body of text within the b tag to be converted into a proper table. The layout of the resulting Output.txt looks like this:
BIN STATUS
8FHA9D8H 82HG9F RECEIVED SUCCESSFULLY AWAITING STOCKING PROCESS
INVENTORY CODE: FPBC *SOUP CANS LENTILS
BIN STATUS
HA8DHW2H HD0138 RECEIVED SUCCESSFULLY AWAITING STOCKING PROCESS
8SHDNADU 00A123 #2956- INVALID STOCK COUPON CODE (MISSING).
93827548 096DBR RECEIVED SUCCESSFULLY AWAITING STOCKING PROCESS
There are a bunch of pages with the exact same blocks, but I need them to be combined into an ACTUAL table that looks like this:
BIN INV CODE STATUS
HA8DHW2HHD0138 FPBC-*SOUP CANS LENTILS RECEIVED SUCCESSFULLY AWAITING STOCKING PROCESS
8SHDNADU00A123 FPBC-*SOUP CANS LENTILS #2956- INVALID STOCK COUPON CODE (MISSING).
93827548096DBR FPBC-*SOUP CANS LENTILS RECEIVED SUCCESSFULLY AWAITING STOCKING PROCESS
8FHA9D8H82HG9F SSXR-98-20LM NM CORN CREAM RECEIVED SUCCESSFULLY AWAITING STOCKING PROCESS
Essentially, all the separate text blocks in this example would become part of this table, with the inv code repeating alongside its bin values. I would post my attempts at parsing this data (I have tried pandas/bs/openpyxl/csv writer), but I'll admit they are a little embarrassing, as I cannot find any information on this specific problem. Is there any benevolent soul out there who can help me out? :)
(Also, I am using Python 2.7.)
A simple custom parser like the following should do the trick.
from __future__ import print_function

def parse_body(s):
    line_sep = '\n'
    getting_bins = False
    inv_code = ''
    for l in s.split(line_sep):
        if l.startswith('INVENTORY CODE:') and not getting_bins:
            inv_data = l.split()
            inv_code = inv_data[2] + '-' + ' '.join(inv_data[3:])
        elif l.startswith('INVENTORY CODE:') and getting_bins:
            print("unexpected inventory code while reading bins:", l)
        elif l.startswith('BIN') and l.endswith('STATUS'):
            # header row ("BIN ... STATUS" in the sample data); bins follow
            getting_bins = True
        elif getting_bins and l:
            bin_data = l.split()
            # need to add exception handling here to make sure:
            # 1) we have an inv_code
            # 2) bin_data is at least 3 items big (assuming two for
            #    bin_id and at least one for the status message)
            # 3) maybe some constraint checking to ensure that we have
            #    a valid instance of an inventory code and bin id
            bin_id = ''.join(bin_data[0:2])
            message = ' '.join(bin_data[2:])
            # we now have a bin, an inv_code, and a message to add to our table
            print(bin_id.ljust(20), inv_code.ljust(30), message, sep='\t')
        elif getting_bins and not l:
            # done getting bins for the current inventory code
            getting_bins = False
            inv_code = ''
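For a quick sanity check, you can feed it the sample block (a sketch, with column spacing approximated from the question; note the parser expects the inventory code line to come before its bins, which matches your desired output):
sample = """INVENTORY CODE: FPBC *SOUP CANS LENTILS

BIN                STATUS
HA8DHW2H HD0138    RECEIVED SUCCESSFULLY AWAITING STOCKING PROCESS
8SHDNADU 00A123    #2956- INVALID STOCK COUPON CODE (MISSING).
"""
parse_body(sample)
# prints, tab-separated:
# HA8DHW2HHD0138    FPBC-*SOUP CANS LENTILS    RECEIVED SUCCESSFULLY AWAITING STOCKING PROCESS
# 8SHDNADU00A123    FPBC-*SOUP CANS LENTILS    #2956- INVALID STOCK COUPON CODE (MISSING).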
A rather complex one, but this might get you started:
import re
import pandas as pd
from pandas import DataFrame

rx = re.compile(r'''
    (?:INVENTORY\ CODE:)\s*
    (?P<inv>.+\S)
    [\s\S]+?
    ^BIN.+[\n\r]
    (?P<bin_msg>(?:(?!^\ ).+[\n\r])+)
    ''', re.MULTILINE | re.VERBOSE)

string = your_string_here

# set up the dataframe
df = DataFrame(columns=['BIN', 'INV', 'MESSAGE'])

rxbinmsg = re.compile(r'^(?P<bin>(?:(?!\ {2}).)+)\s+(?P<message>.+\S)\s*$', re.MULTILINE)

for match in rx.finditer(string):
    inv = match.group('inv')
    bin_msg_raw = match.group('bin_msg').split("\n")
    for item in bin_msg_raw:
        for m in rxbinmsg.finditer(item):
            # append it to the dataframe
            df.loc[len(df.index)] = [m.group('bin'), inv, m.group('message')]

print(df)
Explanation
It looks for INVENTORY CODE and sets up the groups (inv and bin_msg) for further processing in the loop that follows (note: it would be easier if there were only one line of bin/msg, as the group has to be split afterwards).
Afterwards, it splits the bin and msg parts and appends everything to the df object.
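For a quick test, you could replace your_string_here with the sample block from the question (a sketch; it assumes the bin and message columns are separated by at least two spaces, which is what the inner regex keys on) and then run the loop above unchanged:
string = """INVENTORY CODE: FPBC *SOUP CANS LENTILS

BIN                STATUS
HA8DHW2H HD0138    RECEIVED SUCCESSFULLY AWAITING STOCKING PROCESS
8SHDNADU 00A123    #2956- INVALID STOCK COUPON CODE (MISSING).
"""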
I have some code written for a website-scraping task which may help you.
Basically, what you need to do is right-click on the web page, view the HTML, and find the tag for the table you are looking for, then extract the information using a parsing module (I am using BeautifulSoup). I am creating JSON because I need to store it in MongoDB; you can create a table instead.
#! /usr/bin/python
import sys
import requests
import re
from BeautifulSoup import BeautifulSoup
import pymongo

def req_and_parsing():
    url2 = 'http://businfo.dimts.in/businfo/Bus_info/EtaByRoute.aspx?ID='
    list1 = ['534UP', '534DOWN']
    # fetch and parse each route, collecting one dict per route
    outdict = [parsing_file(requests.get(url2 + Route).text, Route) for Route in list1]
    print outdict
    conn = f_connection()
    for i in range(len(outdict)):
        insert_records(conn, outdict[i])

def parsing_file(txt, Route):
    soup = BeautifulSoup(txt)
    table = soup.findAll("table", {"id": "ctl00_ContentPlaceHolder1_GridView2"})
    trtddict = {}
    divtags = soup.findAll("span", {"id": "ctl00_ContentPlaceHolder1_ErrorLabel"})
    for divtag in divtags:
        print "div tag - ", divtag.text
        # compare against each message explicitly; a bare `== a or "b"` is always true
        if divtag.text in ("Currently no bus is running on this route",
                           "This is not a cluster (orange bus) route"):
            print "Page not displayed. Errored with below message for Route-", Route, " , ", divtag.text
            sys.exit()
    trtags = table[0].findAll('tr')
    for trtag in trtags:
        tdtags = trtag.findAll('td')
        if len(tdtags) == 2:
            trtddict[tdtags[0].text] = sub_colon(tdtags[1].text)
    return trtddict

def sub_colon(tag_str):
    # swap ';' for ',' inside values before storing them
    return re.sub(';', ',', tag_str)

def f_connection():
    try:
        conn = pymongo.MongoClient()
        print "Connected successfully!!!"
    except pymongo.errors.ConnectionFailure, e:
        print "Could not connect to MongoDB: %s" % e
    return conn

def insert_records(conn, stop_dict):
    db = conn.test
    print db.collection_names()
    mycoll = db.stopsETA
    mycoll.insert(stop_dict)

if __name__ == "__main__":
    req_and_parsing()
@martineau I have updated my code; is this what you meant? How do I handle KeyError instead of NameError?
url = "http://app2.nea.gov.sg/anti-pollution-radiation-protection/air-pollution/psi/psi-readings-over-the-last-24-hours"
web_soup = soup(urllib2.urlopen(url))
table = web_soup.find(name="div", attrs={'class': 'c1'}).find_all(name="div")[4].find_all('table')[0]
data = {}
cur_time = datetime.datetime.strptime("12AM", "%I%p")
for tr_index, tr in enumerate(table.find_all('tr')):
if 'Time' in tr.text:
continue
for td_index, td in enumerate(tr.find_all('td')):
if not td_index:
continue
data[cur_time] = td.text.strip()
if td.find('strong'):
bold_time = cur_time
data[bold_time] = '20'
cur_time += datetime.timedelta(hours=1)
default_value = '20' # whatever you want it to be
try:
bold = data[bold_time]
except NameError:
bold_time = beforebold = beforebeforebold = default_value
# might want to set "bold" to something, too, if needed
else:
beforebold = data.get(bold_time - datetime.timedelta(hours=1))
beforebeforebold = data.get(bold_time - datetime.timedelta(hours=2))
This is where I print my data to do the calculation:
print bold
print beforebold
print beforebeforebold
You need to add something to set data[bold_time]:
if td.find('strong'):
    bold_time = cur_time
    data[bold_time] = ?????  # whatever it should be
cur_time += datetime.timedelta(hours=1)
This should avoid both the NameError and KeyError exceptions as long as the word strong is found. You still might want to code defensively and handle one or both of them gracefully. That's what exceptions were meant to do: handle those exceptional cases that shouldn't happen...
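One defensive pattern (a sketch in the same Python 2 style as the code above; the scraping loop itself is elided) is to initialize bold_time up front so the NameError can't occur, then use dict.get() so the KeyError can't either:
import datetime

data = {}  # hour -> reading, filled in by the scraping loop
bold_time = None  # set up front; the loop may overwrite it when <strong> is found

# ... the table-scraping loop from the question runs here ...

default_value = '20'
if bold_time is None:
    bold = beforebold = beforebeforebold = default_value
else:
    bold = data.get(bold_time, default_value)
    beforebold = data.get(bold_time - datetime.timedelta(hours=1), default_value)
    beforebeforebold = data.get(bold_time - datetime.timedelta(hours=2), default_value)
print bold, beforebold, beforebeforebold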
I read your previous post before it disappeared, and then I read this one.
I find it a pity to use BeautifulSoup for your goal because, from the code I see, its use here is complicated, and the fact is that regexes run roughly 10 times faster than BeautifulSoup.
Here's the code, using only re, that furnishes the data you are interested in.
I know there will be people who say that HTML text can't be parsed by regexes. I know, I know... but I don't parse the text; I directly find the chunks of text that are interesting. The source code of this site's webpage is apparently very well structured, and there seems to be little risk of bugs. Moreover, tests and verification can be added to keep watch on the source code and be instantly informed of any changes the webmaster makes to the webpage.
import re
from httplib import HTTPConnection

hypr = HTTPConnection(host='app2.nea.gov.sg', timeout=300)

rekete = ('/anti-pollution-radiation-protection/'
          'air-pollution/psi/'
          'psi-readings-over-the-last-24-hours')
hypr.request('GET', rekete)
page = hypr.getresponse().read()

patime = ('PSI Readings.+?'
          'width="\d+%" align="center">\r\n'
          ' *<strong>Time</strong>\r\n'
          ' *</td>\r\n'
          '((?: *<td width="\d+%" align="center">'
          '<strong>\d+AM</strong>\r\n'
          ' *</td>\r\n)+.+?)'
          'width="\d+%" align="center">\r\n'
          ' *<strong>Time</strong>\r\n'
          ' *</td>\r\n'
          '((?: *<td width="\d+%" align="center">'
          '<strong>\d+PM</strong>\r\n'
          ' *</td>\r\n)+.+?)'
          'PM2.5 Concentration')
rgxtime = re.compile(patime, re.DOTALL)

patline = ('<td align="center">\r\n'
           ' *<strong>'  # next line = group 1
           '(North|South|East|West|Central|Overall Singapore)'
           '</strong>\r\n'
           ' *</td>\r\n'
           '((?: *<td align="center">\r\n'  # group 2 start
           ' *[.\d-]+\r\n'
           ' *</td>\r\n)*)'  # group 2 end
           ' *<td align="center">\r\n'
           ' *<strong style[^>]+>'
           '([.\d-]+)'  # group 3
           '</strong>\r\n'
           ' *</td>\r\n')
rgxline = re.compile(patline)

rgxnb = re.compile('<td align="center">\r\n'
                   ' *([.\d-]+)\r\n'
                   ' *</td>\r\n')

m = rgxtime.search(page)

a, b = m.span(1)  # m.group(1) contains the AM data
d = dict((mat.group(1),
          rgxnb.findall(mat.group(2)) + [mat.group(3)])
         for mat in rgxline.finditer(page[a:b]))

a, b = m.span(2)  # m.group(2) contains the PM data
for mat in rgxline.finditer(page[a:b]):
    d[mat.group(1)].extend(rgxnb.findall(mat.group(2)) + [mat.group(3)])

print 'last 3 values'
for k, v in d.iteritems():
    print '%s : %s' % (k, v[-3:])
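As a follow-up to the verification idea mentioned above, a minimal sanity check (a sketch reusing the two marker strings the patterns already rely on) could run right after the page is fetched:
# fail loudly if the page no longer contains the sections the regexes expect,
# so changes made by the webmaster are caught immediately
if 'PSI Readings' not in page or 'PM2.5 Concentration' not in page:
    raise RuntimeError('page structure changed; the patterns need updating')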