Parse the HTML Table - python

I have an HTML table that I need to parse into a CSV file.
import urllib2, datetime
olddate = datetime.datetime.strptime('5/01/13', "%m/%d/%y")
from BeautifulSoup import BeautifulSoup

print("dates,location,name,url")

def genqry(arga, argb, argc, argd):
    return arga + "," + argb + "," + argc + "," + argd

part = 1
row = 1
contenturl = "http://www.robotevents.com/robot-competitions/vex-robotics-competition"
soup = BeautifulSoup(urllib2.urlopen(contenturl).read())
table = soup.find('table', attrs={'class': 'catalog-listing'})
rows = table.findAll('tr')
for tr in rows:
    try:
        if row != 1:
            cols = tr.findAll('td')
            for td in cols:
                if part == 1:
                    keep = 0
                    dates = td.find(text=True)
                    part = 2
                if part == 2:
                    location = td.find(text=True)
                    part = 2
                if part == 3:
                    name = td.find(text=True)
                    for a in tr.findAll('a', href=True):
                        url = a['href']
                    # Compare Dates
                    if len(dates) < 6:
                        newdate = datetime.datetime.strptime(dates, "%m/%d/%y")
                        if newdate > olddate:
                            keep = 1
                        else:
                            keep = 0
                    else:
                        newdate = datetime.datetime.strptime(dates[:6], "%m/%d/%y")
                        if newdate > olddate:
                            keep = 1
                        else:
                            keep = 0
                    if keep == 1:
                        qry = genqry(dates, location, name, url)
                        print(qry)
                    row = row + 1
                    part = 1
        else:
            row = row + 1
    except (RuntimeError, TypeError, NameError):
        print("Error: " + name)
I need to be able to get every VEX event in that table that falls after 5/01/13. So far, this code gives me an error about the dates that I can't seem to fix. Maybe someone who is better at this than me can fix this code? Thanks in advance, Smith.
EDIT #1: The error that I am getting is:
ValueError: '\n10/5/13' does not match format '%m/%d/%y'
I think that I need to remove newlines at the beginning of the string first.
EDIT #2: I got it to run without any output. Any help?

Your question is very poor. Without knowing the exact error, I would guess the problem is with your if len(dates) < 6: block. Consider the following:
>>> date = '10/5/13 - 12/14/13'
>>> len(date)
18
>>> date = '11/9/13'
>>> len(date)
7
>>> date[:6]
'11/9/1'
One suggestion to make your code more Pythonic: Instead of doing row = row + 1, use enumerate.
Update: Tracing your code, I get the value of dates as follows:
>>> dates
u'\n10/5/13 - 12/14/13 \xa0\n '
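Given that value, stripping the whitespace and keeping only the first date of the range should make strptime happy. A rough sketch, based only on that one traced string (the variable names here are just for illustration):
>>> import datetime
>>> olddate = datetime.datetime.strptime('5/01/13', "%m/%d/%y")
>>> dates = u'\n10/5/13 - 12/14/13 \xa0\n '
>>> start = dates.replace(u'\xa0', ' ').strip().split(' - ')[0].strip()
>>> start
u'10/5/13'
>>> datetime.datetime.strptime(start, "%m/%d/%y") > olddate
True
And for the row counter, for row, tr in enumerate(rows, start=1): replaces the manual row = row + 1 bookkeeping.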

Related

looping through value list for multiple url requests in python

I am trying to scrape Weather Underground for years' worth of hourly data from multiple weather stations and put it into a pandas dataframe. I CANNOT use the API as there are limits on requests and I don't want to pay thousands of dollars to scrape this data.
I can get the script to scrape all of the data I want from one station. When I try to modify it so it loops through a list of stations I either get a 406 error or it returns only the data from the first station in my list. How can I loop through all the stations? Also how can I store the station name so that it can be added to the dataframe in another column?
Here is what my code looks like now:
stations = ['EGMC','KSAT','CAHR']
weather_data = []
date = []
for s in stations:
    for y in range(2014,2015):
        for m in range(1, 13):
            for d in range(1, 32):
                #check if a leap year
                if y%400 == 0:
                    leap = True
                elif y%100 == 0:
                    leap = False
                elif y%4 == 0:
                    leap = True
                else:
                    leap = False
                #check to see if dates have already been scraped
                if (m==2 and leap and d>29):
                    continue
                elif (y==2013 and m==2 and d > 28):
                    continue
                elif(m in [4, 6, 9, 11] and d > 30):
                    continue
                timestamp = str(y) + str(m) + str(d)
                print ('Getting data for ' + timestamp)
                #pull URL
                url = 'http://www.wunderground.com/history/airport/{0}/' + str(y) + '/' + str(m) + '/' + str(d) + '/DailyHistory.html?HideSpecis=1'.format(stations)
                page = urlopen(url)
                #find the correct piece of data on the page
                soup = BeautifulSoup(page, 'lxml')
                for row in soup.select("table tr.no-metars"):
                    date.append(str(y) + '/' + str(m) + '/' + str(d))
                    cells = [cell.text.strip().encode('ascii', 'ignore').decode('ascii') for cell in row.find_all('td')]
                    weather_data.append(cells)
weather_datadf = pd.DataFrame(weather_data)
datedf = pd.DataFrame(date)
result = pd.concat([datedf, weather_datadf], axis=1)
result
Here is an explanation of your error: https://httpstatuses.com/406
You should add a User-Agent to the request headers. But I think this site has some protection against crawling, so you should use more specialized tools such as Scrapy, Crawlera, a proxy list, or a user-agent rotator. A rough sketch of the header fix (and of interpolating the station into the URL) is below.
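As a sketch only, assuming Python 2's urllib2 (the same idea works via urllib.request on Python 3): the User-Agent string and the hard-coded date in the URL are placeholders, and the page layout is assumed to be the one the question scrapes.
import urllib2
from bs4 import BeautifulSoup

stations = ['EGMC', 'KSAT', 'CAHR']
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}  # example value

for s in stations:
    # Format the station code into the path; in the original code,
    # .format(stations) was applied only to the last string literal.
    url = ('http://www.wunderground.com/history/airport/{0}/2014/1/1/'
           'DailyHistory.html?HideSpecis=1'.format(s))
    request = urllib2.Request(url, headers=headers)
    page = urllib2.urlopen(request)
    soup = BeautifulSoup(page, 'lxml')
    for row in soup.select("table tr.no-metars"):
        cells = [cell.text.strip() for cell in row.find_all('td')]
        # Keeping the station code with each row makes it easy to add
        # as an extra DataFrame column later.
        print s, cells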

Python loop append replace past variable

I have this code:
fileAspect = []
id_filename = -1
for filename in glob.glob(os.path.join(path, '*.dat')):
    id_filename += 1
    f = open(filename, 'r')
    content = f.read()
    soup = BeautifulSoup(content, 'html.parser')
    all_paragraph = (soup.find_all('content'))
    result_polarity = []
    for all_sentences in all_paragraph:
        sentences = nlp.parse(all_sentences.get_text())
        for sentence in sentences['sentences']:
            for words in sentence['dependencies']:
                if(words[1]<>"ROOT"):
                    result_proccess = proccess1(myOntology, words)
                    if(result_proccess):
                        result_polarity.append(words)
                        ini+=1
                    else:
                        out+=1
    ont = ''
    ont = myOntology
    for idx_compare1 in range(len(ont)):
        for idy in range(len(result_polarity)):
            for idw in range(len(result_polarity[idy])):
                if(result_polarity[idy][1]<>"ROOT" and idw<>0):
                    try:
                        closest = model.similarity(ont[idx_compare1][0], result_polarity[idy][idw])
                        if closest >= 0.1 :
                            valpolar = valuePol(result_polarity[idy][idw])
                            if(valpolar==1):
                                ont[idx_compare1][2]+=1
                            elif(valpolar==2):
                                ont[idx_compare1][3]+=1
                            tp += 1
                            break
                    except Exception as inst:
                        fn += 1
                        break
    print "^^^^^^^^^^^^^"
    print id_filename
    print ont
    fileAspect.append(ont)
print 'overall'
#print fileAspect
for aspek in fileAspect:
    print aspek
    print "---"
Why does the result replace the previous data?
Here is a simple example of what I expect:
a = []
a.append('q')
a.append('w')
a.append('e')
print a
The result is:
['q','w','e']
But with my code above I get:
['e','e','e']
So the problem is: when I append to the list "fileAspect", it doesn't just append, it also replaces the previously appended entries. With the simple code above it works fine, but in my real code it doesn't, and I am very confused. I have been trying for more than an hour.
Thanks, any help is very much appreciated.
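For what it's worth, a likely cause, shown here as a minimal sketch with made-up data rather than the actual ontology: ont = myOntology never copies the list, so every iteration appends another reference to the same object, and the in-place updates show up in every entry.
import copy

myOntology = [['aspect', 'x', 0, 0]]
file_aspect = []
file_aspect_copied = []

for i in range(3):
    ont = myOntology                                # same list object every time
    ont[0][2] += 1                                  # updates myOntology in place
    file_aspect.append(ont)                         # appends another reference to it
    file_aspect_copied.append(copy.deepcopy(ont))   # appends an independent snapshot

print file_aspect         # every entry shows the final counts, like ['e','e','e']
print file_aspect_copied  # one distinct snapshot per iteration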

Why is my code looping at just one line in the while loop instead of over the whole block?

Sorry for the unsophisticated question title but I need help desperately:
My objective at work is to create a script that pulls all the records from the ExactTarget (Salesforce Marketing Cloud) API. I have successfully set up the API calls and successfully imported the data into DataFrames.
The problem I am running into is two-fold: I need to keep pulling records until "Results_Message" in my code stops reading "MoreDataAvailable", and I need to set up logic that allows me to control the date either from within the API call or by parsing the DataFrame.
My code is getting stuck at line 44, where "print Results_Message" keeps looping on the string "MoreDataAvailable".
Here is my code so far; on lines 94 and 95 you will see my attempt at parsing the date directly from the DataFrame, but no luck, and no luck on line 32 where I have specified the date:
import ET_Client
import pandas as pd

AggreateDF = pd.DataFrame()
Data_Aggregator = pd.DataFrame()
#Start_Date = "2016-02-20"
#End_Date = "2016-02-25"
#retrieveDate = '2016-07-25T13:00:00.000'
Export_Dir = 'C:/temp/'
try:
    debug = False
    stubObj = ET_Client.ET_Client(False, debug)
    print '>>>BounceEvents'
    getBounceEvent = ET_Client.ET_BounceEvent()
    getBounceEvent.auth_stub = stubObj
    getBounceEvent.search_filter = {'Property' : 'EventDate','SimpleOperator' : 'greaterThan','Value' : '2016-02-22T13:00:00.000'}
    getResponse1 = getBounceEvent.get()
    ResponseResultsBounces = getResponse1.results
    Results_Message = getResponse1.message
    print(Results_Message)
    #EventDate = "2016-05-09"
    print "This is orginial " + str(Results_Message)
    #print ResponseResultsBounces
    i = 1
    while (Results_Message == 'MoreDataAvailable'):
        #if i > 5: break
        print Results_Message
        results1 = getResponse1.results
        #print(results1)
        i = i + 1
        ClientIDBounces = []
        partner_keys1 = []
        created_dates1 = []
        modified_date1 = []
        ID1 = []
        ObjectID1 = []
        SendID1 = []
        SubscriberKey1 = []
        EventDate1 = []
        EventType1 = []
        TriggeredSendDefinitionObjectID1 = []
        BatchID1 = []
        SMTPCode = []
        BounceCategory = []
        SMTPReason = []
        BounceType = []
        for BounceEvent in ResponseResultsBounces:
            ClientIDBounces.append(str(BounceEvent['Client']['ID']))
            partner_keys1.append(BounceEvent['PartnerKey'])
            created_dates1.append(BounceEvent['CreatedDate'])
            modified_date1.append(BounceEvent['ModifiedDate'])
            ID1.append(BounceEvent['ID'])
            ObjectID1.append(BounceEvent['ObjectID'])
            SendID1.append(BounceEvent['SendID'])
            SubscriberKey1.append(BounceEvent['SubscriberKey'])
            EventDate1.append(BounceEvent['EventDate'])
            EventType1.append(BounceEvent['EventType'])
            TriggeredSendDefinitionObjectID1.append(BounceEvent['TriggeredSendDefinitionObjectID'])
            BatchID1.append(BounceEvent['BatchID'])
            SMTPCode.append(BounceEvent['SMTPCode'])
            BounceCategory.append(BounceEvent['BounceCategory'])
            SMTPReason.append(BounceEvent['SMTPReason'])
            BounceType.append(BounceEvent['BounceType'])
        df1 = pd.DataFrame({'ClientID': ClientIDBounces, 'PartnerKey': partner_keys1,
                            'CreatedDate' : created_dates1, 'ModifiedDate': modified_date1,
                            'ID':ID1, 'ObjectID': ObjectID1,'SendID':SendID1,'SubscriberKey':SubscriberKey1,
                            'EventDate':EventDate1,'EventType':EventType1,'TriggeredSendDefinitionObjectID':TriggeredSendDefinitionObjectID1,
                            'BatchID':BatchID1,'SMTPCode':SMTPCode,'BounceCategory':BounceCategory,'SMTPReason':SMTPReason,'BounceType':BounceType})
        #print df1
        #df1 = df1[(df1.EventDate > "2016-02-20") & (df1.EventDate < "2016-02-25")]
        #AggreateDF = AggreateDF[(AggreateDF.EventDate > Start_Date) and (AggreateDF.EventDate < End_Date)]
        print(df1['ID'].max())
        AggreateDF = AggreateDF.append(df1)
        print(AggreateDF.shape)
    #df1 = df1[(df1.EventDate > "2016-02-20") and (df1.EventDate < "2016-03-25")]
    #AggreateDF = AggreateDF[(AggreateDF.EventDate > Start_Date) and (AggreateDF.EventDate < End_Date)]
    print("Final Aggregate DF is: " + str(AggreateDF.shape))
    #EXPORT TO CSV
    AggreateDF.to_csv(Export_Dir +'DataTest1.csv')
    #with pd.option_context('display.max_rows',10000):
    #print (df_masked1.shape)
    #print df_masked1
except Exception as e:
    print 'Caught exception: ' + str(e.message)
    print e
Before my code parses the data, the original format of the data is a SOAP response; this is what it looks like (below). Is it possible to filter records directly based on EventDate from the SOAP response?
}, (BounceEvent){
Client =
(ClientID){
ID = 1111111
}
PartnerKey = None
CreatedDate = 2016-05-12 07:32:20.000937
ModifiedDate = 2016-05-12 07:32:20.000937
ID = 1111111
ObjectID = "1111111"
SendID = 1111111
SubscriberKey = "aaa#aaaa.com"
EventDate = 2016-05-12 07:32:20.000937
EventType = "HardBounce"
TriggeredSendDefinitionObjectID = "aa111aaa"
BatchID = 1111111
SMTPCode = "1111111"
BounceCategory = "Hard bounce - User Unknown"
SMTPReason = "aaaa"
BounceType = "immediate"
Hope this makes sense; this is my desperate plea for help.
Thank you in advance!
You don't seem to be updating Results_Message in your loop, so it's always going to have the value it gets in line 29: Results_Message = getResponse1.message. Unless there's code involved that you didn't share, that is.
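If you are on the standard FuelSDK-Python client, the usual pattern is to request the next page and re-read the status inside the loop. A rough sketch only (check that getMoreResults() exists on your ET_Client version; this is not tested against your account):

getResponse1 = getBounceEvent.get()
Results_Message = getResponse1.message

while Results_Message == 'MoreDataAvailable':
    # ... build df1 from getResponse1.results and append it to AggreateDF ...
    getResponse1 = getBounceEvent.getMoreResults()  # fetch the next page
    Results_Message = getResponse1.message          # update the loop condition

For the date window, the greaterThan filter you already pass in search_filter is the place to control it server-side. If you trim the aggregated frame afterwards instead, use a boolean mask with & rather than and, e.g. AggreateDF[(AggreateDF.EventDate > Start_Date) & (AggreateDF.EventDate < End_Date)].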

Convert str to float explicitly in Python 3

I am getting a "TypeError: Can't convert 'float' object to str implicitly" error because I am trying to divide a float by a string.
I am trying to cast the string to a float, but am still getting an error.
The string 'empNumber' is all digits but has a comma (e.g. 112,000), hence the "replace" call to strip the comma. I am getting the error when I try to divide "final/decimal". How can I fix this type error?
def revPerEmployee():
    for ticker in sp500short:
        searchurl = "http://finance.yahoo.com/q/ks?s="+ticker
        f = urlopen(searchurl)
        html = f.read()
        soup = BeautifulSoup(html, "html.parser")
        searchurlemp = "http://finance.yahoo.com/q/pr?s="+ticker+"+Profile"
        femp = urlopen(searchurlemp)
        htmlemp = femp.read()
        soupemp = BeautifulSoup(htmlemp, "html.parser")
        try:
            revenue2 = soup.find("td", text="Revenue (ttm):").find_next_sibling("td").text
            empCount2 = soupemp.find("td", text="Full Time Employees:").find_next_sibling("td").text
        except:
            revenue2 = "There is no data for this company"
            empCount2 = "There is no data for this company"
        if revenue2 == "There is no data for this company" or empCount2 == "There is no data for this company":
            lastLetter = ticker+": There is no data for this company"
        else:
            lastLetter = revenue2[len(revenue2)-1:len(revenue2)]
            empNumber = empCount2.replace(",", "")
            decimal = float(empNumber)
        if lastLetter == "B":
            result = revenue2[:-1]
            revNum = float(result)
            final = revNum * 1000000000.0
            revPerEmp = final/decimal
            print(ticker+": "+revPerEmp)
        elif lastLetter == "M":
            result = revenue2[:-1]
            revNum = float(result)
            final = revNum * 1000000.0
            #newnum = "{:0,.2f}".format(final)
            revPerEmp = final/decimal
            print(ticker+": "+revPerEmp)
        elif lastLetter == "K":
            result = revenue2[:-1]
            revNum = float(result)
            final = revNum * 1000.0
            #newnum = "{:0,.2f}".format(final)
            revPerEmp = final/decimal
            print(ticker+": "+revPerEmp)
        else:
            print(lastLetter)
17 + "orange" is nonsense, you can't add numbers and strings. You want
print("%s: %s" % (ticker, revPerEmp))
(you can switch %s for other formats, like %.2f), or
print(str(ticker) + ": " + str(revPerEmp))
The problem is that your program assumes that what is obtained from the URL request is a number in the form of digits followed by a suffix (K, M or B). This is not tested for.
Here are also two suggestions to improve your code. First, you already use a try ... except clause to detect when data cannot be obtained; you can also use it when conversion fails. The message "There is no data for this company" could be printed in the except clause.
Second, you have three if clauses very much alike, suggesting they can be condensed. A python dictionary can be used for the suffix values.
SUFFIX_VALUES = { 'K': 1000.0, 'M': 1000000.0, 'B': 1000000000.0 }

try:
    # taken from your code
    revenue2 = soup.find("td", text="Revenue (ttm):").find_next_sibling("td").text
    empCount2 = soupemp.find("td", text="Full Time Employees:").find_next_sibling("td").text
    revNum = float(revenue2[:-1])
    empNumber = empCount2.replace(",", "")
    decimal = float(empNumber)
    lastLetter = revenue2[-1]
    final = revNum * SUFFIX_VALUES[lastLetter]
    revPerEmp = final/decimal
    print("%s: %d" % (ticker, revPerEmp))
except:
    print(ticker + ": There is no data for this company")
Now, if data is missing from the URL request, if conversion fails, or if the suffix is wrong, the program will execute the except clause.

SyntaxError related to line.rfind

In the past I was using line.rfind to find a fixed variable and my script worked fine. However, now that I am trying to use line.rfind to find a changing variable, I am getting a syntax error for a line of code that used to work. Here is the code I have.
#!usr/bin/env python
import urllib
from datetime import datetime
from datetime import timedelta

date = datetime.now()
date1 = date + timedelta(days=1)

class city :
    def __init__(self, city_name, link, latitude, longitude) :
        self.name = city_name
        self.url = link
        self.low0 = 0
        self.high1 = 0
        self.high2 = 0
        self.low1 = 0
        self.low2 = 0
        self.lat = latitude
        self.long = longitude

    def retrieveTemps(self) :
        filehandle = urllib.urlopen(self.url)
        # get lines from result into array
        lines = filehandle.readlines()
        # (for each) loop through each line in lines
        line_number = 0 # a counter for line number
        for line in lines:
            line_number = line_number + 1 # increment counter
            # find string, position otherwise position is -1
            position0 = line.rfind('title="{}"'.format(date1.strftime("%A"))
            # string is found in line
            if position0 > 0 :
                self.low0 = lines[line_number + 4].split('&')[0].split('>')[-1]
The error I am getting says...
if position0 > 0 :
^
SyntaxError: invalid syntax
Any ideas on what is wrong? I assume it is somehow related to the change I made in this line...
position0 = line.rfind('title="{}"'.format(date1.strftime("%A"))
Thank you for your help!
You simply forgot a closing bracket ')'. Change it to:
position0 = line.rfind('title="{}"'.format(date1.strftime("%A")))
