I'm not able to filter the results of table[3] to only include rows that have today's date in them. I'm using this URL as my data source:
http://tides.mobilegeographics.com/locations/3881.html
I can get all the data back, but my filtering isn't working: I get the entire range, five days back. I only want something like this (the current day):
Montauk Point, Long Island Sound, New York
41.0717° N, 71.8567° W
2014-03-13 12:37 PM EDT 0.13 feet Low Tide
2014-03-13 6:51 PM EDT Sunset
2014-03-13 7:13 PM EDT 2.30 feet High Tide
How can I get this, and then calculate whether the tide is moving in or out within the next 40 minutes?
Thanks for helping.
My code is:
import sre, urllib2, sys, BaseHTTPServer, datetime, re, time, pprint, smtplib
from bs4 import BeautifulSoup
from bs4.diagnose import diagnose
data = urllib2.urlopen('http://tides.mobilegeographics.com/locations/3881.html').read()
day = datetime.date.today().day
month = datetime.date.today().month
year = datetime.date.today().year
date = datetime.date.today()
soup = BeautifulSoup(data)
keyinfo = soup.find_all('h2')
str_date = datetime.date.today().strftime("%Y-%m-%d")
time_text = datetime.datetime.now() + datetime.timedelta(minutes = 20)
t_day = time_text.strftime("%Y-%m-%d")
tide_table = soup.find_all('table')[3]
pre = tide_table.findAll('pre')
dailytide = []
pattern = str_date
allmatches = re.findall(r'pattern', pre)
print allmatches
if allmatches:
    print allmatches
else:
    print "Match for " + str_date + " not found in data string \n" + datah
You don't need a regular expression; just split the contents of the pre tag and check whether today's date is in each line:
import urllib2
import datetime
from bs4 import BeautifulSoup
URL = 'http://tides.mobilegeographics.com/locations/3881.html'
soup = BeautifulSoup(urllib2.urlopen(URL))
pre = soup.find_all('table')[3].find('pre').text
today = datetime.date.today().strftime("%Y-%m-%d")
for line in pre.split('\n'):
    if today in line:
        print line
prints:
2014-03-13 6:52 PM EDT Sunset
2014-03-13 7:13 PM EDT 2.30 feet High Tide
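The question also asks how to tell whether the tide is turning within the next 40 minutes. One way is to parse each tide line's timestamp and test whether it falls inside that window; here is a minimal sketch building on the loop above (the column layout of the tide lines is an assumption based on the sample output):
import datetime
now = datetime.datetime.now()
cutoff = now + datetime.timedelta(minutes=40)
for line in pre.split('\n'):
    if today in line and 'Tide' in line:
        # the first three fields look like: 2014-03-13  7:13  PM
        when = datetime.datetime.strptime(' '.join(line.split()[:3]), '%Y-%m-%d %I:%M %p')
        if now <= when <= cutoff:
            # the water is rising toward a "High Tide" line and falling toward a "Low Tide" line
            print 'Tide turns within 40 minutes:', line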
Related
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Mon Aug 14 2017
Modified on Wed Aug 16 2017
Author: Yanfei Wu
Get the past 5 years of S&P 500 stock data
"""
from bs4 import BeautifulSoup
import requests
from datetime import datetime
import pandas as pd
import pandas_datareader.data as web
def get_ticker_and_sector(url='https://en.wikipedia.org/wiki/List_of_S%26P_500_companies'):
    """
    get the s&p 500 stocks from Wikipedia:
    https://en.wikipedia.org/wiki/List_of_S%26P_500_companies
    ---
    return: a dictionary with ticker names as keys and sectors as values
    """
    r = requests.get(url)
    data = r.text
    soup = BeautifulSoup(data, 'lxml')
    # we only want to parse the first table of this wikipedia page
    table = soup.find('table')
    sp500 = {}
    # loop over the rows and get ticker symbol and sector name
    for tr in table.find_all('tr')[1:]:
        tds = tr.find_all('td')
        ticker = tds[0].text
        sector = tds[3].text
        sp500[ticker] = sector
    return sp500
def get_stock_data(ticker, start_date, end_date):
    """ get stock data from google with stock ticker, start and end dates """
    data = web.DataReader(ticker, 'google', start_date, end_date)
    return data
if __name__ == '__main__':
    """ get the stock data from the past 5 years """
    # end_date = datetime.now()
    end_date = datetime(2017, 8, 14)
    start_date = datetime(end_date.year - 5, end_date.month, end_date.day)
    sp500 = get_ticker_and_sector()
    sp500['SPY'] = 'SPY'  # also include SPY as reference
    print('Total number of tickers (including SPY): {}'.format(len(sp500)))
    bad_tickers = []
    for i, (ticker, sector) in enumerate(sp500.items()):
        try:
            stock_df = get_stock_data(ticker, start_date, end_date)
            stock_df['Name'] = ticker
            stock_df['Sector'] = sector
            if stock_df.shape[0] == 0:
                bad_tickers.append(ticker)
            #output_name = ticker + '_data.csv'
            #stock_df.to_csv(output_name)
            if i == 0:
                all_df = stock_df
            else:
                all_df = all_df.append(stock_df)
        except:
            bad_tickers.append(ticker)
    print(bad_tickers)
    all_df.to_csv('./data/all_sp500_data_2.csv')

    """ Write failed queries to a text file """
    if len(bad_tickers) > 0:
        with open('./data/failed_queries_2.txt', 'w') as outfile:
            for ticker in bad_tickers:
                outfile.write(ticker + '\n')
Your problem is in your try/except block. It is good style to always catch a specific exception, not to blindly slap a bare except onto a long block of code. The problem with that approach, as your question demonstrates, is that when an unrelated or unexpected error occurs, you won't know about it. In this case, this is the exception I get from running your code:
NotImplementedError: data_source='google' is not implemented
I'm not sure what that means, but it looks like the pandas_datareader.data.DataReader docs have good information about how to use that DataReader correctly.
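For example, the loop body could catch named exceptions instead; a minimal sketch, assuming RemoteDataError (the exception pandas_datareader raises for failed queries) lives at the import path shown:
from pandas_datareader._utils import RemoteDataError  # assumed import path
try:
    stock_df = get_stock_data(ticker, start_date, end_date)
except (RemoteDataError, NotImplementedError) as exc:
    # a named exception surfaces with a message instead of disappearing silently
    print('Failed query for {}: {}'.format(ticker, exc))
    bad_tickers.append(ticker)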
I need to repeatedly call a web URL that has timestamps at the end.
Example URL:
'https://mywebApi/StartTime=2019-05-01%2000:00:00&&endTime=2019-05-01%2003:59:59'
StartTime=2019-05-01%2000:00:00
is the URL representation of the time 2019-05-01 00:00:00, and
endTime=2019-05-01%2003:59:59
is the URL representation of the time 2019-05-01 03:59:59.
The requirement is to make repeated calls with a 4-hour window. While adding 4 hours, the date may change.
Is there a lean way to generate the URL string? Something like:
baseUrl = 'https://mywebApi/StartTime='
startTime = DateTime(2018-05-03 00:01:00)
terminationTime = DateTime(2019-05-03 00:05:00)
while (startTime < terminationTime):
    endTime = startTime + hours(4)
    url = baseUrl + str(startTime) + "&&endTime=" + str(endTime)
    # request get url
    startTime = startTime + hours(1)
You can use datetime.timedelta together with the strftime function as follows:
from datetime import datetime, timedelta

baseUrl = 'https://mywebApi/StartTime='
startTime = datetime(year=2018, month=5, day=3, hour=0, minute=1, second=0)
terminationTime = datetime(year=2018, month=5, day=3, hour=3, minute=59, second=59)

while startTime < terminationTime:
    endTime = startTime + timedelta(hours=4)
    # strftime has no directive for a literal %20, so encode the space afterwards
    fmt = "%Y-%m-%d %H:%M:%S"
    url = (baseUrl + startTime.strftime(fmt).replace(' ', '%20')
           + "&&endTime=" + endTime.strftime(fmt).replace(' ', '%20'))
    # request get url
    startTime = endTime
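With these start values, the first URL the loop produces would look like this (illustrative): https://mywebApi/StartTime=2018-05-03%2000:01:00&&endTime=2018-05-03%2004:01:00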
The following link is useful: https://www.guru99.com/date-time-and-datetime-classes-in-python.html, or you can look at the official datetime documentation.
Edit: using what u/John Gordan said to declare the initial dates.
A web scraper written in Python extracts water level data, one reading per hour.
When written to a .txt file using the code below, each line is appended with a datetime, so each line takes up something like 20 characters.
Example: "01/01-2010 11:10,-32"
Running the code below results in a file containing data from 01/01-2010 00:10 to 28/02-2010 23:50, which is roughly 60 days. Sixty days with one reading per hour should give about 1440 lines and approx. 30000 characters. Microsoft Word, however, tells me the file contains 830000 characters on 42210 lines, which fits very well with the observed file size of 893 kB.
Apparently some lines and characters are hiding somewhere, and I can't seem to find them.
import requests
import time
from datetime import timedelta, date
from bs4 import BeautifulSoup

totaldata = []
filnavn = 'Vandstandsdata_Esbjerg_KDI_TEST_recode.txt'  # Danish: "filename"
file = open(filnavn, 'w')
file.write("")
file.close()

def daterange(start_date, end_date):
    for n in range(int((end_date - start_date).days)):
        yield start_date + timedelta(n)

start_date = date(2010, 1, 1)
end_date = date(2010, 3, 1)
values = []
datoer = []  # Danish: "dates"
for single_date in daterange(start_date, end_date):
    valuesTemp = []
    datoerTemp = []
    dato = single_date.strftime("%d-%m-%y")
    url = "http://kysterne.kyst.dk/pages/10852/waves/showData.asp?targetDay=" + dato + "&ident=6401&subGroupGuid=16410"
    page = requests.get(url)
    if page.status_code == 200:
        soup = BeautifulSoup(page.content, 'html.parser')
        dataliste = list(soup.find_all(class_="dataTable"))
        print(url)
        dataliste = str(dataliste)
        dataliste = dataliste.splitlines()
        dataliste = dataliste[6:]
        for e in range(0, len(dataliste), 4):
            datoerTemp.append(dataliste[e])
        for e in range(1, len(dataliste), 4):
            valuesTemp.append(dataliste[e])
        for e in valuesTemp:
            e = e[4:]
            e = e[:-5]
            values.append(e)
        for e in datoerTemp:
            e = e[4:]
            e = e[:-5]
            datoer.append(e)
        file = open(filnavn, 'a')
        for i in range(0, len(datoer), 6):
            file.write(datoer[i] + "," + values[i] + "\n")
        print("- skrevet til fil\n")  # Danish: "written to file"
        file.close()
print("done")
Ah, eureka.
Seconds before posting this question I realized I had forgotten to reset the list.
I added:
datoer = []
at the top of the daily loop, and everything now works as intended.
The old code wrote a given day's data plus all data from all previous days, on every pass through the loop.
I hope someone can use this newbie experience.
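In code terms, the reset belongs at the top of the daily loop, so each append pass writes only that day's rows; a sketch of the relevant lines:
for single_date in daterange(start_date, end_date):
    datoer = []      # reset the date list for each day
    values = []      # reset the value list too, so the two lists stay aligned
    valuesTemp = []
    datoerTemp = []
    # ... rest of the loop body unchanged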
I'm sure I'm missing something quite trivial here. I have the below code:
date1 = re.findall('0000(.*)', date1.encode('utf-8'))
str1 = '-'.join(date1)
print str1
print type(str1)
dt = datetime.strptime(str1,"%B %d, %Y ")
and I get an error of
ValueError: time data '' does not match format '%B %d, %Y '
It seems as if str1 is empty, so I checked it with:
print str1
print type(str1)
and got the following results:
October 24, 2014
<type 'str'>
I can't work out why it thinks str1 is empty. Any ideas?
Appended full code:
from bs4 import BeautifulSoup
import wikipedia
import re
from datetime import datetime
acq = wikipedia.page('List_of_mergers_and_acquisitions_by_Google')
test = acq.html()
#print test
##html = acq.html()
soup = BeautifulSoup(test)
table = soup.find('table', {'class' : 'wikitable sortable'})
company = ""
date1 = ""
for row in table.findAll('tr'):
    cells = row.findAll('td')
    if len(cells) == 8:
        date1 = cells[1].get_text()
        company = cells[2].get_text()
        ##print date
        date1 = re.findall('0000(.*)', date1)
        str1 = ''.join(date1)
        print str1
        print type(str1)
        dt = datetime.strptime(str1, "%B %d, %Y ")
I am required to extract the time of day from the datetime.datetime object returned by the created_at attribute, but how can I do that?
This is my code for getting the datetime.datetime object.
from datetime import *
import tweepy
consumer_key = ''
consumer_secret = ''
access_token = ''
access_secret = ''
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)
api = tweepy.API(auth)
tweets = tweepy.Cursor(api.home_timeline).items(limit = 2)
t1 = datetime.strptime('Wed Jun 01 12:53:42 +0000 2011', '%a %b %d %H:%M:%S +0000 %Y')
for tweet in tweets:
    print (tweet.created_at - t1)
    t1 = tweet.created_at
I need to only extract the hour and minutes from t1.
I don't know how you want to format it, but you can do:
print("Created at %s:%s" % (t1.hour, t1.minute))
for example.
If the time is 11:03, then afrendeiro's answer will print 11:3.
You could zero-pad the minutes:
"Created at {:d}:{:02d}".format(tdate.hour, tdate.minute)
Or go another way and use tdate.time() and only take the hour/minute part:
str(tdate.time())[0:5]
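A quick check of both approaches with an illustrative value:
from datetime import datetime
tdate = datetime(2014, 3, 13, 11, 3)
print("Created at {:d}:{:02d}".format(tdate.hour, tdate.minute))  # Created at 11:03
print(str(tdate.time())[0:5])  # 11:03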
import datetime
YEAR = datetime.date.today().year # the current year
MONTH = datetime.date.today().month # the current month
DATE = datetime.date.today().day # the current day
HOUR = datetime.datetime.now().hour # the current hour
MINUTE = datetime.datetime.now().minute # the current minute
SECONDS = datetime.datetime.now().second  # the current second
print(YEAR, MONTH, DATE, HOUR, MINUTE, SECONDS)
2021 3 11 19 20 57
It's easier to use the Unix timestamp for these things, since Tweepy gets both:
import datetime
# assuming t1 holds a Unix timestamp rather than a datetime object
print(datetime.datetime.fromtimestamp(int(t1)).strftime('%H:%M'))
datetime has fields hour and minute, so to get the hours and minutes you would use t1.hour and t1.minute.
However, when you subtract two datetimes, the result is a timedelta, which only has days, seconds, and microseconds fields. So you'll need to divide and multiply as necessary to get the numbers you need.
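For instance, a minimal sketch of recovering hours and minutes from a timedelta, with illustrative values:
from datetime import datetime
t1 = datetime(2011, 6, 1, 12, 53, 42)
t2 = datetime(2011, 6, 1, 15, 20, 2)
delta = t2 - t1
# a timedelta stores only days, seconds and microseconds, so derive the rest
hours, remainder = divmod(delta.days * 86400 + delta.seconds, 3600)
minutes = remainder // 60
print("%d:%02d" % (hours, minutes))  # 2:26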