Python Google Search: Hits within date range are inaccurate - python

I've been trying to write code to scrape the number of hits within a certain date range on google. I've done this by inserting the date into the google search query. When I copy and paste the link it produces, it gives me the correct query, but when the code runs it, I keep getting the number of hits for the search without the date range. I'm not sure what I'm doing wrong here.
from bs4 import BeautifulSoup
import requests
import re
from datetime import date, timedelta
day = date.today()
friday = day - timedelta(days=day.weekday() + 3) + timedelta(days=7)
word = "debt"
for n in range(0,32,7):
    date_end = friday - timedelta(days=n)
    date_beg = date_end - timedelta(days=4)
    link_beg = "https://www.google.com/search?q=%s&source=lnt&tbs=cdr%%3A1%%2Ccd_min%%3A" % (word)
    link_date = "%s%%2F%s%%2F%s%%2Ccd_max%%3A%s%%2F%s%%2F%s&tbm=&gws_rd=ssl" % (str(date_beg.month),str(date_beg.day),str(date_beg.year),str(date_end.month),str(date_end.day),str(date_end.year))
    url = link_beg + link_date
    print url,
    print "\t",
    r = requests.get(url)
    soup = BeautifulSoup(r.content)
    products = soup.findAll("div", id = "resultStats")
    result = str(products[0])
    results = re.findall(r'\d+', result)
    number = ''.join([str(i) for i in results])
    print number
For example, one of the links that is produced is this:
Google Search for "debt" in date range "3/9/2015 to 3/13/2015"
The hits produced should be: 39,700,000
But instead, it spits out: 293,000,000 (which is what just a generic search produces)

Google's date-range-limited search relies on Julian dates, i.e. the range must be specified as Julian day numbers. Perhaps you realized this already. The query takes this form:
cute kitties daterange:[some Julian date]-[another Julian date] (without brackets).
There are web pages that convert dates to Julian, or you can use the jDate Python script or the jday shell script.
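As a rough sketch of the conversion this answer describes, a calendar date can be turned into a Julian day number with the standard library alone. The helper names `julian_day_number` and `daterange_query` are made up for illustration, and whether Google still honours the `daterange:` operator is an assumption taken from the answer, not something verified here.

```python
from datetime import date

# Offset between Python's proleptic Gregorian ordinal (date.toordinal())
# and the Julian day number of the same calendar date.
ORDINAL_TO_JDN = 1721425


def julian_day_number(d):
    """Return the integer Julian day number for a calendar date."""
    return d.toordinal() + ORDINAL_TO_JDN


def daterange_query(word, start, end):
    """Build a 'word daterange:start-end' search string (hypothetical helper)."""
    return "%s daterange:%d-%d" % (
        word,
        julian_day_number(start),
        julian_day_number(end),
    )


# The date range from the question: 3/9/2015 to 3/13/2015.
print(daterange_query("debt", date(2015, 3, 9), date(2015, 3, 13)))
```

The resulting string would still need to be URL-encoded before being placed in the `q` parameter, for example by passing it to `requests.get` through `params`.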

Related

Using datetime to evaluate if a given time is already in the past

This is my current code:
import requests
import json
res = requests.get("http://transport.opendata.ch/v1/connections?from=Baldegg_kloster&to=Luzern&fields[]=connections/from/prognosis/departure")
parsed_json = res.json()
time_1 = parsed_json['connections'][0]['from']['prognosis']
time_2 = parsed_json['connections'][1]['from']['prognosis']
time_3 = parsed_json['connections'][2]['from']['prognosis']
The JSON data looks like this:
{
"connections": [
{"from": {"prognosis": {"departure": "2018-08-04T14:21:00+0200"}}},
{"from": {"prognosis": {"departure": "2018-08-04T14:53:00+0200"}}},
{"from": {"prognosis": {"departure": "2018-08-04T15:22:00+0200"}}},
{"from": {"prognosis": {"departure": "2018-08-04T15:53:00+0200"}}}
]
}
time_1, time_2 and time_3 each contain a different time at which the train departs. I want to check whether time_1 is already in the past, so that time_2 becomes the relevant time. My idea is to get the current time with datetime.now and then use if / elif to check whether time_1 is earlier than datetime.now. I am new to coding, so I am unsure if this is a good way of doing it. Would this work, and are there any better ways?
PS: I am planning to make a display that displays the time the next train leaves. Therefore, it would have to check if the time is still relevant over and over again.
The following code extracts all the departure time strings from the JSON data, and converts the valid time strings to datetime objects. It then prints the current time, and then a list of the departure times that are still in the future.
Sometimes the converted JSON has None for a departure time, so we need to deal with that. And we need to get the current time as a timezone-aware object. We could just use the UTC timezone, but it's more convenient to use the local timezone from the JSON data.
import json
from datetime import datetime
import requests
url = "http://transport.opendata.ch/v1/connections?from=Baldegg_kloster&to=Luzern&fields[]=connections/from/prognosis/departure"
res = requests.get(url)
parsed_json = res.json()
# Extract all the departure time strings from the JSON data
time_strings = [d["from"]["prognosis"]["departure"]
    for d in parsed_json["connections"]]
#print(time_strings)
# The format string to parse ISO 8601 date + time strings
iso_format = "%Y-%m-%dT%H:%M:%S%z"
# Convert the valid time strings to datetime objects
times = [datetime.strptime(ts, iso_format)
    for ts in time_strings if ts is not None]
# Grab the timezone info from the first time
tz = times[0].tzinfo
# The current time, using the same timezone
nowtime = datetime.now(tz)
# Get rid of the microseconds
nowtime = nowtime.replace(microsecond=0)
print('Now', nowtime)
# Print the times that are still in the future
for i, t in enumerate(times):
    if t > nowtime:
        diff = t - nowtime
        print('{}. {} departing in {}'.format(i, t, diff))
output
Now 2018-08-04 17:17:25+02:00
1. 2018-08-04 17:22:00+02:00 departing in 0:04:35
2. 2018-08-04 17:53:00+02:00 departing in 0:35:35
3. 2018-08-04 18:22:00+02:00 departing in 1:04:35
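For the display mentioned in the question, which has to re-check the times over and over, the filtering step can be wrapped in a small function that is called on every refresh. This is only a sketch: the name `next_departure` and the fixed "current time" are assumptions made for illustration, and the sample strings are copied from the JSON in the question with a None entry added.

```python
from datetime import datetime, timedelta, timezone

# Format string for timestamps such as "2018-08-04T14:21:00+0200"
ISO_FORMAT = "%Y-%m-%dT%H:%M:%S%z"


def next_departure(time_strings, now):
    """Return the earliest departure strictly after `now`, or None.

    None entries in `time_strings` are skipped. `now` must be
    timezone-aware so it can be compared with the parsed times.
    """
    parsed = (
        datetime.strptime(ts, ISO_FORMAT)
        for ts in time_strings
        if ts is not None
    )
    upcoming = [t for t in parsed if t > now]
    return min(upcoming) if upcoming else None


sample = [
    "2018-08-04T14:21:00+0200",
    None,
    "2018-08-04T14:53:00+0200",
    "2018-08-04T15:22:00+0200",
]
# A fixed "current time" in UTC+2 keeps the example reproducible;
# a real display would pass datetime.now(tz) instead.
fixed_now = datetime(2018, 8, 4, 14, 30, tzinfo=timezone(timedelta(hours=2)))
print(next_departure(sample, fixed_now))
```

Returning None when nothing is left lets the caller decide whether to fetch fresh data or show a "no departures" message.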
That query URL is a bit ugly, and not convenient if you want to check on other stations. It's better to let requests build the query URL for you from a dictionary of parameters. And we should also check that the request was successful, which we can do with the raise_for_status method.
Just replace the top section of the script with this:
import json
from datetime import datetime
import requests
endpoint = "http://transport.opendata.ch/v1/connections"
params = {
    "from": "Baldegg_kloster",
    "to": "Luzern",
    "fields[]": "connections/from/prognosis/departure",
}
res = requests.get(endpoint, params=params)
res.raise_for_status()
parsed_json = res.json()
If you've never used enumerate before, it can be a little confusing at first. Here's a brief demo of three different ways to loop over a list of items and print each item and its index number.
things = ['zero', 'one', 'two', 'three']
for i, word in enumerate(things):
    print(i, word)
for i in range(len(things)):
    word = things[i]
    print(i, word)
i = 0
while i < len(things):
    word = things[i]
    print(i, word)
    i = i + 1
I may not have understood your question fully, but I think you are trying to compare two times.
First let's see the contents of time_1:
{'departure': '2018-08-04T15:24:00+0200'}
So add the departure key to access the time string. To parse the date and time string into a Python datetime object, we use the datetime.strptime() method. See the datetime documentation for a further description of datetime.strptime().
The modified version of your code that does the time comparison:
import requests
import json
from datetime import datetime
res = requests.get("http://transport.opendata.ch/v1/connections?from=Baldegg_kloster&to=Luzern&fields[]=connections/from/prognosis/departure")
parsed_json = res.json()
time_1 = parsed_json['connections'][0]['from']['prognosis']['departure']
time_2 = parsed_json['connections'][1]['from']['prognosis']['departure']
time_3 = parsed_json['connections'][2]['from']['prognosis']['departure']
mod_time_1 = datetime.strptime(time_1,'%Y-%m-%dT%H:%M:%S%z')
mod_time_2 = datetime.strptime(time_2,'%Y-%m-%dT%H:%M:%S%z')
# you need to provide datetime.now() your timezone.
timezone = mod_time_1.tzinfo
time_now = datetime.now(timezone)
print(time_now > mod_time_1)

Python: Extract two dates from string

I have a string s which contains two dates in it and I am trying to extract these two dates in order to subtract them from each other to count the number of days in between. In the end I am aiming to get a string like this: s = "o4_24d_20170708_20170801"
At the company where I work we can't install additional packages, so I am looking for a solution using only the Python standard library. Below is what I have so far using the datetime module, which only extracts one date. How can I get both dates out of the string?
import re, datetime
s = "o4_20170708_20170801"
match = re.search('\d{4}\d{2}\d{2}', s)
date = datetime.datetime.strptime(match.group(), '%Y%m%d').date()
print date
from datetime import datetime
import re
s = "o4_20170708_20170801"
pattern = re.compile(r'(\d{8})_(\d{8})')
dates = pattern.search(s)
# dates[0] is full match, dates[1] and dates[2] are captured groups
start = datetime.strptime(dates[1], '%Y%m%d')
end = datetime.strptime(dates[2], '%Y%m%d')
difference = end - start
print(difference.days)
will print
24
then, you could do something like:
days = 'd{}_'.format(difference.days)
match_index = dates.start()
new_name = s[:match_index] + days + s[match_index:]
print(new_name)
to get
o4_d24_20170708_20170801
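The question asks for the form `o4_24d_20170708_20170801`, with the number before the `d`. A minimal sketch that combines the steps above into one function and produces that form could look like this; the function name `add_day_count` is made up for illustration, and it uses only the standard library, as the question requires.

```python
import re
from datetime import datetime


def add_day_count(name):
    """Insert '<days>d_' in front of the first pair of YYYYMMDD dates."""
    match = re.search(r"(\d{8})_(\d{8})", name)
    if match is None:
        return name  # no date pair found, leave the name untouched
    start = datetime.strptime(match.group(1), "%Y%m%d")
    end = datetime.strptime(match.group(2), "%Y%m%d")
    days = (end - start).days
    # Splice the day count in at the position where the first date begins.
    return "{}{}d_{}".format(name[:match.start()], days, name[match.start():])


print(add_day_count("o4_20170708_20170801"))
```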
import re, datetime
s = "o4_20170708_20170801"
match = re.findall('\d{4}\d{2}\d{2}', s)
for a_date in match:
    date = datetime.datetime.strptime(a_date, '%Y%m%d').date()
    print date
This will print:
2017-07-08
2017-08-01
Your regex was working correctly at regexpal

Compare WebElement with Timedelta

I have an HTML table with 5 columns. The third column contains links and the fifth column contains a date.
Now I want to write code that checks whether the date is within the next 4 weeks and, if so, clicks on the link.
This is what i have so far:
# Set the Date
start = time.strftime('%d-%m-%Y')
now = datetime.datetime.strptime(start, '%d-%m-%Y')
date_in_four_weeks = now + datetime.timedelta(days=28)
project_time = date_in_four_weeks - now
# Check the date and click on link
for i in range(project_time.days + 1):
    print(now + timedelta(days=i))
    time = driver.find_element_by_css_selector('css_selector')
    if time <= project_time:
        linkList = driver.find_elements_by_css_selector("css_selector")
        for i in range(0,len(linkList)):
            links = driver.find_elements_by_partial_link_text('SOLI')
            links[i].click()
            driver.get_screenshot_as_file("test.png")
    else:
        print "No Project found"
If I run the code, I get the error:
TypeError: can't compare datetime.timedelta to WebElement
Is there any way I can fix this problem?
Thanks :)
There are a few issues you have to address.
Firstly, you're comparing to a WebElement object, as the error helpfully points out. This includes the tags and such of the HTML element you're referencing. You first want to extract the text.
Then, you need to parse this text to convert it into a Python datetime or date object. A time object won't do because it only stores time, and not date. Since I don't know what format your HTML date is in, I'll just point you to the docs so you can see how the types work and have some idea of how to parse your data.
Finally, you'd still get an error because of trying to compare a timedelta object to a date or datetime object. A timedelta is a period of time, it doesn't relate to a specific date.
You could fix this by replacing
if time <= project_time:
with the same expression you use in your print call:
if time <= now + timedelta(days=i):
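The three steps above (extract the text, parse it, compare dates with dates) can be sketched as a small helper. The date format `%d-%m-%Y` is an assumption borrowed from the question's own `strftime` call, since the actual format of the table cell is unknown, and the name `is_within_next_four_weeks` is made up for illustration.

```python
from datetime import date, datetime, timedelta


def is_within_next_four_weeks(date_text, today, fmt="%d-%m-%Y"):
    """Return True if the date in `date_text` lies between `today`
    and `today` + 28 days, both ends included."""
    parsed = datetime.strptime(date_text.strip(), fmt).date()
    return today <= parsed <= today + timedelta(days=28)


# In the Selenium loop, `date_text` would be the text of the date cell,
# i.e. the WebElement's .text attribute rather than the element itself.
today = date(2018, 1, 1)  # fixed date so the example is reproducible
print(is_within_next_four_weeks("15-01-2018", today))
print(is_within_next_four_weeks("15-03-2018", today))
```

Comparing a date with a date on both sides avoids the timedelta mismatch described above.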

Python - rename files incrementally based on julian day

Problem:
I have a bunch of files that were downloaded from an org. Halfway through their data directory the org changed the naming convention (reasons unknown). I am looking to create a script that will take the files in a directory and rename the file the same way, but simply "go back one day".
Here is a sample of how one file is named: org2015365_res_version.asc
What I need is logic to only change the year day (2015365) in this case to 2015364. This logic needs to span a few years so 2015001 would be 2014365.
I'm not sure this is possible, since it's not working with the current date, so a module like datetime does not seem applicable.
Partial logic I came up with. I know it is rudimentary at best, but wanted to take a stab at it.
# open all files
all_data = glob.glob('/somedir/org*.asc')
# empty array to be appended to
day = []
year = []
# loop through all files
for f in all_data:
    # get first part of string, renders org2015365
    f_split = f.split('_')[0]
    # get only year day - renders 2015365
    year_day = f_split.replace(f_split[:3], '')
    # get only day - renders 365
    days = year_day.replace(year_day[0:4], '')
    day.append(days)
    # get only year - renders 2015
    years = year_day.replace(year_day[4:], '')
    year.append(years)
# convert to int for easier processing
day = [int(i) for i in day]
year = [int(i) for i in year]
if day == 001 & year == 2016:
    day = 365
    year = 2015
elif day == 001 & year == 2015:
    day = 365
    year = 2014
else:
    day = day - 1
Apart from the logic above I also came across the function below from this post, I am not sure what would be the best way to combine that with the partial logic above. Thoughts?
import glob
import os
def rename(dir, pattern, titlePattern):
    for pathAndFilename in glob.iglob(os.path.join(dir, pattern)):
        title, ext = os.path.splitext(os.path.basename(pathAndFilename))
        os.rename(pathAndFilename,
                  os.path.join(dir, titlePattern % title + ext))
rename(r'c:\temp\xx', r'*.doc', r'new(%s)')
Help me, stackoverflow. You're my only hope.
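You can use the datetime module. The %j format code parses the day of the year, so it does not matter that the date is not the current one: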
#First argument - string like 2015365, second argument - format
dt = datetime.datetime.strptime(year_day,'%Y%j')
#Time shift
dt = dt + datetime.timedelta(days=-1)
#Year with shift
nyear = dt.year
#Day in year with shift
nday = dt.timetuple().tm_yday
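As a compact sketch of the same idea, the shifted date can be formatted straight back into a year-day string with `strftime`, which zero-pads `%j` to three digits. The function name `previous_year_day` is made up for illustration.

```python
from datetime import datetime, timedelta


def previous_year_day(year_day):
    """Shift a 'YYYYDDD' (year + day-of-year) string back by one day."""
    dt = datetime.strptime(year_day, "%Y%j") - timedelta(days=1)
    return dt.strftime("%Y%j")  # %j is zero-padded to three digits


# Includes a year boundary and a boundary into a leap year (2016).
for sample in ("2015365", "2015001", "2017001"):
    print(sample, "->", previous_year_day(sample))
```

Because the arithmetic is done on real dates, year boundaries and leap years need no special cases.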
Based on feedback from the community I was able to get the logic needed to fix the files downloaded from the org! The logic was the biggest hurdle. It turns out that the datetime module can be used after all; I need to read up more on that.
I combined the logic with batch renaming using the os module. The code is below to help future users who may have a similar question!
import datetime
import glob
import os
# open all files
all_data = glob.glob('/some_dir/org*.asc')
# loop through
for f in all_data:
    # get first part of string, renders org2015365
    f_split = f.split('_')[1]
    # get only year day - renders 2015365
    year_day = f_split.replace(f_split[:10], '')
    # first argument - string 2015365, second argument - format the string to datetime
    dt = datetime.datetime.strptime(year_day, '%Y%j')
    # create a threshold where version changes its naming convention
    # only rename files greater than threshold
    threshold = '2014336'
    th = datetime.datetime.strptime(threshold, '%Y%j')
    if dt > th:
        # Time shift - go back one day
        dt = dt + datetime.timedelta(days=-1)
        # Year with shift
        nyear = dt.year
        # Day in year with shift
        nday = dt.timetuple().tm_yday
        # rename files correctly
        f_output = 'org' + str(nyear) + str(nday).zfill(3) + '_res_version.asc'
        os.rename(f, '/some_dir/' + f_output)
    else:
        pass

Subtract or add time to web-scraped times

I'm working on a one-off script for myself to get sunset times for Friday and Saturday, in order to determine when Shabbat and Havdalah start. Now, I was able to scrape the times from timeanddate.com -- using BeautifulSoup -- and store them in a list. Unfortunately, I'm stuck with those times; what I would like to do is be able to subtract or add time to them. As Shabbat candle-lighting time is 18 minutes before sunset, I'd like to be able to take the given sunset time for Friday and subtract 18 minutes from it. Here is the code I have thus far:
import datetime
import requests
from BeautifulSoup import BeautifulSoup
# declare all the things here...
day = datetime.date.today().day
month = datetime.date.today().month
year = datetime.date.today().year
soup = BeautifulSoup(requests.get('http://www.timeanddate.com/worldclock/astronomy.html?n=43').text)
# worry not.
times = []
for row in soup('table',{'class':'spad'})[0].tbody('tr'):
    tds = row('td')
    times.append(tds[1].string)
#end for
shabbat_sunset = times[0]
havdalah_time = times[1]
So far, I'm stuck. The objects in times[] are shown to be BeautifulSoup NavigableStrings, which I can't convert into ints (for obvious reasons). Any help would be appreciated, and thank you sososososo much.
EDIT
So, I used the suggestion of using mktime and making BeautifulSoup's string into a regular string. Now I'm getting an OverflowError: mktime out of range when I call mktime on shabbat...
for row in soup('table',{'class':'spad'})[0].tbody('tr'):
    tds = row('td')
    sunsetStr = "%s" % tds[2].text
    sunsetTime = strptime(sunsetStr,"%H:%M")
    shabbat = mktime(sunsetTime)
    candlelighting = mktime(sunsetTime) - 18 * 60
    havdalah = mktime(sunsetTime) + delta * 60
You should use datetime.timedelta.
For example:
time_you_want = datetime.datetime.now() + datetime.timedelta(minutes = 18)
Also see here:
Python Create unix timestamp five minutes in the future
Shalom Shabbat
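Applied to a scraped sunset time rather than to the current time, the timedelta approach might look like the sketch below. The date and time strings are invented sample values, the format string is the one used for this page elsewhere in the thread, and `candle_lighting` is a hypothetical helper name. Parsing the date together with the time also sidesteps the OverflowError from the question's edit, which is probably caused by `strptime` filling in the year 1900 when only "%H:%M" is given.

```python
from datetime import datetime, timedelta

# Date column plus sunset column, e.g. "Mar 13, 2015" and "7:02 PM".
# Note that %b and %p depend on the locale; an English locale is assumed.
SUNSET_FORMAT = "%b %d, %Y %I:%M %p"


def candle_lighting(date_text, sunset_text, minutes_before=18):
    """Return the sunset datetime minus `minutes_before` minutes."""
    sunset = datetime.strptime("%s %s" % (date_text, sunset_text), SUNSET_FORMAT)
    return sunset - timedelta(minutes=minutes_before)


# Invented sample values standing in for the scraped table cells.
print(candle_lighting("Mar 13, 2015", "7:02 PM"))
```

Adding time for Havdalah works the same way, with `+ timedelta(minutes=...)` in place of the subtraction.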
The approach I'd take is to parse the complete time into a normal representation - in Python world, this representation is the number of seconds since the Unix epoch, 1 Jan 1970 midnight. To do this, you also need to look at column 0. (Incidentally, tds[1] is the sunrise time, not what I think you want.)
See below:
#!/usr/bin/env python
import requests
from BeautifulSoup import BeautifulSoup
from time import mktime, strptime, asctime, localtime
soup = BeautifulSoup(requests.get('http://www.timeanddate.com/worldclock/astronomy.html?n=43').text)
# worry not.
(shabbat, havdalah) = (None, None)
for row in soup('table',{'class':'spad'})[0].tbody('tr'):
    tds = row('td')
    sunsetStr = "%s %s" % (tds[0].text, tds[2].text)
    sunsetTime = strptime(sunsetStr, "%b %d, %Y %I:%M %p")
    if sunsetTime.tm_wday == 4: # Friday
        shabbat = mktime(sunsetTime) - 18 * 60
    elif sunsetTime.tm_wday == 5: # Saturday
        havdalah = mktime(sunsetTime)
print "Shabbat - 18 Minutes: %s" % asctime(localtime(shabbat))
print "Havdalah %s" % asctime(localtime(havdalah))
Second, help to help yourself: The 'tds' list is a list of BeautifulSoup.Tag. To get documentation on this object, open a Python terminal, type
import BeautifulSoup
help(BeautifulSoup.Tag)
