I need a function (get_years) that iterates over a list of reviews such as:
{'rating': 5.0,
'reviewer_name': 'Karen',
'product_id': 'B00004RFRV',
'review_title': 'Bialetti is the Best!',
'review_time': '11 12, 2017',
'images': ['https://images-na.ssl-images-amazon.com/images/I/81+XxFRGyBL._SY88.jpg'],
'styles': {'Size:': ' 12-Cup', 'Color:': ' Silver'}}
{'rating': 3.0,
'reviewer_name': 'Peter DP',
'product_id': 'B00005OTXM',
'review_title': "Mr. Coffee DWX23 12-cup doesn't have the quality feel as my 13 year old nearly identical 12-cup Mr. Coffee",
'review_time': '04 17, 2015',
'images': ['https://images-na.ssl-images-amazon.com/images/I/71sFKwTW9sL._SY88.jpg'],
'styles': {'Style Name:': ' COFFEE MAKER ONLY'}}
{'rating': 5.0,
'reviewer_name': 'B. Laska',
'product_id': 'B00004RFRV',
'review_title': 'Love my Moka pots!',
'review_time': '07 9, 2015',
'images': ['https://images-na.ssl-images-amazon.com/images/I/719NCqw4GML._SY88.jpg'],
'styles': {'Size:': ' 1-Cup', 'Color:': ' Silver'}}
to be able to return:
print(get_years(reviews)) # [2007, 2008, 2009, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018]
print(type(get_years(reviews))) # <class 'list'>
I have:
def get_years(reviews):
    review_years_set = set()
    for review in reviews:
        review_years_set.add(review['review_time'][-4:])
    review_years_list = list(review_years_set)
    review_years_list.sort()
    return review_years_list
which gives me what I want but it seems like the longer route. Is there a more Pythonic or efficient way to get a sorted list of set values?
Given an iterable of string-formatted dates, e.g.:
dates = ['07 9, 2007', '04 1, 2008', '01 2, 2007', '08 2, 2014', '01 3, 2004', '01 4, 2004']
A concise way to produce a sorted list of unique years is to use a set comprehension:
sorted_dates = sorted({int(date[-4:]) for date in dates})
print(sorted_dates)
Output:
[2004, 2007, 2008, 2014]
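If slicing off the last four characters feels fragile, the year can also be read by parsing each date. This is only a minimal sketch, assuming every review_time follows the 'MM D, YYYY' layout shown in the question; the sample reviews list is trimmed to the one field that matters and is purely illustrative.

```python
from datetime import datetime

def get_years(reviews):
    """Return the sorted, unique review years as ints."""
    # Assumption: each review_time looks like '11 12, 2017' (month day, year).
    return sorted({
        datetime.strptime(review['review_time'], '%m %d, %Y').year
        for review in reviews
    })

# Illustrative sample, reduced to the 'review_time' field only.
reviews = [
    {'review_time': '11 12, 2017'},
    {'review_time': '04 17, 2015'},
    {'review_time': '07 9, 2015'},
]
print(get_years(reviews))  # [2015, 2017]
```

Parsing costs a little more than slicing, but a malformed date raises ValueError instead of silently producing a wrong year.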
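Try this. A set comprehension removes the duplicate years, and int() makes the result match the expected output:
def get_years(reviews):
    return sorted({int(review['review_time'][-4:]) for review in reviews})
print(get_years(reviews))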
I've only tried
from datetime import datetime
my_dates = ['5-Nov-18', '25-Mar-17', '1-Nov-18', '7 Mar 17']
my_dates.sort(key=lambda date: datetime.strptime(date, "%d-%b-%y"))
print(my_dates)
But how can I make this work for date formats like
my_dates = ['05 Nov 2018', '25 Mar 2017', '01 Nov 2018', '07 Mar 2017']
One inelegant solution which comes to mind is replacing all spaces with dashes as shown below:
from datetime import datetime
my_dates = ['5-Nov-18', '25-Mar-17', '1-Nov-18', '7 Mar 17']
my_dates.sort(key=lambda date: datetime.strptime(date.replace(' ', '-'), "%d-%b-%y"))
print(my_dates)
If all you're dealing with is the specific set of dates you have provided, just change the format passed to strptime():
my_dates = ['05 Nov 2018', '25 Mar 2017', '01 Nov 2018', '07 Mar 2017']
my_dates.sort(key=lambda date: datetime.strptime(date, "%d %b %Y"))
# The separators changed from dashes to spaces, and the last %y changed to %Y (four-digit year)
However, if you have varying date formats in your list, you could prepare a list of possibly anticipated string formats and define your own function to parse against each format:
def func(date, formats):
    for frmt in formats:
        try:
            str_date = datetime.strptime(date, frmt)
            return str_date
        except ValueError:
            continue
    # might want to consider handling a scenario when nothing is returned
my_formats = ['%d-%b-%y', '%d %b %y', '%d %b %Y']
my_dates = ['5-Nov-18', '25-Mar-17', '1-Nov-18', '7 Mar 17', '05 Nov 2018', '25 Mar 2017', '01 Nov 2018', '07 Mar 2017']
my_dates.sort(key=lambda date: func(date, my_formats))
print(my_dates)
# ['7 Mar 17', '07 Mar 2017', '25-Mar-17', '25 Mar 2017', '1-Nov-18', '01 Nov 2018', '5-Nov-18', '05 Nov 2018']
The caveat here is that if an unanticipated date format shows up, the function returns None. In Python 3, comparing None with a datetime raises a TypeError, so the sort fails outright rather than merely misordering. If that's a concern, add some handling at the end of func() for the case where every parsing attempt fails. Some devs might also say to avoid try...except..., but this is the only way I could come up with.
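One possible way to handle the "nothing parsed" case is to sort on a tuple so that unparseable strings are pushed to the end instead of breaking the comparison. The names parse_date and sort_key below are hypothetical, and placing bad values last is just one design choice; raising an error would be equally reasonable.

```python
from datetime import datetime

def parse_date(text, formats):
    """Return a datetime for the first format that fits, else None."""
    for fmt in formats:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    return None

def sort_key(text, formats):
    """Sort parsed dates chronologically and unparseable strings last."""
    parsed = parse_date(text, formats)
    # The leading bool keeps None out of any datetime comparison.
    return (parsed is None, parsed or datetime.min)

my_formats = ['%d-%b-%y', '%d %b %y', '%d %b %Y']
my_dates = ['5-Nov-18', 'not a date', '07 Mar 2017', '25-Mar-17']
my_dates.sort(key=lambda date: sort_key(date, my_formats))
print(my_dates)
# ['07 Mar 2017', '25-Mar-17', '5-Nov-18', 'not a date']
```

Note that %b depends on the locale, so this assumes English month abbreviations.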
Dates can be sorted once they are datetime objects:
from datetime import datetime as dt
my_dates = ['05 Nov 2018', '25 Mar 2017', '01 Nov 2018', '07 Mar 2017']
my_dates = [dt.strptime(date, '%d %b %Y') for date in my_dates]
print(my_dates)
# [datetime.datetime(2018, 11, 5, 0, 0), datetime.datetime(2017, 3, 25, 0, 0), datetime.datetime(2018, 11, 1, 0, 0), datetime.datetime(2017, 3, 7, 0, 0)]
my_dates.sort()
print(my_dates)
# [datetime.datetime(2017, 3, 7, 0, 0), datetime.datetime(2017, 3, 25, 0, 0), datetime.datetime(2018, 11, 1, 0, 0), datetime.datetime(2018, 11, 5, 0, 0)]
my_dates = [ date.strftime('%d %b %Y') for date in my_dates ]
print(my_dates)
# ['07 Mar 2017', '25 Mar 2017', '01 Nov 2018', '05 Nov 2018']
Could someone please help me understand why the output for the below code is missing the first row of the table? I'm new to python, and not for lack of trying, I haven't been able to troubleshoot this myself.
import requests
import csv
from collections import OrderedDict
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq
def printfunction():
    with open("C:/Users/.../audusd.csv", 'a', newline='') as f:
        wr = csv.writer(f)
        wr.writerows([(data[0], data[1], data[2], data[3], data[4], data[5])])
url = requests.get("https://au.investing.com/currencies/aud-usd-historical-data/",
                   headers={'User-Agent': 'Mozilla/5.0'})
od = OrderedDict()
content_page = soup(url.content, 'html.parser')
table = content_page.find('table', {'class': 'genTbl closedTbl historicalTbl'})
cols = [th.text for th in table.select("th")[1:]]
for row in table.select("tr + tr"):
    data = [td.text for td in row.select("td")]
    printfunction()
    print(data)
Output appears thus:
['Aug 23, 2018', '0.7246', '0.7349', '0.7355', '0.7240', '-1.37%']
['Aug 22, 2018', '0.7347', '0.7370', '0.7371', '0.7332', '-0.33%']
['Aug 21, 2018', '0.7371', '0.7341', '0.7383', '0.7332', '0.42%']
['Aug 20, 2018', '0.7340', '0.7306', '0.7344', '0.7294', '0.44%']
['Aug 19, 2018', '0.7308', '0.7316', '0.7317', '0.7308', '-0.05%']
['Aug 17, 2018', '0.7312', '0.7261', '0.7321', '0.7253', '0.70%']
['Aug 16, 2018', '0.7261', '0.7240', '0.7288', '0.7222', '0.30%']
['Aug 15, 2018', '0.7239', '0.7247', '0.7249', '0.7202', '-0.08%']
['Aug 14, 2018', '0.7245', '0.7270', '0.7284', '0.7222', '-0.33%']
['Aug 13, 2018', '0.7269', '0.7289', '0.7300', '0.7248', '-0.25%']
['Aug 12, 2018', '0.7287', '0.7278', '0.7300', '0.7273', '-0.21%']
['Aug 10, 2018', '0.7302', '0.7372', '0.7381', '0.7279', '-0.95%']
['Aug 09, 2018', '0.7372', '0.7435', '0.7456', '0.7371', '-0.81%']
['Aug 08, 2018', '0.7432', '0.7420', '0.7440', '0.7382', '0.15%']
['Aug 07, 2018', '0.7421', '0.7386', '0.7440', '0.7379', '0.46%']
['Aug 06, 2018', '0.7387', '0.7398', '0.7406', '0.7372', '-0.09%']
['Aug 05, 2018', '0.7394', '0.7397', '0.7400', '0.7394', '-0.08%']
['Aug 03, 2018', '0.7400', '0.7359', '0.7412', '0.7346', '0.54%']
['Aug 02, 2018', '0.7360', '0.7405', '0.7413', '0.7354', '-0.59%']
['Aug 01, 2018', '0.7404', '0.7427', '0.7430', '0.7389', '-0.34%']
['Jul 31, 2018', '0.7429', '0.7408', '0.7442', '0.7402', '0.30%']
['Jul 30, 2018', '0.7407', '0.7390', '0.7416', '0.7387', '0.12%']
['Jul 29, 2018', '0.7398', '0.7400', '0.7406', '0.7398', '-0.04%']
['Jul 27, 2018', '0.7401', '0.7377', '0.7416', '0.7369', '0.31%']
['Jul 26, 2018', '0.7378', '0.7456', '0.7464', '0.7370', '-1.03%']
['Jul 25, 2018', '0.7455', '0.7424', '0.7466', '0.7391', '0.46%']
Desired output (as per source table):
['Aug 24, 2018', 'x', 'x', 'x', 'x', 'x']
['Aug 23, 2018', '0.7246', '0.7349', '0.7355', '0.7240', '-1.37%']
['Aug 22, 2018', '0.7347', '0.7370', '0.7371', '0.7332', '-0.33%']
['Aug 21, 2018', '0.7371', '0.7341', '0.7383', '0.7332', '0.42%']
['Aug 20, 2018', '0.7340', '0.7306', '0.7344', '0.7294', '0.44%']
['Aug 19, 2018', '0.7308', '0.7316', '0.7317', '0.7308', '-0.05%']
['Aug 17, 2018', '0.7312', '0.7261', '0.7321', '0.7253', '0.70%']
['Aug 16, 2018', '0.7261', '0.7240', '0.7288', '0.7222', '0.30%']
['Aug 15, 2018', '0.7239', '0.7247', '0.7249', '0.7202', '-0.08%']
['Aug 14, 2018', '0.7245', '0.7270', '0.7284', '0.7222', '-0.33%']
['Aug 13, 2018', '0.7269', '0.7289', '0.7300', '0.7248', '-0.25%']
['Aug 12, 2018', '0.7287', '0.7278', '0.7300', '0.7273', '-0.21%']
['Aug 10, 2018', '0.7302', '0.7372', '0.7381', '0.7279', '-0.95%']
['Aug 09, 2018', '0.7372', '0.7435', '0.7456', '0.7371', '-0.81%']
['Aug 08, 2018', '0.7432', '0.7420', '0.7440', '0.7382', '0.15%']
['Aug 07, 2018', '0.7421', '0.7386', '0.7440', '0.7379', '0.46%']
['Aug 06, 2018', '0.7387', '0.7398', '0.7406', '0.7372', '-0.09%']
['Aug 05, 2018', '0.7394', '0.7397', '0.7400', '0.7394', '-0.08%']
['Aug 03, 2018', '0.7400', '0.7359', '0.7412', '0.7346', '0.54%']
['Aug 02, 2018', '0.7360', '0.7405', '0.7413', '0.7354', '-0.59%']
['Aug 01, 2018', '0.7404', '0.7427', '0.7430', '0.7389', '-0.34%']
['Jul 31, 2018', '0.7429', '0.7408', '0.7442', '0.7402', '0.30%']
['Jul 30, 2018', '0.7407', '0.7390', '0.7416', '0.7387', '0.12%']
['Jul 29, 2018', '0.7398', '0.7400', '0.7406', '0.7398', '-0.04%']
['Jul 27, 2018', '0.7401', '0.7377', '0.7416', '0.7369', '0.31%']
['Jul 26, 2018', '0.7378', '0.7456', '0.7464', '0.7370', '-1.03%']
['Jul 25, 2018', '0.7455', '0.7424', '0.7466', '0.7391', '0.46%']
Many thanks!
OM.
The selector tr + tr means "a tr that's after a tr". So the first row doesn't show up because you're specifically asking for it not to show up. If you want to select all the rows, just select plain tr.
If you don't know how selectors work, and just copied this from some other code that seemed close, read the docs.
If you were trying to do this because there's a tr inside the thead and you wanted to skip that one, this is not the way to do it.
You could try to come up with a complicated selector for every tr that's either before or after another tr (and hope you never run into a one-row table…), or something like that.
But, more simply, just select every tr that's inside a tbody.
for row in table.select('tbody tr'):
… or directly inside:
for row in table.select('tbody > tr'):
Or just select all the rows inside the tbody of the table:
for row in table.tbody.select('tr'):
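To see the difference between the selectors without hitting the real site, here is a small sketch on a made-up table. The HTML and the cell_text helper are hypothetical and only mimic the thead/tbody layout described above; the real page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical table: one header row in thead, three data rows in tbody.
html = (
    "<table>"
    "<thead><tr><th>Date</th></tr></thead>"
    "<tbody>"
    "<tr><td>row 1</td></tr>"
    "<tr><td>row 2</td></tr>"
    "<tr><td>row 3</td></tr>"
    "</tbody>"
    "</table>"
)
table = BeautifulSoup(html, "html.parser").find("table")

def cell_text(rows):
    """Collect the text of every td in the given rows."""
    return [td.text for row in rows for td in row.select("td")]

# 'tr + tr' only matches a tr whose previous sibling is also a tr,
# so the first row of the tbody is skipped.
adjacent = cell_text(table.select("tr + tr"))
# 'tbody > tr' matches every row that is a direct child of tbody.
in_body = cell_text(table.select("tbody > tr"))

print(adjacent)  # ['row 2', 'row 3']
print(in_body)   # ['row 1', 'row 2', 'row 3']
```

Because the header row sits alone in the thead, the first data row has no preceding tr sibling, which is why 'tr + tr' drops it.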
I need to extract dates from strings using regex in python and the dates can be in one of many formats, and between some random text.
The date formats are:
04/20/2009; 04/20/09; 4/20/09; 4/3/09
Mar-20-2009; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009;
20 Mar 2009; 20 March 2009; 20 Mar. 2009; 20 March, 2009
Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009
Feb 2009; Sep 2009; Oct 2010
6/2008; 12/2009
2009; 2010
After extracting the dates, I need to sort them in ascending order.
I've tried these six regex patterns, but they don't seem to cover every case.
pattern1 = r'((?:\d{1,2}[- ,./]*)(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*[- ,./]*\d{4})'
pattern2 = r'((?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*[ ,./-]*\d{1,2}[ ,./-]*\d{4})'
pattern3 = r'((?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*[ ,./-]*\d{4})'
pattern4 = r'((?:\d{1,2}[/-]\d{1,2}[/-](?:\d{4}|\d{2})))'
pattern5 = r'(?:(\s\d{2}[/-](?:\d{4})))'
pattern6 = r'(?:\d{4})'
It might be useful to set up some intermediate variables.
import re
short_month_names = (
    'Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
    'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'
)
long_month_names = (
    'January', 'February', 'March', 'April', 'May', 'June', 'July',
    'August', 'September', 'October', 'November', 'December'
)
short_month_cap = '(?:' + '|'.join(short_month_names) + ')'
long_month_cap = '(?:' + '|'.join(long_month_names) + ')'
# Numeric months: 1[0-2] covers 10, 11 and 12
short_num_month_cap = '(?:[1-9]|1[0-2])'
long_num_month_cap = '(?:0[1-9]|1[0-2])'
long_day_cap = '(?:0[1-9]|[12][0-9]|3[01])'
short_day_cap = '(?:[1-9]|[12][0-9]|3[01])'
long_year_cap = '(?:[0-9]{3}[1-9]|[0-9]{2}[1-9][0-9]|[0-9][1-9][0-9]{2}|[1-9][0-9]{3})'
short_year_cap = '(?:[0-9][0-9])'
ordinal_day = '(?:2?1st|2?2nd|2?3rd|[12]?[4-9]th|1[123]th|[123]0th|31st)'
formats = (
    r'(?P<month_0>{lnm}|{snm})/(?P<day_0>{ld}|{sd})/(?P<year_0>{sy}|{ly})',
    r'(?P<month_1>{sm})\-(?P<day_1>{ld}|{sd})\-(?P<year_1>{ly})',
    r'(?P<month_2>{sm}|{lm})(?:\.\s+|\s*)(?P<day_2>{ld}|{sd})(?:,\s+|\s*)(?P<year_2>{ly})',
    r'(?P<day_3>{ld}|{sd})(?:[\.,]\s+|\s*)(?P<month_3>{lm}|{sm})(?:[\.,]\s+|\s*)(?P<year_3>{ly})',
    r'(?P<month_4>{lm}|{sm})\s+(?P<year_4>{ly})',
    r'(?P<month_5>{lnm}|{snm})/(?P<year_5>{ly})',
    r'(?P<year_6>{ly})',
    r'(?P<month_6>{sm})\s+(?P<day_4>(?={od})[0-9][0-9]?)..,\s*(?P<year_7>{ly})'
)
_pattern = '|'.join(
    i.format(
        sm=short_month_cap, lm=long_month_cap, snm=short_num_month_cap,
        lnm=long_num_month_cap, ld=long_day_cap, sd=short_day_cap,
        ly=long_year_cap, sy=short_year_cap, od=ordinal_day
    ) for i in formats
)
pattern = re.compile(_pattern)
def get_fields(match):
    if not match:
        return None
    # Strip the numeric suffix ('_0', '_1', ...) from each group name
    return {
        k[:-2]: v
        for k, v in match.groupdict().items()
        if v is not None
    }
tests = r'''04/20/2009; 04/20/09; 4/20/09; 4/3/09
Mar-20-2009; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009
20 Mar 2009; 20 March 2009; 20 Mar. 2009; 20 March, 2009
Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009
Feb 2009; Sep 2009; Oct 2010
6/2008; 12/2009
2009; 2010'''
for test_line in tests.split('\n'):
    for test in test_line.split('; '):
        print('{!r}: {!r}'.format(test, get_fields(pattern.fullmatch(test))))
    print('')
Which outputs:
'04/20/2009': {'month': '04', 'day': '20', 'year': '2009'}
'04/20/09': {'month': '04', 'day': '20', 'year': '09'}
'4/20/09': {'month': '4', 'day': '20', 'year': '09'}
'4/3/09': {'month': '4', 'day': '3', 'year': '09'}
'Mar-20-2009': {'month': 'Mar', 'day': '20', 'year': '2009'}
'Mar 20, 2009': {'month': 'Mar', 'day': '20', 'year': '2009'}
'March 20, 2009': {'month': 'March', 'day': '20', 'year': '2009'}
'Mar. 20, 2009': {'month': 'Mar', 'day': '20', 'year': '2009'}
'Mar 20 2009': {'month': 'Mar', 'day': '20', 'year': '2009'}
'20 Mar 2009': {'day': '20', 'month': 'Mar', 'year': '2009'}
'20 March 2009': {'day': '20', 'month': 'March', 'year': '2009'}
'20 Mar. 2009': {'day': '20', 'month': 'Mar', 'year': '2009'}
'20 March, 2009': {'day': '20', 'month': 'March', 'year': '2009'}
'Mar 20th, 2009': {'month': 'Mar', 'day': '20', 'year': '2009'}
'Mar 21st, 2009': {'month': 'Mar', 'day': '21', 'year': '2009'}
'Mar 22nd, 2009': {'month': 'Mar', 'day': '22', 'year': '2009'}
'Feb 2009': {'month': 'Feb', 'year': '2009'}
'Sep 2009': {'month': 'Sep', 'year': '2009'}
'Oct 2010': {'month': 'Oct', 'year': '2010'}
'6/2008': {'month': '6', 'year': '2008'}
'12/2009': {'month': '12', 'year': '2009'}
'2009': {'year': '2009'}
'2010': {'year': '2010'}
The main part is the formats variable, where all the different formats are defined. It matches slightly more than what is defined, and can easily be extended.
The overall pattern ends up being:
'(?P<month_0>(?:0[1-9]|1[0-2])|(?:[1-9]|1[0-2]))/(?P<day_0>(?:0[1-9]|[12][0-9]|3[01])|(?:[1-9]|[12][0-9]|3[01]))/(?P<year_0>(?:[0-9][0-9])|(?:[0-9]{3}[1-9]|[0-9]{2}[1-9][0-9]|[0-9][1-9][0-9]{2}|[1-9][0-9]{3}))|(?P<month_1>(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec))\\-(?P<day_1>(?:0[1-9]|[12][0-9]|3[01])|(?:[1-9]|[12][0-9]|3[01]))\\-(?P<year_1>(?:[0-9]{3}[1-9]|[0-9]{2}[1-9][0-9]|[0-9][1-9][0-9]{2}|[1-9][0-9]{3}))|(?P<month_2>(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)|(?:January|February|March|April|May|June|July|August|September|October|November|December))(?:\\.\\s+|\\s*)(?P<day_2>(?:0[1-9]|[12][0-9]|3[01])|(?:[1-9]|[12][0-9]|3[01]))(?:,\\s+|\\s*)(?P<year_2>(?:[0-9]{3}[1-9]|[0-9]{2}[1-9][0-9]|[0-9][1-9][0-9]{2}|[1-9][0-9]{3}))|(?P<day_3>(?:0[1-9]|[12][0-9]|3[01])|(?:[1-9]|[12][0-9]|3[01]))(?:[\\.,]\\s+|\\s*)(?P<month_3>(?:January|February|March|April|May|June|July|August|September|October|November|December)|(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec))(?:[\\.,]\\s+|\\s*)(?P<year_3>(?:[0-9]{3}[1-9]|[0-9]{2}[1-9][0-9]|[0-9][1-9][0-9]{2}|[1-9][0-9]{3}))|(?P<month_4>(?:January|February|March|April|May|June|July|August|September|October|November|December)|(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec))\\s+(?P<year_4>(?:[0-9]{3}[1-9]|[0-9]{2}[1-9][0-9]|[0-9][1-9][0-9]{2}|[1-9][0-9]{3}))|(?P<month_5>(?:0[1-9]|1[0-2])|(?:[1-9]|1[0-2]))/(?P<year_5>(?:[0-9]{3}[1-9]|[0-9]{2}[1-9][0-9]|[0-9][1-9][0-9]{2}|[1-9][0-9]{3}))|(?P<year_6>(?:[0-9]{3}[1-9]|[0-9]{2}[1-9][0-9]|[0-9][1-9][0-9]{2}|[1-9][0-9]{3}))|(?P<month_6>(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec))\\s+(?P<day_4>(?=(?:2?1st|2?2nd|2?3rd|[12]?[4-9]th|1[123]th|[123]0th|31st))[0-9][0-9]?)..,\\s*(?P<year_7>(?:[0-9]{3}[1-9]|[0-9]{2}[1-9][0-9]|[0-9][1-9][0-9]{2}|[1-9][0-9]{3}))'
Which would have been virtually impossible to write by hand.
The bounds for the "between random text" can be added around _pattern.
I would suggest _pattern = r'\b(?:{})\b'.format(_pattern).
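For the sorting half of the question, one option is to turn each dict of captured fields into a (year, month, day) tuple and sort on that. The sketch below is not part of the answer above: to_sort_key and MONTHS are hypothetical names, a missing day or month defaults to 1, and two-digit years are assumed to fall in the 2000s. Adjust those assumptions to suit the data.

```python
MONTHS = ('jan', 'feb', 'mar', 'apr', 'may', 'jun',
          'jul', 'aug', 'sep', 'oct', 'nov', 'dec')

def to_sort_key(fields, century=2000):
    """Map a dict of captured strings to a sortable (year, month, day)."""
    year = int(fields['year'])
    if len(fields['year']) == 2:
        year += century  # assumption: '09' means 2009
    month = fields.get('month', '1')  # assumption: no month means January
    if month.isdigit():
        month = int(month)
    else:
        # 'Mar' and 'March' both reduce to 'mar'
        month = MONTHS.index(month[:3].lower()) + 1
    day = int(fields.get('day', '1'))  # assumption: no day means the 1st
    return (year, month, day)

# Field dicts in the shape shown in the output above.
found = [
    {'month': 'Mar', 'day': '20', 'year': '2009'},
    {'year': '2010'},
    {'month': '6', 'year': '2008'},
    {'month': '4', 'day': '3', 'year': '09'},
]
found.sort(key=to_sort_key)
print([to_sort_key(f) for f in found])
# [(2008, 6, 1), (2009, 3, 20), (2009, 4, 3), (2010, 1, 1)]
```

Tuples compare element by element, so no datetime conversion is needed just to get ascending order.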