Fill in missing values for missing dates in dataframe - python

I have the following dataframe:
df = pd.DataFrame(
{
'status': ['open', 'closed', 'open', 'closed', 'open', 'closed', 'open', 'closed'],
'month': ['January 2020', 'January 2020', 'February 2020', 'February 2020', 'April 2020', 'April 2020', 'August 2020', 'August 2020'],
'counts': [10, 12, 32, 12, 19, 40, 10, 11]
}
)
status month counts
0 open January 2020 10
1 closed January 2020 12
2 open February 2020 32
3 closed February 2020 12
4 open April 2020 19
5 closed April 2020 40
6 open August 2020 10
7 closed August 2020 11
I'm trying to get a stacked bar plot using seaborn:
sns.histplot(df, x='month', weights='counts', hue='status', multiple='stack')
The purpose is to get a plot with a continuous timeseries without missing months. How can I fill in the missing rows with values so that the dataframe would look like below?
status month counts
open January 2020 10
closed January 2020 12
open February 2020 32
closed February 2020 12
open March 2020 0
closed March 2020 0
open April 2020 19
closed April 2020 40
open May 2020 0
closed May 2020 0
open June 2020 0
closed June 2020 0
open July 2020 0
closed July 2020 0
open August 2020 10
closed August 2020 11

You could pivot the dataframe, and then reindex with the desired months.
import pandas as pd
df = pd.DataFrame({'status': ['open', 'closed', 'open', 'closed', 'open', 'closed', 'open', 'closed'],
'month': ['January 2020', 'January 2020', 'February 2020', 'February 2020', 'April 2020', 'April 2020', 'August 2020', 'August 2020'],
'counts': [10, 12, 32, 12, 19, 40, 10, 11]})
months = [f'{m} 2020' for m in ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August']]
df_pivoted = df.pivot(values='counts', index='month', columns='status').reindex(months).fillna(0)
ax = df_pivoted.plot.bar(stacked=True, width=1, ec='black', rot=0, figsize=(12, 5))
A seaborn solution could use order=. That parameter is not available for histplot, only for barplot, which does not stack bars.
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
df = pd.DataFrame({'status': ['open', 'closed', 'open', 'closed', 'open', 'closed', 'open', 'closed'],
'month': ['January 2020', 'January 2020', 'February 2020', 'February 2020', 'April 2020', 'April 2020', 'August 2020', 'August 2020'],
'counts': [10, 12, 32, 12, 19, 40, 10, 11]})
plt.figure(figsize=(12, 5))
months = [f'{m} 2020' for m in ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August']]
ax = sns.barplot(data=df, x='month', y='counts', hue='status', order=months)
plt.tight_layout()
plt.show()
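If the long-format frame shown in the question is needed (rather than the pivoted one), one possible sketch is to reindex against every month/status combination and fill the gaps with 0. The names `full_index` and `filled` are illustrative and not part of the answer above, and the month list is hard-coded in the same way.

```python
import pandas as pd

df = pd.DataFrame({
    'status': ['open', 'closed'] * 4,
    'month': ['January 2020', 'January 2020', 'February 2020', 'February 2020',
              'April 2020', 'April 2020', 'August 2020', 'August 2020'],
    'counts': [10, 12, 32, 12, 19, 40, 10, 11],
})

# Every month that should appear on the axis, in display order
months = [f'{m} 2020' for m in ['January', 'February', 'March', 'April',
                                'May', 'June', 'July', 'August']]

# All (month, status) pairs, including the ones missing from df
full_index = pd.MultiIndex.from_product(
    [months, ['open', 'closed']], names=['month', 'status']
)

# Reindex against the full grid; missing pairs get a count of 0
filled = (
    df.set_index(['month', 'status'])
      .reindex(full_index, fill_value=0)
      .reset_index()[['status', 'month', 'counts']]
)
print(filled)
```

This assumes each (month, status) pair occurs at most once in the input; duplicates would have to be aggregated first.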

Related

set and sort function in loop

I need to be able to use function (get_years) to iterate a list of reviews such as:
{'rating': 5.0,
'reviewer_name': 'Karen',
'product_id': 'B00004RFRV',
'review_title': 'Bialetti is the Best!',
'review_time': '11 12, 2017',
'images': ['https://images-na.ssl-images-amazon.com/images/I/81+XxFRGyBL._SY88.jpg'],
'styles': {'Size:': ' 12-Cup', 'Color:': ' Silver'}}
{'rating': 3.0,
'reviewer_name': 'Peter DP',
'product_id': 'B00005OTXM',
'review_title': "Mr. Coffee DWX23 12-cup doesn't have the quality feel as my 13 year old nearly identical 12-cup Mr. Coffee",
'review_time': '04 17, 2015',
'images': ['https://images-na.ssl-images-amazon.com/images/I/71sFKwTW9sL._SY88.jpg'],
'styles': {'Style Name:': ' COFFEE MAKER ONLY'}}
{'rating': 5.0,
'reviewer_name': 'B. Laska',
'product_id': 'B00004RFRV',
'review_title': 'Love my Moka pots!',
'review_time': '07 9, 2015',
'images': ['https://images-na.ssl-images-amazon.com/images/I/719NCqw4GML._SY88.jpg'],
'styles': {'Size:': ' 1-Cup', 'Color:': ' Silver'}}
to be able to return:
print(get_years(reviews)) # [2007, 2008, 2009, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018]
print(type(get_years(reviews))) # <class 'list'>
I have:
def get_years(reviews):
    review_years_set = set()
    for review in reviews:
        review_years_set.add(review['review_time'][-4:])
    review_years_list = list(review_years_set)
    review_years_list.sort()
    return review_years_list
which gives me what I want but it seems like the longer route. Is there a more Pythonic or efficient way to get a sorted list of set values?
Given an iterable of string-formatted dates, e.g.:
dates = ['07 9, 2007', '04 1, 2008', '01 2, 2007', '08 2, 2014', '01 3, 2004', '01 4, 2004']
A concise way to produce a sorted list of unique years is as follows using set comprehension:
sorted_dates = sorted({int(date[-4:]) for date in dates})
print(sorted_dates)
Output:
[2004, 2007, 2008, 2014]
Try this:
def get_years(reviews):
    # A set comprehension drops duplicate years before sorting
    return sorted({int(review['review_time'][-4:]) for review in reviews})
print(get_years(reviews))
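Slicing the last four characters relies on the year always sitting at the end of the string. A slightly more defensive sketch parses the whole value with datetime.strptime instead; the format string '%m %d, %Y' is an assumption based on the three sample reviews, and the trimmed-down `reviews` list below is made up for illustration.

```python
from datetime import datetime


def get_years(reviews):
    """Return the sorted, de-duplicated review years as integers."""
    return sorted({
        datetime.strptime(review['review_time'], '%m %d, %Y').year
        for review in reviews
    })


# Sample records reduced to the only field used here
reviews = [
    {'review_time': '11 12, 2017'},
    {'review_time': '04 17, 2015'},
    {'review_time': '07 9, 2015'},
]
print(get_years(reviews))  # [2015, 2017]
```

A malformed review_time raises ValueError here instead of silently producing a wrong "year".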

max of different classes in pivot table pandas python

I have a dataset as following
df = pd.DataFrame([[2020, 'Jan', 1],
[2020, 'Jan', 2],
[2020, 'Jan', 3],
[2020, 'Feb', 4],
[2020, 'Feb', 5],
[2020, 'Feb', 6],
[2021, 'Jan', 7],
[2021, 'Jan', 8],
[2021, 'Jan', 9],
[2021, 'Feb', 10],
[2021, 'Feb', 11],
[2021, 'Feb', 12],
[2022, 'Jan', 13],
[2022, 'Jan', 14],
[2022, 'Jan', 15],
[2022, 'Feb', 16],
[2022, 'Feb', 17],
[2022, 'Feb', 18]],
columns=['Year', 'Month', 'Sale ($)'])
which shows the sales of different months and years.
Using pandas.pivot_table function, I can create a pivot table to calculate sum of sales for different months and years.
df.pivot_table(index='Month', columns='Year', aggfunc='sum')
And that generates a summary statistics table as following:
Month 2020 2021 2022
Jan 6 24 42
Feb 15 33 51
Here is my question: How can I identify the month at which in each year I had the highest sale?
I know that for the years 2020, 2021, and 2022, my highest sales were $15, $33, and $51, respectively.
I can achieve this by adding .max() into the end of the code
df.pivot_table(index='Month', columns='Year',aggfunc='sum').max()
and this returns exactly that:
Year Sale
2020 15
2021 33
2022 51
And in the case of this example, the maximum sales were all in February so how can I write a function that returns not the maximum value, but the month which had the maximum sale?
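One approach that should work here is DataFrame.idxmax(), which returns the row label (here the month) of the maximum in each column instead of the maximum itself. The sketch below rebuilds the same data more compactly and passes values= explicitly so the columns are just the years; the names `table` and `best_month` are illustrative.

```python
import pandas as pd

# Same 18 rows as in the question, built compactly
df = pd.DataFrame({
    'Year': [2020] * 6 + [2021] * 6 + [2022] * 6,
    'Month': (['Jan'] * 3 + ['Feb'] * 3) * 3,
    'Sale ($)': list(range(1, 19)),
})

# Sum of sales per month (rows) and year (columns)
table = df.pivot_table(index='Month', columns='Year',
                       values='Sale ($)', aggfunc='sum')

# For each year, the month label where the summed sales peak
best_month = table.idxmax()
print(best_month)
```

If two months tie for the maximum in a year, idxmax() reports only the first one it encounters.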

Sorting list that includes dates as strings

I have the following list:
data = ['Sep 14, 2020', 'Sep 10, 2020', 'Sep 6, 2020', 'Aug 28, 2020',
'Aug 25, 2020', 'Aug 31, 2020', 'Aug 30, 2020', 'Aug 17, 2020',
'Nov 12, 2020', 'Dec 3, 2020', 'Dec 17, 2020', 'Dec 28, 2020', 'Dec 31, 2020']
I've tried looping over the list and moving an element behind another one whenever its date is later, but I cannot get it working.
I want to sort the list from the first month (Jan) to the last month (Dec), but I don't even know where to start.
The elements in data are strings, therefore if you run sort on the list, it will be sorted alphabetically.
You can tell the sort function to treat them as dates. To do this, convert them with datetime.strptime, for example datetime.strptime(element, '%b %d, %Y') (see the strptime documentation for the format codes).
So your sort function becomes:
from datetime import datetime
data = ['Sep 14, 2020', 'Sep 10, 2020', 'Sep 6, 2020', 'Aug 28, 2020', 'Aug 25, 2020', 'Aug 31, 2020', 'Aug 30, 2020', 'Aug 17, 2020', 'Nov 12, 2020', 'Dec 3, 2020', 'Dec 17, 2020', 'Dec 28, 2020', 'Dec 31, 2020']
data.sort(key=lambda date: datetime.strptime(date, '%b %d, %Y'))
print(data)
Outputs:
['Aug 17, 2020', 'Aug 25, 2020', 'Aug 28, 2020', 'Aug 30, 2020', 'Aug 31, 2020', 'Sep 6, 2020', 'Sep 10, 2020', 'Sep 14, 2020', 'Nov 12, 2020', 'Dec 3, 2020', 'Dec 17, 2020', 'Dec 28, 2020', 'Dec 31, 2020']
You can convert your strings into datetime objects and then sort them:
from datetime import datetime
data = ['Sep 14, 2020', 'Sep 10, 2020', 'Sep 6, 2020', 'Aug 28, 2020', 'Aug 25, 2020', 'Aug 31, 2020', 'Aug 30, 2020', 'Aug 17, 2020', 'Nov 12, 2020', 'Dec 3, 2020', 'Dec 17, 2020', 'Dec 28, 2020', 'Dec 31, 2020']
dates = list(map(lambda time_str: datetime.strptime(time_str, "%b %d, %Y"), data))
dates.sort()
print(dates)
# or if you want them as strings
print(list(map(lambda x: x.strftime("%b %d, %Y"), dates)))
What you had:
data = ['Sep 14, 2020', 'Sep 10, 2020', 'Sep 6, 2020', 'Aug 28, 2020', 'Aug 25, 2020',
'Aug 31, 2020', 'Aug 30, 2020', 'Aug 17, 2020', 'Nov 12, 2020', 'Dec 3, 2020',
'Dec 17, 2020', 'Dec 28, 2020', 'Dec 31, 2020']
What you need to add:
import datetime
r = lambda x: datetime.datetime.strptime(x, '%b %d, %Y')
data.sort(key=r)
Result:
['Aug 17, 2020',
'Aug 25, 2020',
'Aug 28, 2020',
'Aug 30, 2020',
'Aug 31, 2020',
'Sep 6, 2020',
'Sep 10, 2020',
'Sep 14, 2020',
'Nov 12, 2020',
'Dec 3, 2020',
'Dec 17, 2020',
'Dec 28, 2020',
'Dec 31, 2020']
You can pass a key argument to the sort method which will tell Python how to sort the items. For example, suppose I wanted to sort a list of strings based on the alphabetical order of their last letters.
words = ['alfa', 'bravo', 'charlie']
words.sort(key=lambda x: x[-1])
print(words)
Output:
['alfa', 'charlie', 'bravo']
So, you would need to write a function to pass as the key to tell when one date string is "less than" another. This would work:
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
# Strip the trailing comma from the day ('14,') before converting it to int
data.sort(key=lambda x: months.index(x.split()[0]) * 100 + int(x.split()[1].rstrip(',')))

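The month-and-day key above ignores the year, which is fine for the list in the question because every entry is from 2020. For data spanning several years, a tuple key is one possible extension. This is only a sketch: the helper name `date_key` and the sample list are made up for illustration.

```python
MONTHS = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']


def date_key(text):
    """Turn 'Sep 14, 2020' into a sortable (year, month index, day) tuple."""
    month, day, year = text.replace(',', '').split()
    return (int(year), MONTHS.index(month), int(day))


data = ['Sep 14, 2020', 'Jan 5, 2021', 'Aug 28, 2020', 'Dec 3, 2019']
data.sort(key=date_key)
print(data)  # ['Dec 3, 2019', 'Aug 28, 2020', 'Sep 14, 2020', 'Jan 5, 2021']
```

Tuples compare element by element, so the year decides first, then the month, then the day.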
How can I sort dates that are in String? (python)

I've only tried
from datetime import datetime
my_dates = ['5-Nov-18', '25-Mar-17', '1-Nov-18', '7 Mar 17']
my_dates.sort(key=lambda date: datetime.strptime(date, "%d-%b-%y"))
print(my_dates)
But how can I make this work for date formats like
my_dates = ['05 Nov 2018', '25 Mar 2017', '01 Nov 2018', '07 Mar 2017']
One inelegant solution which comes to mind is replacing all spaces with dashes as shown below:
from datetime import datetime
my_dates = ['5-Nov-18', '25-Mar-17', '1-Nov-18', '7 Mar 17']
my_dates.sort(key=lambda date: datetime.strptime(date.replace(' ', '-'), "%d-%b-%y"))
print(my_dates)
If all you need to handle is the specific set of dates you have provided, just change the format in strptime():
my_dates = ['05 Nov 2018', '25 Mar 2017', '01 Nov 2018', '07 Mar 2017']
my_dates.sort(key=lambda date: datetime.strptime(date, "%d %b %Y"))
# Dashes replaced with spaces, and the last %y changed to %Y
However, if you have varying date formats in your list, you could prepare a list of possibly anticipated string formats and define your own function to parse against each format:
def func(date, formats):
    for frmt in formats:
        try:
            str_date = datetime.strptime(date, frmt)
            return str_date
        except ValueError:
            continue
    # might want to consider handling a scenario when nothing is returned
my_formats = ['%d-%b-%y', '%d %b %y', '%d %b %Y']
my_dates = ['5-Nov-18', '25-Mar-17', '1-Nov-18', '7 Mar 17', '05 Nov 2018', '25 Mar 2017', '01 Nov 2018', '07 Mar 2017']
my_dates.sort(key=lambda date: func(date, my_formats))
print(my_dates)
# ['7 Mar 17', '07 Mar 2017', '25-Mar-17', '25 Mar 2017', '1-Nov-18', '01 Nov 2018', '5-Nov-18', '05 Nov 2018']
The caveat here is that if an unanticipated date format shows up, the function returns None. In Python 3 the sort then fails with a TypeError, because None cannot be compared with datetime objects. If that's a concern, you might want to add some handling at the end of func() for the case where all parsing attempts failed. Some devs might also say to avoid try...except..., but I can only come up with this way.
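One way to handle the "nothing matched" case is to raise an explicit error naming the offending string, so the failure points at the bad input rather than at the comparison inside sort. This is a sketch only; `parse_date` is a hypothetical variant of func() above.

```python
from datetime import datetime


def parse_date(date, formats):
    """Try each format in turn and fail loudly if none of them fits."""
    for fmt in formats:
        try:
            return datetime.strptime(date, fmt)
        except ValueError:
            continue
    # Reached only when every format was rejected
    raise ValueError(f'no known format matches {date!r}')


my_formats = ['%d-%b-%y', '%d %b %y', '%d %b %Y']
my_dates = ['5-Nov-18', '7 Mar 17', '25 Mar 2017']
my_dates.sort(key=lambda date: parse_date(date, my_formats))
print(my_dates)  # ['7 Mar 17', '25 Mar 2017', '5-Nov-18']
```

Whether to raise, skip the entry, or sort unknown values to the end depends on how trustworthy the input is; raising is simply the strictest choice.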
Dates can be sorted once they are datetime objects:
from datetime import datetime as dt
my_dates = ['05 Nov 2018', '25 Mar 2017', '01 Nov 2018', '07 Mar 2017']
my_dates = [ dt.strptime(date, '%d %b %Y') for date in my_dates ]
print(my_dates)
# [datetime.datetime(2018, 11, 5, 0, 0), datetime.datetime(2017, 3, 25, 0, 0), datetime.datetime(2018, 11, 1, 0, 0), datetime.datetime(2017, 3, 7, 0, 0)]
my_dates.sort()
print(my_dates)
# [datetime.datetime(2017, 3, 7, 0, 0), datetime.datetime(2017, 3, 25, 0, 0), datetime.datetime(2018, 11, 1, 0, 0), datetime.datetime(2018, 11, 5, 0, 0)]
my_dates = [ date.strftime('%d %b %Y') for date in my_dates ]
print(my_dates)
# ['07 Mar 2017', '25 Mar 2017', '01 Nov 2018', '05 Nov 2018']

Extract dates in different formats from string using regex in python

I need to extract dates from strings using regex in python and the dates can be in one of many formats, and between some random text.
The date formats are:
04/20/2009; 04/20/09; 4/20/09; 4/3/09
Mar-20-2009; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009;
20 Mar 2009; 20 March 2009; 20 Mar. 2009; 20 March, 2009
Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009
Feb 2009; Sep 2009; Oct 2010
6/2008; 12/2009
2009; 2010
After extracting the dates, I need to sort them in ascending order.
I've tried these six regex patterns, but they don't seem to cover every case.
pattern1 = r'((?:\d{1,2}[- ,./]*)(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*[- ,./]*\d{4})'
pattern2 = r'((?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*[ ,./-]*\d{1,2}[ ,./-]*\d{4})'
pattern3 = r'((?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*[ ,./-]*\d{4})'
pattern4 = r'((?:\d{1,2}[/-]\d{1,2}[/-](?:\d{4}|\d{2})))'
pattern5 = r'(?:(\s\d{2}[/-](?:\d{4})))'
pattern6 = r'(?:\d{4})'
It might be useful to set up some intermediate variables.
import re
short_month_names = (
    'Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
    'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'
)
long_month_names = (
    'January', 'February', 'March', 'April', 'May', 'June', 'July',
    'August', 'September', 'October', 'November', 'December'
)
short_month_cap = '(?:' + '|'.join(short_month_names) + ')'
long_month_cap = '(?:' + '|'.join(long_month_names) + ')'
# Numeric months: 1[0-2] covers October through December
short_num_month_cap = '(?:[1-9]|1[0-2])'
long_num_month_cap = '(?:0[1-9]|1[0-2])'
long_day_cap = '(?:0[1-9]|[12][0-9]|3[01])'
short_day_cap = '(?:[1-9]|[12][0-9]|3[01])'
long_year_cap = '(?:[0-9]{3}[1-9]|[0-9]{2}[1-9][0-9]|[0-9][1-9][0-9]{2}|[1-9][0-9]{3})'
short_year_cap = '(?:[0-9][0-9])'
ordinal_day = '(?:2?1st|2?2nd|2?3rd|[12]?[4-9]th|1[123]th|[123]0th|31st)'
formats = (
    r'(?P<month_0>{lnm}|{snm})/(?P<day_0>{ld}|{sd})/(?P<year_0>{sy}|{ly})',
    r'(?P<month_1>{sm})\-(?P<day_1>{ld}|{sd})\-(?P<year_1>{ly})',
    r'(?P<month_2>{sm}|{lm})(?:\.\s+|\s*)(?P<day_2>{ld}|{sd})(?:,\s+|\s*)(?P<year_2>{ly})',
    r'(?P<day_3>{ld}|{sd})(?:[\.,]\s+|\s*)(?P<month_3>{lm}|{sm})(?:[\.,]\s+|\s*)(?P<year_3>{ly})',
    r'(?P<month_4>{lm}|{sm})\s+(?P<year_4>{ly})',
    r'(?P<month_5>{lnm}|{snm})/(?P<year_5>{ly})',
    r'(?P<year_6>{ly})',
    r'(?P<month_6>{sm})\s+(?P<day_4>(?={od})[0-9][0-9]?)..,\s*(?P<year_7>{ly})'
)
_pattern = '|'.join(
    i.format(
        sm=short_month_cap, lm=long_month_cap, snm=short_num_month_cap,
        lnm=long_num_month_cap, ld=long_day_cap, sd=short_day_cap,
        ly=long_year_cap, sy=short_year_cap, od=ordinal_day
    ) for i in formats
)
pattern = re.compile(_pattern)
def get_fields(match):
    if not match:
        return None
    # Strip the '_N' suffix so every format reports plain month/day/year keys
    return {
        k[:-2]: v
        for k, v in match.groupdict().items()
        if v is not None
    }
tests = r'''04/20/2009; 04/20/09; 4/20/09; 4/3/09
Mar-20-2009; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009
20 Mar 2009; 20 March 2009; 20 Mar. 2009; 20 March, 2009
Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009
Feb 2009; Sep 2009; Oct 2010
6/2008; 12/2009
2009; 2010'''
for test_line in tests.split('\n'):
    for test in test_line.split('; '):
        print('{!r}: {!r}'.format(test, get_fields(pattern.fullmatch(test))))
    print('')
Which outputs:
'04/20/2009': {'month': '04', 'day': '20', 'year': '2009'}
'04/20/09': {'month': '04', 'day': '20', 'year': '09'}
'4/20/09': {'month': '4', 'day': '20', 'year': '09'}
'4/3/09': {'month': '4', 'day': '3', 'year': '09'}
'Mar-20-2009': {'month': 'Mar', 'day': '20', 'year': '2009'}
'Mar 20, 2009': {'month': 'Mar', 'day': '20', 'year': '2009'}
'March 20, 2009': {'month': 'March', 'day': '20', 'year': '2009'}
'Mar. 20, 2009': {'month': 'Mar', 'day': '20', 'year': '2009'}
'Mar 20 2009': {'month': 'Mar', 'day': '20', 'year': '2009'}
'20 Mar 2009': {'day': '20', 'month': 'Mar', 'year': '2009'}
'20 March 2009': {'day': '20', 'month': 'March', 'year': '2009'}
'20 Mar. 2009': {'day': '20', 'month': 'Mar', 'year': '2009'}
'20 March, 2009': {'day': '20', 'month': 'March', 'year': '2009'}
'Mar 20th, 2009': {'month': 'Mar', 'day': '20', 'year': '2009'}
'Mar 21st, 2009': {'month': 'Mar', 'day': '21', 'year': '2009'}
'Mar 22nd, 2009': {'month': 'Mar', 'day': '22', 'year': '2009'}
'Feb 2009': {'month': 'Feb', 'year': '2009'}
'Sep 2009': {'month': 'Sep', 'year': '2009'}
'Oct 2010': {'month': 'Oct', 'year': '2010'}
'6/2008': {'month': '6', 'year': '2008'}
'12/2009': {'month': '12', 'year': '2009'}
'2009': {'year': '2009'}
'2010': {'year': '2010'}
The main part is the formats variable, where all the different formats are defined. It matches slightly more than what is defined, and can easily be extended.
The overall pattern ends up being:
'(?P<month_0>(?:0[1-9]|1[0-2])|(?:[1-9]|1[0-2]))/(?P<day_0>(?:0[1-9]|[12][0-9]|3[01])|(?:[1-9]|[12][0-9]|3[01]))/(?P<year_0>(?:[0-9][0-9])|(?:[0-9]{3}[1-9]|[0-9]{2}[1-9][0-9]|[0-9][1-9][0-9]{2}|[1-9][0-9]{3}))|(?P<month_1>(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec))\\-(?P<day_1>(?:0[1-9]|[12][0-9]|3[01])|(?:[1-9]|[12][0-9]|3[01]))\\-(?P<year_1>(?:[0-9]{3}[1-9]|[0-9]{2}[1-9][0-9]|[0-9][1-9][0-9]{2}|[1-9][0-9]{3}))|(?P<month_2>(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)|(?:January|February|March|April|May|June|July|August|September|October|November|December))(?:\\.\\s+|\\s*)(?P<day_2>(?:0[1-9]|[12][0-9]|3[01])|(?:[1-9]|[12][0-9]|3[01]))(?:,\\s+|\\s*)(?P<year_2>(?:[0-9]{3}[1-9]|[0-9]{2}[1-9][0-9]|[0-9][1-9][0-9]{2}|[1-9][0-9]{3}))|(?P<day_3>(?:0[1-9]|[12][0-9]|3[01])|(?:[1-9]|[12][0-9]|3[01]))(?:[\\.,]\\s+|\\s*)(?P<month_3>(?:January|February|March|April|May|June|July|August|September|October|November|December)|(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec))(?:[\\.,]\\s+|\\s*)(?P<year_3>(?:[0-9]{3}[1-9]|[0-9]{2}[1-9][0-9]|[0-9][1-9][0-9]{2}|[1-9][0-9]{3}))|(?P<month_4>(?:January|February|March|April|May|June|July|August|September|October|November|December)|(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec))\\s+(?P<year_4>(?:[0-9]{3}[1-9]|[0-9]{2}[1-9][0-9]|[0-9][1-9][0-9]{2}|[1-9][0-9]{3}))|(?P<month_5>(?:0[1-9]|1[0-2])|(?:[1-9]|1[0-2]))/(?P<year_5>(?:[0-9]{3}[1-9]|[0-9]{2}[1-9][0-9]|[0-9][1-9][0-9]{2}|[1-9][0-9]{3}))|(?P<year_6>(?:[0-9]{3}[1-9]|[0-9]{2}[1-9][0-9]|[0-9][1-9][0-9]{2}|[1-9][0-9]{3}))|(?P<month_6>(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec))\\s+(?P<day_4>(?=(?:2?1st|2?2nd|2?3rd|[12]?[4-9]th|1[123]th|[123]0th|31st))[0-9][0-9]?)..,\\s*(?P<year_7>(?:[0-9]{3}[1-9]|[0-9]{2}[1-9][0-9]|[0-9][1-9][0-9]{2}|[1-9][0-9]{3}))'
Which would have been virtually impossible to write by hand.
The bounds for the "between random text" can be added around _pattern.
I would suggest _pattern = r'\b(?:{})\b'.format(_pattern).
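The question also asks for the extracted dates to be sorted. One possible sketch is to turn each field dictionary (in the shape printed above) into a (year, month, day) tuple and sort on that. The helper `to_sort_key` is hypothetical, and two choices in it are assumptions rather than anything stated above: a missing month or day defaults to 1, and two-digit years are placed in the 2000s.

```python
MONTHS = ['jan', 'feb', 'mar', 'apr', 'may', 'jun',
          'jul', 'aug', 'sep', 'oct', 'nov', 'dec']


def to_sort_key(fields, century=2000):
    """Map {'month': ..., 'day': ..., 'year': ...} to a sortable tuple.

    Assumptions: missing month/day default to 1; two-digit years
    are offset by `century`.
    """
    year = int(fields['year'])
    if len(fields['year']) == 2:
        year += century
    month_raw = fields.get('month', '1')
    if month_raw.isdigit():
        month = int(month_raw)
    else:
        # 'Mar' and 'March' share their first three letters
        month = MONTHS.index(month_raw[:3].lower()) + 1
    day = int(fields.get('day', '1'))
    return (year, month, day)


parsed = [
    {'month': 'Oct', 'year': '2010'},
    {'month': '04', 'day': '20', 'year': '09'},
    {'year': '2009'},
    {'day': '20', 'month': 'March', 'year': '2009'},
]
parsed.sort(key=to_sort_key)
print([to_sort_key(p) for p in parsed])
# [(2009, 1, 1), (2009, 3, 20), (2009, 4, 20), (2010, 10, 1)]
```

The century offset is the weakest assumption here; if the text can contain dates from the 1900s, a cut-off year would be needed to decide which century a value such as '09' belongs to.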
