I have the following dataset:
df = pd.DataFrame([[2020, 'Jan', 1],
[2020, 'Jan', 2],
[2020, 'Jan', 3],
[2020, 'Feb', 4],
[2020, 'Feb', 5],
[2020, 'Feb', 6],
[2021, 'Jan', 7],
[2021, 'Jan', 8],
[2021, 'Jan', 9],
[2021, 'Feb', 10],
[2021, 'Feb', 11],
[2021, 'Feb', 12],
[2022, 'Jan', 13],
[2022, 'Jan', 14],
[2022, 'Jan', 15],
[2022, 'Feb', 16],
[2022, 'Feb', 17],
[2022, 'Feb', 18]],
columns=['Year', 'Month', 'Sale ($)'])
which shows the sales for different months and years.
Using the pandas.pivot_table function, I can create a pivot table that sums the sales for each month and year.
df.pivot_table(index='Month', columns='Year', aggfunc='sum')
That generates the following summary table:
       2020  2021  2022
Jan       6    24    42
Feb      15    33    51
Here is my question: how can I identify, for each year, the month with the highest sales?
I know that for the years 2020, 2021, and 2022, my highest sales were $15, $33, and $51, respectively.
I can get these values by appending .max() to the code:
df.pivot_table(index='Month', columns='Year',aggfunc='sum').max()
and this returns exactly that:
Year  Sale
2020    15
2021    33
2022    51
In this example, the maximum sales all fall in February. How can I write a function that returns not the maximum value, but the month in which it occurred?
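One possible approach, sketched here rather than taken from the thread, is to replace .max() with .idxmax(), which returns the index label of the maximum in each column instead of the value itself. The sketch below passes values='Sale ($)' explicitly so that the columns are plain years; the variable names pivot and best_month are only illustrative.

```python
import pandas as pd

# The same data as in the question
df = pd.DataFrame(
    [[2020, 'Jan', 1], [2020, 'Jan', 2], [2020, 'Jan', 3],
     [2020, 'Feb', 4], [2020, 'Feb', 5], [2020, 'Feb', 6],
     [2021, 'Jan', 7], [2021, 'Jan', 8], [2021, 'Jan', 9],
     [2021, 'Feb', 10], [2021, 'Feb', 11], [2021, 'Feb', 12],
     [2022, 'Jan', 13], [2022, 'Jan', 14], [2022, 'Jan', 15],
     [2022, 'Feb', 16], [2022, 'Feb', 17], [2022, 'Feb', 18]],
    columns=['Year', 'Month', 'Sale ($)'])

# Months as rows, years as columns, summed sales as values
pivot = df.pivot_table(index='Month', columns='Year',
                       values='Sale ($)', aggfunc='sum')

# For each column (year), idxmax gives the row label (month) of the maximum
best_month = pivot.idxmax()
print(best_month)
```

Here best_month is a Series indexed by year whose values are month names, so it can be read side by side with the result of .max().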
I have the following dataframe:
df = pd.DataFrame(
{
'status': ['open', 'closed', 'open', 'closed', 'open', 'closed', 'open', 'closed'],
'month': ['January 2020', 'January 2020', 'February 2020', 'February 2020', 'April 2020', 'April 2020', 'August 2020', 'August 2020'],
'counts': [10, 12, 32, 12, 19, 40, 10, 11]
}
)
status month counts
0 open January 2020 10
1 closed January 2020 12
2 open February 2020 32
3 closed February 2020 12
4 open April 2020 19
5 closed April 2020 40
6 open August 2020 10
7 closed August 2020 11
I'm trying to get a stacked bar plot using seaborn:
sns.histplot(df, x='month', weights='counts', hue='status', multiple='stack')
The purpose is to get a plot with a continuous timeseries without missing months. How can I fill in the missing rows with values so that the dataframe would look like below?
status month counts
open January 2020 10
closed January 2020 12
open February 2020 32
closed February 2020 12
open March 2020 0
closed March 2020 0
open April 2020 19
closed April 2020 40
open May 2020 0
closed May 2020 0
open June 2020 0
closed June 2020 0
open July 2020 0
closed July 2020 0
open August 2020 10
closed August 2020 11
You could pivot the dataframe, and then reindex with the desired months.
import pandas as pd
df = pd.DataFrame({'status': ['open', 'closed', 'open', 'closed', 'open', 'closed', 'open', 'closed'],
'month': ['January 2020', 'January 2020', 'February 2020', 'February 2020', 'April 2020', 'April 2020', 'August 2020', 'August 2020'],
'counts': [10, 12, 32, 12, 19, 40, 10, 11]})
months = [f'{m} 2020' for m in ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August']]
df_pivoted = df.pivot(values='counts', index='month', columns='status').reindex(months).fillna(0)
ax = df_pivoted.plot.bar(stacked=True, width=1, ec='black', rot=0, figsize=(12, 5))
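If the long format shown in the question is what is needed (one row per month and status, with zeros for the missing months), a sketch along the following lines might work. It assumes the months always follow the 'January 2020' pattern and derives the full range from the first and last month present; the names all_months, full_index and df_filled are made up for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'status': ['open', 'closed'] * 4,
    'month': ['January 2020', 'January 2020', 'February 2020', 'February 2020',
              'April 2020', 'April 2020', 'August 2020', 'August 2020'],
    'counts': [10, 12, 32, 12, 19, 40, 10, 11]})

# Parse the month labels (assumes the '%B %Y' format used in the question)
dates = pd.to_datetime(df['month'], format='%B %Y')

# Every month start between the first and last month, formatted back to text
all_months = pd.date_range(dates.min(), dates.max(), freq='MS').strftime('%B %Y')

# Every (month, status) combination, in calendar order
full_index = pd.MultiIndex.from_product(
    [all_months, ['open', 'closed']], names=['month', 'status'])

# Reindex onto the full grid; combinations that were absent get a count of 0
df_filled = (df.set_index(['month', 'status'])
               .reindex(full_index, fill_value=0)
               .reset_index()[['status', 'month', 'counts']])
print(df_filled)
```

Using fill_value=0 in reindex keeps the counts column as integers, whereas reindexing and then calling fillna(0) would convert it to floats.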
A seaborn solution could use order=. However, that doesn't work with histplot, only with barplot, which doesn't stack bars.
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
df = pd.DataFrame({'status': ['open', 'closed', 'open', 'closed', 'open', 'closed', 'open', 'closed'],
'month': ['January 2020', 'January 2020', 'February 2020', 'February 2020', 'April 2020', 'April 2020', 'August 2020', 'August 2020'],
'counts': [10, 12, 32, 12, 19, 40, 10, 11]})
plt.figure(figsize=(12, 5))
months = [f'{m} 2020' for m in ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August']]
ax = sns.barplot(data=df, x='month', y='counts', hue='status', order=months)
plt.tight_layout()
plt.show()
I currently have a dataframe like this:
Month Day Birthday
0 Jan 31 Yes
1 Apr 30 No
2 Mar 31 Yes
3 June 30 Yes
How would I select the columns dynamically and append them to another list? For example:
d= [Birthday, Month, Day, ...]
xx=[]
for f in range(len(d)):
loan = df[f].values.tolist()
xx.append(loan)
So that xx is the following:
[Yes,No,Yes,Yes,Jan,Apr,Mar,June,...]
Something like this but on a much larger scale.
I keep getting the following error:
KeyError: 0
When I try
for f in range(len(d)):
loan = df[d[f]].values.tolist()
I get
IndexError: list index out of range
Try:
out = df[d].T.values.flatten().tolist()
With df[d], you can reorder the columns according to d. Then transpose the dataframe using T, take the values as an array, flatten the array, and convert it to a list.
Output:
['Yes', 'No', 'Yes', 'Yes', 'Jan', 'Apr', 'Mar', 'June', 31, 30, 31, 30]
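For comparison, the loop from the question can also be made to work. KeyError: 0 appears because df[f] looks for a column labelled 0, while f is only a position in d. A minimal sketch, assuming d holds the column labels as strings and the dataframe is the one shown in the question:

```python
import pandas as pd

df = pd.DataFrame({'Month': ['Jan', 'Apr', 'Mar', 'June'],
                   'Day': [31, 30, 31, 30],
                   'Birthday': ['Yes', 'No', 'Yes', 'Yes']})

d = ['Birthday', 'Month', 'Day']  # column labels, in the desired order

xx = []
for col in d:                     # iterate over the labels, not over positions
    xx.extend(df[col].tolist())   # extend (not append) keeps the list flat

print(xx)
```

This gives the same flat list as the one-liner above; extend is used instead of append so that the result is not a list of lists.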
Try this:
xx = df.T.to_numpy().flatten().tolist()
Output:
>>> xx
['Jan', 'Apr', 'Mar', 'June', 31, 30, 31, 30, 'Yes', 'No', 'Yes', 'Yes']
Or,
xx = df.to_numpy().flatten().tolist()
Output:
>>> xx
['Jan', 31, 'Yes', 'Apr', 30, 'No', 'Mar', 31, 'Yes', 'June', 30, 'Yes']
I want to extract part of a string if it contains an element of a list. For example, I have a string s and a list l1:
s = 'Vipul Singh, Jun 24, 1995'
l1 = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
Now I want to extract the substring 'Jun 24, 1995' from s, since 'Jun' is present in l1.
I tried many regex and str functions, but got no result.
Note: I have many strings of a similar type, such as:
vipul singh, Jan 1, 2017, 10:00,
ANI,May 6, 2009, 14:59 IST,
It looks like you just need to extract the dates, and since they share a common format, this is an easy problem for regular expressions.
Try using [a-zA-Z]{3}\s[0-9]{1,2},\s[0-9]{4}
s = """
Vipul Singh, Jun 24, 1995
vipul singh, Jan 1, 2017, 10:00,
ANI,May 6, 2009, 14:59 IST,
"""
import re
dates = re.findall(r'[a-zA-Z]{3}\s[0-9]{1,2},\s[0-9]{4}', s)
print(dates)
Output:
['Jun 24, 1995', 'Jan 1, 2017', 'May 6, 2009']
If you are concerned about matching something like 'ABC 23, 1111', you could accept only valid month abbreviations as the first 3 letters:
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
dates = re.findall(r'(?:{})\s[0-9]{{1,2}},\s[0-9]{{4}}'.format('|'.join(months)), s)
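If the extracted strings are needed as real dates rather than text, they could be passed on to datetime.strptime. This is only a sketch: the format string '%b %d, %Y' is an assumption that fits the three sample lines, and %b expects English month abbreviations under the default locale.

```python
import re
from datetime import datetime

s = """
Vipul Singh, Jun 24, 1995
vipul singh, Jan 1, 2017, 10:00,
ANI,May 6, 2009, 14:59 IST,
"""

months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

# Doubled braces are literal regex quantifiers; the single {} receives the months
pattern = r'(?:{})\s[0-9]{{1,2}},\s[0-9]{{4}}'.format('|'.join(months))
dates = re.findall(pattern, s)

# Convert each matched substring into a datetime object
parsed = [datetime.strptime(d, '%b %d, %Y') for d in dates]
print(parsed)
```

Parsing also acts as a validity check, since strptime raises ValueError for impossible dates such as 'Feb 31, 2009'.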
I am trying to plot a chart with the 1st and 2nd columns of data as bars and a line overlay for the 3rd column of data.
I have tried the following code, but it creates 2 separate charts, and I would like everything on one chart.
left_2013 = pd.DataFrame({'month': ['jan', 'feb', 'mar', 'apr', 'may', 'jun', 'jul', 'aug', 'sep', 'oct', 'nov', 'dec'],
'2013_val': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 9, 6]})
right_2014 = pd.DataFrame({'month': ['jan', 'feb'], '2014_val': [4, 5]})
right_2014_target = pd.DataFrame({'month': ['jan', 'feb', 'mar', 'apr', 'may', 'jun', 'jul', 'aug', 'sep', 'oct', 'nov', 'dec'],
'2014_target_val': [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]})
df_13_14 = pd.merge(left_2013, right_2014, how='outer')
df_13_14_target = pd.merge(df_13_14, right_2014_target, how='outer')
df_13_14_target[['month','2013_val','2014_val','2014_target_val']].head(12)
plt.figure()
df_13_14_target[['month','2014_target_val']].plot(x='month',linestyle='-', marker='o')
df_13_14_target[['month','2013_val','2014_val']].plot(x='month', kind='bar')
This is what I currently get
The DataFrame plotting methods return a matplotlib AxesSubplot or list of AxesSubplots. (See the docs for plot, or boxplot, for instance.)
You can then pass that same Axes to the next plotting method (using ax=ax) to draw on the same axes:
ax = df_13_14_target[['month','2014_target_val']].plot(x='month',linestyle='-', marker='o')
df_13_14_target[['month','2013_val','2014_val']].plot(x='month', kind='bar',
ax=ax)
import pandas as pd
import matplotlib.pyplot as plt
left_2013 = pd.DataFrame(
{'month': ['jan', 'feb', 'mar', 'apr', 'may', 'jun', 'jul', 'aug', 'sep',
'oct', 'nov', 'dec'],
'2013_val': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 9, 6]})
right_2014 = pd.DataFrame({'month': ['jan', 'feb'], '2014_val': [4, 5]})
right_2014_target = pd.DataFrame(
{'month': ['jan', 'feb', 'mar', 'apr', 'may', 'jun', 'jul', 'aug', 'sep',
'oct', 'nov', 'dec'],
'2014_target_val': [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]})
df_13_14 = pd.merge(left_2013, right_2014, how='outer')
df_13_14_target = pd.merge(df_13_14, right_2014_target, how='outer')
ax = df_13_14_target[['month', '2014_target_val']].plot(
x='month', linestyle='-', marker='o')
df_13_14_target[['month', '2013_val', '2014_val']].plot(x='month', kind='bar',
ax=ax)
plt.show()
I need to extract dates from strings using regex in Python. The dates can be in one of many formats and are embedded in arbitrary text.
The date formats are:
04/20/2009; 04/20/09; 4/20/09; 4/3/09
Mar-20-2009; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009;
20 Mar 2009; 20 March 2009; 20 Mar. 2009; 20 March, 2009
Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009
Feb 2009; Sep 2009; Oct 2010
6/2008; 12/2009
2009; 2010
After extracting the dates, I need to sort them in ascending order.
I've tried the following 6 regex patterns, but they don't seem to cover every case.
pattern1 = r'((?:\d{1,2}[- ,./]*)(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*[- ,./]*\d{4})'
pattern2 = r'((?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*[ ,./-]*\d{1,2}[ ,./-]*\d{4})'
pattern3 = r'((?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*[ ,./-]*\d{4})'
pattern4 = r'((?:\d{1,2}[/-]\d{1,2}[/-](?:\d{4}|\d{2})))'
pattern5 = r'(?:(\s\d{2}[/-](?:\d{4})))'
pattern6 = r'(?:\d{4})'
It might be useful to set up some intermediate variables.
import re
short_month_names = (
'Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'
)
long_month_names = (
'January', 'February', 'March', 'April', 'May', 'June', 'July',
'August', 'September', 'October', 'November', 'December'
)
short_month_cap = '(?:' + '|'.join(short_month_names) + ')'
long_month_cap = '(?:' + '|'.join(long_month_names) + ')'
short_num_month_cap = '(?:[1-9]|1[0-2])'  # 1-12
long_num_month_cap = '(?:0[1-9]|1[0-2])'  # 01-12
long_day_cap = '(?:0[1-9]|[12][0-9]|3[01])'
short_day_cap = '(?:[1-9]|[12][0-9]|3[01])'
long_year_cap = '(?:[0-9]{3}[1-9]|[0-9]{2}[1-9][0-9]|[0-9][1-9][0-9]{2}|[1-9][0-9]{3})'
short_year_cap = '(?:[0-9][0-9])'
ordinal_day = '(?:2?1st|2?2nd|2?3rd|[12]?[4-9]th|1[123]th|[123]0th|31st)'
formats = (
r'(?P<month_0>{lnm}|{snm})/(?P<day_0>{ld}|{sd})/(?P<year_0>{sy}|{ly})',
r'(?P<month_1>{sm})\-(?P<day_1>{ld}|{sd})\-(?P<year_1>{ly})',
r'(?P<month_2>{sm}|{lm})(?:\.\s+|\s*)(?P<day_2>{ld}|{sd})(?:,\s+|\s*)(?P<year_2>{ly})',
r'(?P<day_3>{ld}|{sd})(?:[\.,]\s+|\s*)(?P<month_3>{lm}|{sm})(?:[\.,]\s+|\s*)(?P<year_3>{ly})',
r'(?P<month_4>{lm}|{sm})\s+(?P<year_4>{ly})',
r'(?P<month_5>{lnm}|{snm})/(?P<year_5>{ly})',
r'(?P<year_6>{ly})',
r'(?P<month_6>{sm})\s+(?P<day_4>(?={od})[0-9][0-9]?)..,\s*(?P<year_7>{ly})'
)
_pattern = '|'.join(
i.format(
sm=short_month_cap, lm=long_month_cap, snm=short_num_month_cap,
lnm=long_num_month_cap, ld=long_day_cap, sd=short_day_cap,
ly=long_year_cap, sy=short_year_cap, od=ordinal_day
) for i in formats
)
pattern = re.compile(_pattern)
def get_fields(match):
if not match:
return None
return {
k[:-2]: v
for k, v in match.groupdict().items()
if v is not None
}
tests = r'''04/20/2009; 04/20/09; 4/20/09; 4/3/09
Mar-20-2009; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009
20 Mar 2009; 20 March 2009; 20 Mar. 2009; 20 March, 2009
Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009
Feb 2009; Sep 2009; Oct 2010
6/2008; 12/2009
2009; 2010'''
for test_line in tests.split('\n'):
for test in test_line.split('; '):
print('{!r}: {!r}'.format(test, get_fields(pattern.fullmatch(test))))
print('')
Which outputs:
'04/20/2009': {'month': '04', 'day': '20', 'year': '2009'}
'04/20/09': {'month': '04', 'day': '20', 'year': '09'}
'4/20/09': {'month': '4', 'day': '20', 'year': '09'}
'4/3/09': {'month': '4', 'day': '3', 'year': '09'}
'Mar-20-2009': {'month': 'Mar', 'day': '20', 'year': '2009'}
'Mar 20, 2009': {'month': 'Mar', 'day': '20', 'year': '2009'}
'March 20, 2009': {'month': 'March', 'day': '20', 'year': '2009'}
'Mar. 20, 2009': {'month': 'Mar', 'day': '20', 'year': '2009'}
'Mar 20 2009': {'month': 'Mar', 'day': '20', 'year': '2009'}
'20 Mar 2009': {'day': '20', 'month': 'Mar', 'year': '2009'}
'20 March 2009': {'day': '20', 'month': 'March', 'year': '2009'}
'20 Mar. 2009': {'day': '20', 'month': 'Mar', 'year': '2009'}
'20 March, 2009': {'day': '20', 'month': 'March', 'year': '2009'}
'Mar 20th, 2009': {'month': 'Mar', 'day': '20', 'year': '2009'}
'Mar 21st, 2009': {'month': 'Mar', 'day': '21', 'year': '2009'}
'Mar 22nd, 2009': {'month': 'Mar', 'day': '22', 'year': '2009'}
'Feb 2009': {'month': 'Feb', 'year': '2009'}
'Sep 2009': {'month': 'Sep', 'year': '2009'}
'Oct 2010': {'month': 'Oct', 'year': '2010'}
'6/2008': {'month': '6', 'year': '2008'}
'12/2009': {'month': '12', 'year': '2009'}
'2009': {'year': '2009'}
'2010': {'year': '2010'}
The main part is the formats variable, where all the different formats are defined. It matches slightly more than what is defined, and can easily be extended.
The overall pattern ends up being:
'(?P<month_0>(?:0[1-9]|1[0-2])|(?:[1-9]|1[0-2]))/(?P<day_0>(?:0[1-9]|[12][0-9]|3[01])|(?:[1-9]|[12][0-9]|3[01]))/(?P<year_0>(?:[0-9][0-9])|(?:[0-9]{3}[1-9]|[0-9]{2}[1-9][0-9]|[0-9][1-9][0-9]{2}|[1-9][0-9]{3}))|(?P<month_1>(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec))\\-(?P<day_1>(?:0[1-9]|[12][0-9]|3[01])|(?:[1-9]|[12][0-9]|3[01]))\\-(?P<year_1>(?:[0-9]{3}[1-9]|[0-9]{2}[1-9][0-9]|[0-9][1-9][0-9]{2}|[1-9][0-9]{3}))|(?P<month_2>(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)|(?:January|February|March|April|May|June|July|August|September|October|November|December))(?:\\.\\s+|\\s*)(?P<day_2>(?:0[1-9]|[12][0-9]|3[01])|(?:[1-9]|[12][0-9]|3[01]))(?:,\\s+|\\s*)(?P<year_2>(?:[0-9]{3}[1-9]|[0-9]{2}[1-9][0-9]|[0-9][1-9][0-9]{2}|[1-9][0-9]{3}))|(?P<day_3>(?:0[1-9]|[12][0-9]|3[01])|(?:[1-9]|[12][0-9]|3[01]))(?:[\\.,]\\s+|\\s*)(?P<month_3>(?:January|February|March|April|May|June|July|August|September|October|November|December)|(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec))(?:[\\.,]\\s+|\\s*)(?P<year_3>(?:[0-9]{3}[1-9]|[0-9]{2}[1-9][0-9]|[0-9][1-9][0-9]{2}|[1-9][0-9]{3}))|(?P<month_4>(?:January|February|March|April|May|June|July|August|September|October|November|December)|(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec))\\s+(?P<year_4>(?:[0-9]{3}[1-9]|[0-9]{2}[1-9][0-9]|[0-9][1-9][0-9]{2}|[1-9][0-9]{3}))|(?P<month_5>(?:0[1-9]|1[0-2])|(?:[1-9]|1[0-2]))/(?P<year_5>(?:[0-9]{3}[1-9]|[0-9]{2}[1-9][0-9]|[0-9][1-9][0-9]{2}|[1-9][0-9]{3}))|(?P<year_6>(?:[0-9]{3}[1-9]|[0-9]{2}[1-9][0-9]|[0-9][1-9][0-9]{2}|[1-9][0-9]{3}))|(?P<month_6>(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec))\\s+(?P<day_4>(?=(?:2?1st|2?2nd|2?3rd|[12]?[4-9]th|1[123]th|[123]0th|31st))[0-9][0-9]?)..,\\s*(?P<year_7>(?:[0-9]{3}[1-9]|[0-9]{2}[1-9][0-9]|[0-9][1-9][0-9]{2}|[1-9][0-9]{3}))'
Which would have been virtually impossible to write by hand.
The bounds for the "between random text" can be added around _pattern.
I would suggest _pattern = r'\b(?:{})\b'.format(_pattern).
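The question also asks for the dates in ascending order. Once each match has been reduced to a dict of fields like the ones get_fields returns, one way to sort them is to build a (year, month, day) tuple as the key. The sketch below is only illustrative: to_sort_key is a hypothetical helper, a missing month or day defaults to 1, and two-digit years are assumed to belong to the 1900s, which may not suit every data set.

```python
MONTH_ABBREVIATIONS = ('jan', 'feb', 'mar', 'apr', 'may', 'jun',
                       'jul', 'aug', 'sep', 'oct', 'nov', 'dec')


def to_sort_key(fields):
    """Turn a dict of date fields into a sortable (year, month, day) tuple."""
    year = int(fields['year'])
    if year < 100:
        year += 1900  # assumption: two-digit years mean 19xx

    month = fields.get('month', '1')  # assumption: missing month -> January
    if month.isdigit():
        month_number = int(month)
    else:
        # 'Mar' and 'March' both reduce to 'mar'
        month_number = MONTH_ABBREVIATIONS.index(month[:3].lower()) + 1

    day = int(fields.get('day', 1))   # assumption: missing day -> the 1st
    return (year, month_number, day)


# A few dicts in the shape shown in the output above
parsed = [
    {'month': 'Mar', 'day': '20', 'year': '2009'},
    {'month': '6', 'year': '2008'},
    {'year': '2010'},
    {'month': '04', 'day': '20', 'year': '09'},
]

ordered = sorted(parsed, key=to_sort_key)
for fields in ordered:
    print(to_sort_key(fields), fields)
```

Tuples compare element by element, so sorting on (year, month, day) gives chronological order without constructing datetime objects.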