Out of all the months in the year, I need to code the month with largest total balance (it's June as all together June has the biggest "amount" value)
lst = [
{'account': 'x\\*', 'amount': 300, 'day': 3, 'month': 'June'},
{'account': 'y\\*', 'amount': 550, 'day': 9, 'month': 'May'},
{'account': 'z\\*', 'amount': -200, 'day': 21, 'month': 'June'},
{'account': 'g', 'amount': 80, 'day': 10, 'month': 'May'},
{'account': 'x\\*', 'amount': 30, 'day': 16, 'month': 'August'},
{'account': 'x\\*', 'amount': 100, 'day': 5, 'month': 'June'},
]
The problem is that both "amount" and the name of the months are values.
I tried to find the total for each month, but I need to use a for loop to find the month with the highest total "amount".
My attempt:
get_sum = lambda my_list, month: sum(d['amount']
    for d in my_list if d['month'] == month)
total_June = get_sum(lst, 'June')
total_August = get_sum(lst, 'August')
A simple solution with pandas.
import pandas as pd
lst = [
{'account': 'x\\*', 'amount': 300, 'day': 3, 'month': 'June'},
{'account': 'y\\*', 'amount': 550, 'day': 9, 'month': 'May'},
{'account': 'z\\*', 'amount': -200, 'day': 21, 'month': 'June'},
{'account': 'g', 'amount': 80, 'day': 10, 'month': 'May'},
{'account': 'x\\*', 'amount': 30, 'day': 16, 'month': 'August'},
{'account': 'x\\*', 'amount': 100, 'day': 5, 'month': 'June'},
]
# convert list of dictionaries to dataframe
df = pd.DataFrame(lst)
# Get the row / series that has max amount.
# idxmax returns an index for loc.
max_series_by_amount = df.loc[df['amount'].idxmax(axis="index")]
# Get only month and amount in a plain list
print(max_series_by_amount[["month", "amount"]].tolist())
['May', 550]
Please note that using pandas adds a substantial number of dependencies to the project. That said, pandas is commonly imported anyway for data science or data manipulation tasks. Pierre D's solutions here are definitely faster.
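Note that idxmax on the raw amount column picks the single largest row, not the largest monthly total. If the per-month total is what is needed, a minimal sketch (assuming the same lst, trimmed here to the two fields used) could group by month first:

```python
import pandas as pd

# Same records as in the question, reduced to the fields this sketch needs.
lst = [
    {'amount': 300, 'month': 'June'},
    {'amount': 550, 'month': 'May'},
    {'amount': -200, 'month': 'June'},
    {'amount': 80, 'month': 'May'},
    {'amount': 30, 'month': 'August'},
    {'amount': 100, 'month': 'June'},
]

df = pd.DataFrame(lst)

# Sum the amounts within each month, giving a Series indexed by month name.
totals = df.groupby('month')['amount'].sum()

# idxmax returns the index label (the month) holding the largest total.
best_month = totals.idxmax()
best_total = int(totals.max())
print(best_month, best_total)
```

With this data the totals are June 200, May 630 and August 30, so the sketch agrees with the pure-Python answers.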
One possibility (among many):
from itertools import groupby
from operator import itemgetter
mo_total = {
    k: sum([d.get('amount', 0) for d in v])
    for k, v in groupby(sorted(lst, key=itemgetter('month')), key=itemgetter('month'))
}
>>> mo_total
{'August': 30, 'June': 200, 'May': 630}
>>> max(mo_total.items(), key=lambda kv: kv[1])
('May', 630)
Without itemgetter:
bymonth = lambda d: d.get('month')
mo_total = {
    k: sum([d.get('amount', 0) for d in v])
    for k, v in groupby(sorted(lst, key=bymonth), key=bymonth)
}
Yet another way, using defaultdict:
from collections import defaultdict
tot = defaultdict(int)
for d in lst:
    tot[d['month']] += d.get('amount', 0)
>>> tot
defaultdict(int, {'June': 200, 'May': 630, 'August': 30})
>>> max(tot, key=lambda k: tot[k])
'May'
I have json format like this
{
"2015": [
{
"DayofWeek": 4,
"Date": "2015-02-06 00:00:00",
"Year": 2015,
"y": 43.2,
"x": 10.397
}
],
"2016": [
{
"DayofWeek": 4,
"Date": "2016-02-06 00:00:00",
"Year": 2016,
"y": 43.2,
"x": 10.397,
"Minute": 0
}
],
"2017": [
{
"DayofWeek": 4,
"Date": "2017-02-06 00:00:00",
"Year": 2017,
"y": 43.2,
"x": 10.397,
"Minute": 0
}
]
}
I am reading the JSON file like this and then converting it to a data frame:
import json
import pandas as pd
with open('sample.json') as json_data:
    data = json.load(json_data)
df = pd.DataFrame([data])
Now I want to filter the data based on certain input key values, such as DayofWeek and Year.
Example:
Case1:
If the input value is DayofWeek=4, I want to select all objects that have DayofWeek=4.
Case2:
If the input values are DayofWeek=4 and Year=2017, I want to select the 2017 data from the JSON that has DayofWeek=4.
I have tried this code, but it is not working:
filteredVal=df['2017']
filter_v={'2015':{'DayofYear':4}}
pd.Series(filter_v)
The problem is that your JSON values are lists of dicts:
data
>>
{'2015': [{'DayofWeek': 4,
'Date': '2015-02-06 00:00:00',
'Year': 2015,
'y': 43.2,
'x': 10.397}],
'2016': [{'DayofWeek': 4,
'Date': '2016-02-06 00:00:00',
'Year': 2016,
'y': 43.2,
'x': 10.397,
'Minute': 0}],
'2017': [{'DayofWeek': 4,
'Date': '2017-02-06 00:00:00',
'Year': 2017,
'y': 43.2,
'x': 10.397,
'Minute': 0}]}
...pandas cannot process this (as far as I know).
But if every list contains just 1 element, you can convert it:
data_dict = {d: data[d][0] for d in data}
data_dict
>>
{'2015': {'DayofWeek': 4,
'Date': '2015-02-06 00:00:00',
'Year': 2015,
'y': 43.2,
'x': 10.397},
'2016': {'DayofWeek': 4,
'Date': '2016-02-06 00:00:00',
'Year': 2016,
'y': 43.2,
'x': 10.397,
'Minute': 0},
'2017': {'DayofWeek': 4,
'Date': '2017-02-06 00:00:00',
'Year': 2017,
'y': 43.2,
'x': 10.397,
'Minute': 0}}
Now you can make a DataFrame of it, with the index orientation:
df=pd.DataFrame.from_dict(data_dict, orient='index')
df
And access your elements:
Case1:
df[df['DayofWeek']==4]
Case2:
df[(df['DayofWeek']==4) & (df['Year']==2017)]
EDIT
If you have multiple elements inside the list, you can just create a list of all entries:
data_list = [v for d in data for v in data[d]]
df = pd.DataFrame(data_list)
Since you have a Year column, you probably don't even need the json-/dict-key, so I just skipped it. :-)
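The two cases can also be wrapped in a small helper. The sketch below assumes the flattened-list approach above and inlines sample data shaped like the question's JSON; filter_rows is a hypothetical name, not a pandas function:

```python
import pandas as pd

# Sample shaped like the question's JSON: year keys mapping to lists of records.
data = {
    "2015": [{"DayofWeek": 4, "Date": "2015-02-06 00:00:00", "Year": 2015,
              "y": 43.2, "x": 10.397}],
    "2016": [{"DayofWeek": 4, "Date": "2016-02-06 00:00:00", "Year": 2016,
              "y": 43.2, "x": 10.397, "Minute": 0}],
    "2017": [{"DayofWeek": 4, "Date": "2017-02-06 00:00:00", "Year": 2017,
              "y": 43.2, "x": 10.397, "Minute": 0}],
}

# Flatten every list into one list of records, one DataFrame row per record.
df = pd.DataFrame([record for records in data.values() for record in records])


def filter_rows(frame, **criteria):
    """Keep the rows whose columns equal every given keyword value."""
    mask = pd.Series(True, index=frame.index)
    for column, value in criteria.items():
        mask &= frame[column] == value
    return frame[mask]


case1 = filter_rows(df, DayofWeek=4)
case2 = filter_rows(df, DayofWeek=4, Year=2017)
print(len(case1), len(case2))
```

Passing the criteria as keyword arguments keeps Case1 and Case2 as the same call with a different number of conditions.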
You can use list comprehension like this:
[data[x] for x in data if data[x][0]['DayofWeek'] == 4 and data[x][0]['Year'] == 2017]
This will give you a list of dictionary entries. If you want a filtered dictionary (to convert to a DataFrame), you can instead do something like this:
filtered_data = {}
filtered_data.update([(x, data[x]) for x in data if data[x][0]['DayofWeek'] == 4 and data[x][0]['Year'] == 2017])
I have a JSON response (sample below) that I'm trying to convert into a DataFrame. I've had several issues with the data being listed as columns (1 x 346), etc. I only need the 5 columns listed below:
area_name,
date,
month,
unemployment_rate,
year
Here's my code:
edd_ca_df = pd.DataFrame.from_dict(edd_ca, orient="index",
columns=["area_name", "month", "date", "year", "unemployment_rate"])
and here's a sample of the JSON response:
[[{'area_name': 'California',
'area_type': 'State',
'date': '1990-01-01T00:00:00.000',
'employment': '14099700',
'labor_force': '14953900',
'month': 'January',
'seasonally_adjusted_y_n': 'N',
'status_preliminary_final': 'Final',
'unemployment': '854200',
'unemployment_rate': '5.7',
'year': '1990'},
{'area_name': 'California',
'area_type': 'State',
'date': '1990-02-01T00:00:00.000',
'employment': '14206700',
'labor_force': '15049400',
'month': 'February',
'seasonally_adjusted_y_n': 'N',
'status_preliminary_final': 'Final',
'unemployment': '842800',
'unemployment_rate': '5.6',
'year': '1990'},
Any help would be greatly appreciated.
Since you have a list of dictionaries, this is as simple as passing all the data to a new DataFrame and specifying what columns you want to keep:
import pandas as pd
all_data = [{'area_name': 'California',
'area_type': 'State',
'date': '1990-01-01T00:00:00.000',
'employment': '14099700',
'labor_force': '14953900',
'month': 'January',
'seasonally_adjusted_y_n': 'N',
'status_preliminary_final': 'Final',
'unemployment': '854200',
'unemployment_rate': '5.7',
'year': '1990'},
{'area_name': 'California',
'area_type': 'State',
'date': '1990-02-01T00:00:00.000',
'employment': '14206700',
'labor_force': '15049400',
'month': 'February',
'seasonally_adjusted_y_n': 'N',
'status_preliminary_final': 'Final',
'unemployment': '842800',
'unemployment_rate': '5.6',
'year': '1990'}]
keep_columns = ['area_name','date','month','unemployment_rate','year']
df = pd.DataFrame(columns=keep_columns, data=all_data)
print(df)
Output
    area_name                     date     month  unemployment_rate  year
0  California  1990-01-01T00:00:00.000   January                5.7  1990
1  California  1990-02-01T00:00:00.000  February                5.6  1990
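The sample in the question opens with [[, which suggests the records may be wrapped in an outer list. A sketch under that assumption (the sample is trimmed to a few keys, and the type conversions are an optional extra, not part of the answer above) flattens one level first:

```python
import pandas as pd

# Hypothetical response shaped like the question's sample: a list wrapping
# a list of records. Only some of the original keys are reproduced here.
edd_ca = [[
    {'area_name': 'California', 'area_type': 'State',
     'date': '1990-01-01T00:00:00.000', 'month': 'January',
     'unemployment_rate': '5.7', 'year': '1990'},
    {'area_name': 'California', 'area_type': 'State',
     'date': '1990-02-01T00:00:00.000', 'month': 'February',
     'unemployment_rate': '5.6', 'year': '1990'},
]]

keep_columns = ['area_name', 'date', 'month', 'unemployment_rate', 'year']

# Flatten the outer list so that each dict becomes one row.
records = [record for chunk in edd_ca for record in chunk]
edd_ca_df = pd.DataFrame(records, columns=keep_columns)

# The API returns every value as a string; convert the numeric columns.
edd_ca_df['unemployment_rate'] = edd_ca_df['unemployment_rate'].astype(float)
edd_ca_df['year'] = edd_ca_df['year'].astype(int)
print(edd_ca_df.shape)
```

If the response turns out to be a flat list after all, the flattening line can simply be dropped.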
I need to extract dates from strings using regex in Python. The dates can be in one of many formats and are surrounded by arbitrary text.
The date formats are:
04/20/2009; 04/20/09; 4/20/09; 4/3/09
Mar-20-2009; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009;
20 Mar 2009; 20 March 2009; 20 Mar. 2009; 20 March, 2009
Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009
Feb 2009; Sep 2009; Oct 2010
6/2008; 12/2009
2009; 2010
After extracting the dates, I need to sort them in ascending order.
I've tried to use these 6 regex patterns, but they don't seem to cover every case.
pattern1 = r'((?:\d{1,2}[- ,./]*)(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*[- ,./]*\d{4})'
pattern2 = r'((?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*[ ,./-]*\d{1,2}[ ,./-]*\d{4})'
pattern3 = r'((?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*[ ,./-]*\d{4})'
pattern4 = r'((?:\d{1,2}[/-]\d{1,2}[/-](?:\d{4}|\d{2})))'
pattern5 = r'(?:(\s\d{2}[/-](?:\d{4})))'
pattern6 = r'(?:\d{4})'
It might be useful to set up some intermediate variables.
import re
short_month_names = (
'Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'
)
long_month_names = (
'January', 'February', 'March', 'April', 'May', 'June', 'July',
'August', 'September', 'October', 'November', 'December'
)
short_month_cap = '(?:' + '|'.join(short_month_names) + ')'
long_month_cap = '(?:' + '|'.join(long_month_names) + ')'
short_num_month_cap = '(?:[1-9]|1[0-2])'
long_num_month_cap = '(?:0[1-9]|1[0-2])'
long_day_cap = '(?:0[1-9]|[12][0-9]|3[01])'
short_day_cap = '(?:[1-9]|[12][0-9]|3[01])'
long_year_cap = '(?:[0-9]{3}[1-9]|[0-9]{2}[1-9][0-9]|[0-9][1-9][0-9]{2}|[1-9][0-9]{3})'
short_year_cap = '(?:[0-9][0-9])'
ordinal_day = '(?:2?1st|2?2nd|2?3rd|[12]?[4-9]th|1[123]th|[123]0th|31st)'
formats = (
    r'(?P<month_0>{lnm}|{snm})/(?P<day_0>{ld}|{sd})/(?P<year_0>{sy}|{ly})',
    r'(?P<month_1>{sm})\-(?P<day_1>{ld}|{sd})\-(?P<year_1>{ly})',
    r'(?P<month_2>{sm}|{lm})(?:\.\s+|\s*)(?P<day_2>{ld}|{sd})(?:,\s+|\s*)(?P<year_2>{ly})',
    r'(?P<day_3>{ld}|{sd})(?:[\.,]\s+|\s*)(?P<month_3>{lm}|{sm})(?:[\.,]\s+|\s*)(?P<year_3>{ly})',
    r'(?P<month_4>{lm}|{sm})\s+(?P<year_4>{ly})',
    r'(?P<month_5>{lnm}|{snm})/(?P<year_5>{ly})',
    r'(?P<year_6>{ly})',
    r'(?P<month_6>{sm})\s+(?P<day_4>(?={od})[0-9][0-9]?)..,\s*(?P<year_7>{ly})'
)
_pattern = '|'.join(
    i.format(
        sm=short_month_cap, lm=long_month_cap, snm=short_num_month_cap,
        lnm=long_num_month_cap, ld=long_day_cap, sd=short_day_cap,
        ly=long_year_cap, sy=short_year_cap, od=ordinal_day
    ) for i in formats
)
pattern = re.compile(_pattern)
def get_fields(match):
    if not match:
        return None
    return {
        k[:-2]: v
        for k, v in match.groupdict().items()
        if v is not None
    }
tests = r'''04/20/2009; 04/20/09; 4/20/09; 4/3/09
Mar-20-2009; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009
20 Mar 2009; 20 March 2009; 20 Mar. 2009; 20 March, 2009
Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009
Feb 2009; Sep 2009; Oct 2010
6/2008; 12/2009
2009; 2010'''
for test_line in tests.split('\n'):
    for test in test_line.split('; '):
        print('{!r}: {!r}'.format(test, get_fields(pattern.fullmatch(test))))
    print('')
Which outputs:
'04/20/2009': {'month': '04', 'day': '20', 'year': '2009'}
'04/20/09': {'month': '04', 'day': '20', 'year': '09'}
'4/20/09': {'month': '4', 'day': '20', 'year': '09'}
'4/3/09': {'month': '4', 'day': '3', 'year': '09'}
'Mar-20-2009': {'month': 'Mar', 'day': '20', 'year': '2009'}
'Mar 20, 2009': {'month': 'Mar', 'day': '20', 'year': '2009'}
'March 20, 2009': {'month': 'March', 'day': '20', 'year': '2009'}
'Mar. 20, 2009': {'month': 'Mar', 'day': '20', 'year': '2009'}
'Mar 20 2009': {'month': 'Mar', 'day': '20', 'year': '2009'}
'20 Mar 2009': {'day': '20', 'month': 'Mar', 'year': '2009'}
'20 March 2009': {'day': '20', 'month': 'March', 'year': '2009'}
'20 Mar. 2009': {'day': '20', 'month': 'Mar', 'year': '2009'}
'20 March, 2009': {'day': '20', 'month': 'March', 'year': '2009'}
'Mar 20th, 2009': {'month': 'Mar', 'day': '20', 'year': '2009'}
'Mar 21st, 2009': {'month': 'Mar', 'day': '21', 'year': '2009'}
'Mar 22nd, 2009': {'month': 'Mar', 'day': '22', 'year': '2009'}
'Feb 2009': {'month': 'Feb', 'year': '2009'}
'Sep 2009': {'month': 'Sep', 'year': '2009'}
'Oct 2010': {'month': 'Oct', 'year': '2010'}
'6/2008': {'month': '6', 'year': '2008'}
'12/2009': {'month': '12', 'year': '2009'}
'2009': {'year': '2009'}
'2010': {'year': '2010'}
The main part is the formats variable, where all the different formats are defined. It matches slightly more than what is defined, and can easily be extended.
The overall pattern ends up being:
'(?P<month_0>(?:0[1-9]|1[0-2])|(?:[1-9]|1[0-2]))/(?P<day_0>(?:0[1-9]|[12][0-9]|3[01])|(?:[1-9]|[12][0-9]|3[01]))/(?P<year_0>(?:[0-9][0-9])|(?:[0-9]{3}[1-9]|[0-9]{2}[1-9][0-9]|[0-9][1-9][0-9]{2}|[1-9][0-9]{3}))|(?P<month_1>(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec))\\-(?P<day_1>(?:0[1-9]|[12][0-9]|3[01])|(?:[1-9]|[12][0-9]|3[01]))\\-(?P<year_1>(?:[0-9]{3}[1-9]|[0-9]{2}[1-9][0-9]|[0-9][1-9][0-9]{2}|[1-9][0-9]{3}))|(?P<month_2>(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)|(?:January|February|March|April|May|June|July|August|September|October|November|December))(?:\\.\\s+|\\s*)(?P<day_2>(?:0[1-9]|[12][0-9]|3[01])|(?:[1-9]|[12][0-9]|3[01]))(?:,\\s+|\\s*)(?P<year_2>(?:[0-9]{3}[1-9]|[0-9]{2}[1-9][0-9]|[0-9][1-9][0-9]{2}|[1-9][0-9]{3}))|(?P<day_3>(?:0[1-9]|[12][0-9]|3[01])|(?:[1-9]|[12][0-9]|3[01]))(?:[\\.,]\\s+|\\s*)(?P<month_3>(?:January|February|March|April|May|June|July|August|September|October|November|December)|(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec))(?:[\\.,]\\s+|\\s*)(?P<year_3>(?:[0-9]{3}[1-9]|[0-9]{2}[1-9][0-9]|[0-9][1-9][0-9]{2}|[1-9][0-9]{3}))|(?P<month_4>(?:January|February|March|April|May|June|July|August|September|October|November|December)|(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec))\\s+(?P<year_4>(?:[0-9]{3}[1-9]|[0-9]{2}[1-9][0-9]|[0-9][1-9][0-9]{2}|[1-9][0-9]{3}))|(?P<month_5>(?:0[1-9]|1[0-2])|(?:[1-9]|1[0-2]))/(?P<year_5>(?:[0-9]{3}[1-9]|[0-9]{2}[1-9][0-9]|[0-9][1-9][0-9]{2}|[1-9][0-9]{3}))|(?P<year_6>(?:[0-9]{3}[1-9]|[0-9]{2}[1-9][0-9]|[0-9][1-9][0-9]{2}|[1-9][0-9]{3}))|(?P<month_6>(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec))\\s+(?P<day_4>(?=(?:2?1st|2?2nd|2?3rd|[12]?[4-9]th|1[123]th|[123]0th|31st))[0-9][0-9]?)..,\\s*(?P<year_7>(?:[0-9]{3}[1-9]|[0-9]{2}[1-9][0-9]|[0-9][1-9][0-9]{2}|[1-9][0-9]{3}))'
Which would have been virtually impossible to write by hand.
The bounds for the "between random text" can be added around _pattern.
I would suggest _pattern = r'\b(?:{})\b'.format(_pattern).
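The question also asks for the extracted dates to be sorted in ascending order. One possible sketch takes field dicts of the shape returned by get_fields and turns each into a datetime.date to use as a sort key. The helper to_date and its defaults are assumptions, not part of the answer above: a missing day or month falls back to 1, and a two-digit year is read as 19xx.

```python
from datetime import date

# Map both long and three-letter English month names to their numbers.
MONTH_NUMBERS = {}
for number, name in enumerate(
        ['January', 'February', 'March', 'April', 'May', 'June', 'July',
         'August', 'September', 'October', 'November', 'December'], start=1):
    MONTH_NUMBERS[name] = number
    MONTH_NUMBERS[name[:3]] = number


def to_date(fields):
    """Turn a dict of 'year', 'month', 'day' strings into a date.

    Assumptions: a missing day or month defaults to 1, and a
    two-digit year is interpreted as 19xx.
    """
    year = int(fields['year'])
    if len(fields['year']) == 2:
        year += 1900
    month = fields.get('month', '1')
    month = MONTH_NUMBERS[month] if month in MONTH_NUMBERS else int(month)
    day = int(fields.get('day', '1'))
    return date(year, month, day)


# Dicts in the same shape as the get_fields output shown earlier.
parsed = [
    {'month': 'Mar', 'day': '20', 'year': '2009'},
    {'month': '6', 'year': '2008'},
    {'year': '2010'},
    {'month': '4', 'day': '3', 'year': '09'},
]

ordered = sorted(parsed, key=to_date)
print([to_date(fields).isoformat() for fields in ordered])
```

If two-digit years should map to 20xx instead, only the `year += 1900` line needs to change.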
What is wrong with this code? It returns an empty list.
week = []
for d in week:
    day_num = calendar.weekday(d.year,d.month,d.day)
    day_name = calendar.day_name[day_num]
    daydate = { "day_name":day_name,
                "day":d.day,
                "month":d.month,
                "year":d.year,
              }
    week.append(daydate)
return week
Because the list week is empty initially, the for loop is iterated zero times.
Your week list is set to [] just before the for statement, so the loop has no elements to iterate over. You have to either:
remove the week = [] line if week has already been populated, or
add elements to the list before the loop.
I fixed your code. You probably want to iterate over another variable rather than over week.
import calendar
from datetime import date, timedelta

def generateDays(start_date, weeks):
    days = 7 * weeks
    week = []
    # Iterate over day offsets, not over the (initially empty) result list.
    for day in range(days):
        a_date = start_date + timedelta(days=day)
        day_num = calendar.weekday(a_date.year, a_date.month, a_date.day)
        day_name = calendar.day_name[day_num]
        daydate = {"day_name": day_name,
                   "day": a_date.day,
                   "month": a_date.month,
                   "year": a_date.year,
                   }
        week.append(daydate)
    return week

print(generateDays(date.today(), 2))
Output (when run on 16 June 2021):
[{'day_name': 'Wednesday', 'day': 16, 'month': 6, 'year': 2021}, {'day_name': 'Thursday', 'day': 17, 'month': 6, 'year': 2021}, {'day_name': 'Friday', 'day': 18, 'month': 6, 'year': 2021}, {'day_name': 'Saturday', 'day': 19, 'month': 6, 'year': 2021}, {'day_name': 'Sunday', 'day': 20, 'month': 6, 'year': 2021}, {'day_name': 'Monday', 'day': 21, 'month': 6, 'year': 2021}, {'day_name': 'Tuesday', 'day': 22, 'month': 6, 'year': 2021}, {'day_name': 'Wednesday', 'day': 23, 'month': 6, 'year': 2021}, {'day_name': 'Thursday', 'day': 24, 'month': 6, 'year': 2021}, {'day_name': 'Friday', 'day': 25, 'month': 6, 'year': 2021}, {'day_name': 'Saturday', 'day': 26, 'month': 6, 'year': 2021}, {'day_name': 'Sunday', 'day': 27, 'month': 6, 'year': 2021}, {'day_name': 'Monday', 'day': 28, 'month': 6, 'year': 2021}, {'day_name': 'Tuesday', 'day': 29, 'month': 6, 'year': 2021}]