Month Year Open High Low Close/Price Volume
6 2019 86.78 87.11 86.06 86.55 1507828
6 2019 86.63 87.23 84.81 85.06 2481284
6 2019 85.38 85.81 84.75 85.33 2034693
6 2019 85.65 86.86 85.13 86.43 1394847
6 2019 86.66 87.74 86.66 87.55 3025379
7 2019 88.84 89.72 87.77 88.45 4017249
7 2019 89.21 90 87.95 88.87 2237183
7 2019 89.14 91.08 89.14 90.67 1647124
7 2019 90.39 90.95 89.07 90.59 3227673
I want to get the monthly average of: Open High Low Close/Price
How do I set two values (Month, Year) as parameters for getting a value that is in another column?
import pandas as pd

df = pd.read_excel('DatosUnited.xlsx')
month = df.groupby('Month')
year = df.groupby('Year')
june2019 = month.get_group(6)
year2019 = year.get_group(2019)
I tried something like this, but I don't know how to use both as a filter simultaneously.
You can use .groupby() with multiple columns and then .mean() to get the desired averages:
df.groupby(["Month", "Year"]).mean()
This outputs:
Open High Low Close/Price Volume
Month Year
6 2019 86.220 86.9500 85.4820 86.184 2088806.20
7 2019 89.395 90.4375 88.4825 89.645 2782307.25
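If you then need a single (Month, Year) combination from that result, you can index the MultiIndex with a tuple. A minimal sketch (assuming Month and Year were read from the Excel file as integers):

monthly = df.groupby(["Month", "Year"]).mean()
monthly.loc[(6, 2019), "Close/Price"]   # average June 2019 close, about 86.18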
Related
I am really stuck on how to approach adding columns to pandas dynamically. I've been trying to search for an answer, but I am afraid I may be using the wrong terminology to describe what I am attempting to do.
I have a dataframe returned from a query that looks like the following:
department action date
marketing close 09-01-2017
marketing close 07-01-2018
marketing close 06-01-2017
marketing close 10-21-2019
marketing open 08-01-2018
marketing other 07-14-2018
sales open 02-01-2019
sales open 02-01-2017
sales close 02-22-2019
The ultimate goal is I need a count of the types of actions grouped within particular date ranges.
My DESIRED output is something along the lines of:
department 01/01/2017-12/31/2017 01/01/2018-12/31/2018 01/01/2019-12/31/2019
open close other open close other open close other
marketing 0 2 0 1 1 1 0 1 0
sales 1 0 0 0 0 0 1 1 0
"Department" would be my index, then the contents would be filtered by date ranges specified in a list I provide, followed by the action taken (with counts). Being newer to this, I am confused as to what approach I should take - for example should I use Python (should I be looping or iterating), or should the heavy lifting be done in PANDAS. If in PANDAS, I am having difficulty determining what function to use (I've been looking at get_dummy() etc.).
I'd imagine this would be accomplished with either 1. some type of for loop iterating through, 2. adding a column to the dataframe based on the list and then filtering the data underneath based on the value(s), or 3. using a function I am not aware of in pandas.
I have explained more of my thought process in this question, but I am not sure whether the question is unclear, which may be why it is unanswered:
Building a dataframe with dynamic date ranges using filtered results from another dataframe
There are quite a few concepts you need at once here.
First, you don't yet have the count. From your desired output I take it you want it yearly, but you can specify any time frame you want. Then just count with groupby() and count():
In [66]: df2 = df.groupby([pd.to_datetime(df.date).dt.year, "action", "department"]).count().squeeze().rename("count")
Out[66]:
date action department
2017 close marketing 2
open sales 1
2018 close marketing 1
open marketing 1
other marketing 1
2019 close marketing 1
sales 1
open sales 1
Name: count, dtype: int64
The squeeze() and rename() are there because afterwards both the count column and the year would be called date and you get a name conflict. You could equivalently use rename(columns={'date': 'count'}) and not cast to a Series.
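A minimal sketch of that equivalent form, keeping the result as a DataFrame rather than squeezing to a Series:

df2 = (df.groupby([pd.to_datetime(df.date).dt.year, "action", "department"])
         .count()
         .rename(columns={"date": "count"}))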
The second step is a pivot_table. This creates column names from values. Because there are combinations of date and action without a corresponding value, you need pivot_table.
In [62]: df2.reset_index().pivot_table(index="department", columns=["date", "action"])
Out[62]:
count
date 2017 2018 2019
action close open close open other close open
department
marketing 2.0 NaN 1.0 1.0 1.0 1.0 NaN
sales NaN 1.0 NaN NaN NaN 1.0 1.0
Because NaN is internally represented as floating point, your counts were also converted to floating point. To fix that, just append fillna() and convert back to int with astype(int).
In [65]: df2.reset_index().pivot_table(index="department", columns=["date", "action"]).fillna(0).astype(int)
Out[65]:
count
date 2017 2018 2019
action close open close open other close open
department
marketing 2 0 1 1 1 1 0
sales 0 1 0 0 0 1 1
To get exactly your output you would need to modify pd.to_datetime(df.date).dt.year. You can do this with strftime (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.dt.strftime.html). Furthermore, the column ["2017", "other"] was dropped because there was no value. If this creates problems, you need to include the values beforehand. After the first step, a reindex and a fillna should do the trick.
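A minimal sketch of that strftime idea (the range-style label format follows the desired output in the question and is my own choice, not part of the original answer):

# build year-wide labels such as "01/01/2017-12/31/2017" and group by them instead of the year
range_label = pd.to_datetime(df.date).dt.strftime("01/01/%Y-12/31/%Y")
df2 = df.groupby([range_label, "action", "department"]).count().squeeze().rename("count")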
EDIT: Yes, a reindex and a fillna do the trick:
In [77]: new_index = pd.MultiIndex.from_product([[2017, 2018, 2019], ["close", "open", "other"], ['marketing', 'sales']], names=['date', 'action', 'department'])
In [78]: df3 = df2.reindex(new_index).fillna(0).astype(int).reset_index()
Out[78]:
date action department count
0 2017 close marketing 2
1 2017 close sales 0
2 2017 open marketing 0
3 2017 open sales 1
4 2017 other marketing 0
5 2017 other sales 0
6 2018 close marketing 1
.. ... ... ... ...
11 2018 other sales 0
12 2019 close marketing 1
13 2019 close sales 1
14 2019 open marketing 0
15 2019 open sales 1
16 2019 other marketing 0
17 2019 other sales 0
In [79]: df3.pivot_table(index="department", columns=["date", "action"])
Out[79]:
count
date 2017 2018 2019
action close open other close open other close open other
department
marketing 2 0 0 1 1 1 1 0 0
sales 0 1 0 0 0 0 1 1 0
I want to create a dataframe that is grouped by region and date and shows the average age of a region during specific years, so my columns would look something like:
region, year, average age
so far I have:
# specify aggregation functions for column 'age'
ageAverage = {'age':{'average age':'mean'}}
# group by and apply the functions
ageDataFrame = data.groupby(['Region', data.Date.dt.year]).agg(ageAverage)
This works great, but how can I make it so that I only group data from specific years? say for example between 2010 and 2015?
You need to filter first by between:
ageDataFrame = (data[data.Date.dt.year.between(2010, 2015)]
                    .groupby(['Region', data.Date.dt.year])
                    .agg(ageAverage))
Also, in the latest version of pandas (0.22.0) this raises:
SpecificationError: cannot perform renaming for age with a nested dictionary
The correct solution is to select the column after groupby and aggregate with a tuple, where the first value is the new column name and the second is the aggregate function:
import numpy as np
import pandas as pd

np.random.seed(123)
rng = pd.date_range('2009-04-03', periods=10, freq='13M')
data = pd.DataFrame({'Date': rng,
                     'Region': ['reg1'] * 3 + ['reg2'] * 7,
                     'average age': np.random.randint(20, size=10)})
print (data)
Date Region average age
0 2009-04-30 reg1 13
1 2010-05-31 reg1 2
2 2011-06-30 reg1 2
3 2012-07-31 reg2 6
4 2013-08-31 reg2 17
5 2014-09-30 reg2 19
6 2015-10-31 reg2 10
7 2016-11-30 reg2 1
8 2017-12-31 reg2 0
9 2019-01-31 reg2 17
ageAverage = [('age', 'mean')]
# group by and aggregate
ageDataFrame = (data[data.Date.dt.year.between(2010, 2015)]
                    .groupby(['Region', data.Date.dt.year])['average age']
                    .agg(ageAverage))
print (ageDataFrame)
age
Region Date
reg1 2010 2
2011 2
reg2 2012 6
2013 17
2014 19
2015 10
Two variations using #jezrael's data (thx)
These are very close to what #jezrael has already shown. Only view this as a demonstration of what else can be done. As pointed out in the comments by #jezrael, it is better to pre-filter first as it reduces overall processing.
pandas.IndexSlice
instead of prefiltering with between
data.groupby(
    ['Region', data.Date.dt.year]
)['average age'].agg(
    [('age', 'mean')]
).loc[pd.IndexSlice[:, 2010:2015], :]
age
Region Date
reg1 2010 2
2011 2
reg2 2012 6
2013 17
2014 19
2015 10
between as part of the groupby
data.groupby(
    [data.Date.dt.year.between(2010, 2015),
     'Region', data.Date.dt.year]
)['average age'].agg(
    [('age', 'mean')]
).loc[True]
age
Region Date
reg1 2010 2
2011 2
reg2 2012 6
2013 17
2014 19
2015 10
I'm working in Python and I have a Pandas DataFrame of Uber data from New York City. A part of the DataFrame looks like this:
Year Week_Number Total_Dispatched_Trips
2015 51 1,109
2015 5 54,380
2015 50 8,989
2015 51 1,025
2015 21 10,195
2015 38 51,957
2015 43 266,465
2015 29 66,139
2015 40 74,321
2015 39 3
2015 50 854
As it is right now, the same week appears multiple times for each year. I want to sum the values for "Total_Dispatched_Trips" for every week for each year. I want each week to appear only once per year. (So week 51 can't appear multiple times for year 2015 etc.). How do I do this? My dataset is over 3k rows, so I would prefer not to do this manually.
Thanks in advance.
Okidoki, here it is, borrowing from Convert number strings with commas in pandas DataFrame to float:
import locale
from locale import atof
locale.setlocale(locale.LC_NUMERIC, '')

# parse the comma-separated strings as numbers, then sum per (Year, Week_Number)
df['numeric_trip'] = pd.to_numeric(df.Total_Dispatched_Trips.apply(atof), errors='coerce')
df.groupby(['Year', 'Week_Number']).numeric_trip.sum()
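If you'd rather not touch the locale settings, a sketch of an alternative (same column names as above) that simply strips the thousands separators with str.replace:

df['numeric_trip'] = pd.to_numeric(
    df.Total_Dispatched_Trips.str.replace(',', ''), errors='coerce')
df.groupby(['Year', 'Week_Number']).numeric_trip.sum()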
I have a data frame with info like:
month year date well_number depth_to_water
April 2007 4/1/07 1 48.60
August 2007 8/1/07 2 80.20
December 2007 12/1/07 EM3 37.50
February 2007 2/1/07 27 32.00
February 2008 2/1/08 27 40.00
I'm trying to create a new column with the year-to-year differences in each month's depth to water; so for well 27: 32 - 40 = -8.
I've grouped the data frame, i.e.
grouped_dw = davis_wells.groupby(['well_number', 'month','year'], sort=True)
This gives me exactly the sorting I need to theoretically just iterate through:
well_number month year date depth_to_water
1 April 2007 4/1/07 48.60
2008 4/1/08 62.30
2009 4/1/09 55.90
2010 4/1/10 36.20
2011 4/1/11 33.90
Out of which I'm trying to get:
well_number month year date depth_to_water change
1 April 2007 4/1/07 50 NaN
2008 4/1/08 60 -10
2009 4/1/09 55 5
2010 4/1/10 70 -15
2011 4/1/11 30 40
So I tried
grouped_dw['change'] = grouped_dw.depth_to_water(-1) - grouped_dw.depth_to_water
This throws an error. Any ideas? I'm pretty sure I'm just not understanding how hierarchical grouped-by DataFrames work.
Thanks!
EDIT:
I used sort, which gives me almost everything I need, except that I need it to give a null value when skipping to the next month.
davis_wells = davis_wells.sort(['well_number', 'month'])
davis_wells['change'] = davis_wells.depth_to_water.shift(1) - davis_wells.depth_to_water
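A minimal sketch of how the missing-null part could be handled (my own suggestion, not from the original post): shift within each (well_number, month) group, so the first year of every group has no previous value and the change becomes NaN instead of carrying over from another group.

# sort so that shift(1) within a group refers to the previous year
davis_wells = davis_wells.sort_values(['well_number', 'month', 'year'])
prev = davis_wells.groupby(['well_number', 'month'])['depth_to_water'].shift(1)
davis_wells['change'] = prev - davis_wells['depth_to_water']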
With Pandas I have created a DataFrame from an imported .csv file (this file is generated through simulation). The DataFrame consists of half-hourly energy consumption data for a single year. I have already created a DateTimeindex for the dates.
I would like to be able to reformat this data into average hourly week and weekend profile results, with the week profile excluding holidays.
DataFrame:
Date_Time Equipment:Electricity:LGF Equipment:Electricity:GF
01/01/2000 00:30 0.583979872 0.490327348
01/01/2000 01:00 0.583979872 0.490327348
01/01/2000 01:30 0.583979872 0.490327348
01/01/2000 02:00 0.583979872 0.490327348
I found an example (Getting the average of a certain hour on weekdays over several years in a pandas dataframe) that explains doing this for several years, but not explicitly for a week (without holidays) and weekend.
I realised that there are no resampling techniques in pandas that do this directly, so I used several aliases (http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases) for creating monthly and daily profiles.
I was thinking of using the business day frequency to create a new date index of working days and compare it to my DataFrame's DatetimeIndex for every half hour, then return values for working days and weekend days when true or false respectively to create a new dataset, but I am not sure how to do this.
PS: I am just getting into Python and pandas.
Dummy data (for future reference, you're more likely to get an answer if you post some in copy-paste-able form):
import numpy as np
import pandas as pd

df = pd.DataFrame(data={'a': np.random.randn(1000)},
                  index=pd.date_range(start='2000-01-01', periods=1000, freq='30T'))
Here's an approach. First, define a US (or modify as appropriate) business day offset with holidays, and generate a range covering your dates.
from pandas.tseries.holiday import USFederalHolidayCalendar
from pandas.tseries.offsets import CustomBusinessDay
bday_us = CustomBusinessDay(calendar=USFederalHolidayCalendar())
bday_over_df = pd.date_range(start=df.index.min().date(),
                             end=df.index.max().date(), freq=bday_us)
Then, develop your two grouping columns. An hour column is easy.
df['hour'] = df.index.hour
For weekday/weekend/holiday, define a function to group the data.
def group_day(date):
    if date.weekday() in [5, 6]:
        return 'weekend'
    elif date.date() in bday_over_df:
        return 'weekday'
    else:
        return 'holiday'
df['day_group'] = df.index.map(group_day)
Then, just group by the two columns as you wish.
In [140]: df.groupby(['day_group', 'hour']).sum()
Out[140]:
a
day_group hour
holiday 0 1.890621
1 -0.029606
2 0.255001
3 2.837000
4 -1.787479
5 0.644113
6 0.407966
7 -1.798526
8 -0.620614
9 -0.567195
10 -0.822207
11 -2.675911
12 0.940091
13 -1.601885
14 1.575595
15 1.500558
16 -2.512962
17 -1.677603
18 0.072809
19 -1.406939
20 2.474293
21 -1.142061
22 -0.059231
23 -0.040455
weekday 0 9.192131
1 2.759302
2 8.379552
3 -1.189508
4 3.796635
5 3.471802
... ...
18 -5.217554
19 3.294072
20 -7.461023
21 8.793223
22 4.096128
23 -0.198943
weekend 0 -2.774550
1 0.461285
2 1.522363
3 4.312562
4 0.793290
5 2.078327
6 -4.523184
7 -0.051341
8 0.887956
9 2.112092
10 -2.727364
11 2.006966
12 7.401570
13 -1.958666
14 1.139436
15 -1.418326
16 -2.353082
17 -1.381131
18 -0.568536
19 -5.198472
20 -3.405137
21 -0.596813
22 1.747980
23 -6.341053
[72 rows x 1 columns]
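Since the question asked for average profiles rather than totals, a small variation (my note, not part of the original answer) is to take the mean instead of the sum:

df.groupby(['day_group', 'hour'])['a'].mean()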