I have a data frame with info like:
month year date well_number depth_to_water
April 2007 4/1/07 1 48.60
August 2007 8/1/07 2 80.20
December 2007 12/1/07 EM3 37.50
February 2007 2/1/07 27 32.00
February 2008 2/1/08 27 40.00
I'm trying to create a new column with the year-to-year difference in each month's depth to water; for well 27 that would be 32 - 40 = -8.
I've grouped the data frame, i.e.
grouped_dw = davis_wells.groupby(['well_number', 'month','year'], sort=True)
Which gives me exactly the sorting I need, so in theory I could just iterate through:
well_number month year date depth_to_water
1 April 2007 4/1/07 48.60
2008 4/1/08 62.30
2009 4/1/09 55.90
2010 4/1/10 36.20
2011 4/1/11 33.90
Out of which I'm trying to get:
well_number month year date depth_to_water change
1 April 2007 4/1/07 50 NaN
2008 4/1/08 60 -10
2009 4/1/09 55 5
2010 4/1/10 70 -15
2011 4/1/11 30 40
So I tried
grouped_dw['change'] = grouped_dw.depth_to_water(-1) - grouped_dw.depth_to_water
Which throws an error. Any ideas? Pretty sure I'm just not understanding how hierarchically grouped DataFrames work.
Thanks!
EDIT:
I used sort, which gives me almost everything I need... except I need it to produce a null value when it skips to the next month.
davis_wells = davis_wells.sort_values(['well_number', 'month'])
davis_wells['change'] = davis_wells.depth_to_water.shift(1) - davis_wells.depth_to_water
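To get that null back at each month boundary, one option (a sketch, assuming the same frame) is to let the shift run inside each well/month group instead of across the whole column, so the first year of every group has no previous value to subtract:
davis_wells = davis_wells.sort_values(['well_number', 'month', 'year'])
# shift(1) restarts inside each (well_number, month) group, so the
# earliest year of each group gets NaN in 'change'
davis_wells['change'] = (davis_wells.groupby(['well_number', 'month'])['depth_to_water'].shift(1)
                         - davis_wells['depth_to_water'])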
Month Year Open High Low Close/Price Volume
6 2019 86.78 87.11 86.06 86.55 1507828
6 2019 86.63 87.23 84.81 85.06 2481284
6 2019 85.38 85.81 84.75 85.33 2034693
6 2019 85.65 86.86 85.13 86.43 1394847
6 2019 86.66 87.74 86.66 87.55 3025379
7 2019 88.84 89.72 87.77 88.45 4017249
7 2019 89.21 90 87.95 88.87 2237183
7 2019 89.14 91.08 89.14 90.67 1647124
7 2019 90.39 90.95 89.07 90.59 3227673
I want to get the monthly average of: Open High Low Close/Price
How do I set two values (Month, Year) as parameters for getting a value that is in another column?
df = pd.read_excel('DatosUnited.xlsx')
month = df.groupby('Month')
year = df.groupby('Year')
june2019 = month.get_group("6")
year2019 = year.get_group('2019')
I tried something like this, but I don't know how to use both as a filter simultaneously.
You can use .groupby() with multiple columns, and then you can use .mean() to get the desired averages:
df.groupby(["Month", "Year"]).mean()
This outputs:
Open High Low Close/Price Volume
Month Year
6 2019 86.220 86.9500 85.4820 86.184 2088806.20
7 2019 89.395 90.4375 88.4825 89.645 2782307.25
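If you only want the averages of the four price columns (leaving Volume out), select them before aggregating:
df.groupby(["Month", "Year"])[["Open", "High", "Low", "Close/Price"]].mean()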
I have the below dataframe:
OUTLET_UNQ_CODE Category_Code month
0 2018020000065 SSSI January 21
1 2018020000066 SSSI January 21
2 2018020000067 SSSI January 21
...
512762 2021031641195 CH March 21
512763 2021031642445 CH March 21
512764 2021031643357 GM March 21
512765 2021031643863 GM March 21
There are a few OUTLET_UNQ_CODEs that have changed their Category_Code within a month, and into the next month as well. I need to count the number of hops every outlet has made. For example: if 2021031643863 had Category_Code GM in Jan 21, then CH in Jan 21 again, CH in Feb, and Kirana in March, that counts as 2 hops (GM to CH, then CH to Kirana).
This is what I have tried:
s=pd.to_numeric(new_df.Category_Code,errors='coerce')
df=new_df.assign(New=s.bfill())[s.isnull()].groupby('OUTLET_UNQ_CODE').agg({'Category_Code':list})
df.reset_index(inplace=True)
O/P is:
OUTLET_UNQ_CODE Category_Code
0 2021031643863 [GM,CH,CH,Kirana]
Regardless of whether there is a better way starting from the beginning, here is a piece of code to get the number of changes in the list, based on your output:
cat_lst = ['GM','CH','CH','Kirana']
a = sum(1 for i, x in enumerate(cat_lst[:-1]) if x != cat_lst[i + 1])
# in this case the result of a is 2
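If you want the count for every outlet at once, here is a sketch of the same idea in pandas (assuming the rows are already in chronological order per outlet): compare each Category_Code with its predecessor inside each OUTLET_UNQ_CODE group and count the mismatches:
# each True in (s != s.shift()) marks a change; the first row always
# compares unequal to NaN, so subtract 1 to get the number of hops
hops = (new_df.groupby('OUTLET_UNQ_CODE')['Category_Code']
              .apply(lambda s: (s != s.shift()).sum() - 1))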
I have a dataframe with historical market caps for which I need to compute their 5-year compound annual growth rates (CAGRs). However, the dataframe has hundreds of companies with 20 years of values each, so I need to be able to isolate each company's data to compute their CAGRs. How do I go about doing this?
The function to calculate a CAGR is: (end/start)^(1/# years)-1. I have never used .groupby() or .apply(), so I don't know how to implement the CAGR equation for rolling values.
Here is a screenshot of part of the dataframe so you have a visual representation of what I am trying to use:
[Screenshot of dataframe]
Any guidance would be greatly appreciated!
Assuming there is one value per company per year, you can reduce the date to a year. This is a lot simpler: no need for groupby or apply.
Say your dataframe is named df. First, reduce the date to a year:
df['year'] = df['Date'].dt.year
Second, add a year+5 column:
df['year+5'] = df['year'] + 5
Third, merge df with itself, lining each start year up with the end year five years later (note that year+5 goes on the left, so the _start suffix lands on the earlier row):
df_new = pd.merge(df, df, how='inner',
                  left_on=['Instrument', 'year+5'], right_on=['Instrument', 'year'],
                  suffixes=['_start', '_end'])
Finally, calculate the rolling 5-year CAGR (the 0.2 exponent is 1/5 years):
df_new['CAGR'] = (df_new['Company Market Cap_end']/df_new['Company Market Cap_start'])**(0.2)-1
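As a quick sanity check on toy data (a sketch; the Instrument and Company Market Cap column names are guesses from the screenshot, so adjust them to your actual frame):
import pandas as pd

df = pd.DataFrame({'Instrument': ['A'] * 6,
                   'year': [2015, 2016, 2017, 2018, 2019, 2020],
                   'Company Market Cap': [100.0, 110.0, 125.0, 140.0, 160.0, 200.0]})
df['year+5'] = df['year'] + 5
df_new = pd.merge(df, df, how='inner',
                  left_on=['Instrument', 'year+5'], right_on=['Instrument', 'year'],
                  suffixes=['_start', '_end'])
# only 2015 -> 2020 has both endpoints here: (200/100) ** 0.2 - 1 ≈ 0.1487
df_new['CAGR'] = (df_new['Company Market Cap_end'] / df_new['Company Market Cap_start']) ** 0.2 - 1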
Setting up a toy example:
import numpy as np
import pandas as pd
idx_level_0 = np.repeat(["company1", "company2", "company3"], 5)
idx_level_1 = np.tile([2015, 2016, 2017, 2018, 2019], 3)
values = np.random.randint(low=1, high=100, size=15)
df = pd.DataFrame({"values": values}, index=[idx_level_0, idx_level_1])
df.index.names = ["company", "year"]
print(df)
values
company year
company1 2015 19
2016 61
2017 87
2018 55
2019 46
company2 2015 1
2016 68
2017 50
2018 93
2019 84
company3 2015 11
2016 84
2017 54
2018 21
2019 55
I suggest using groupby to group by individual companies. You could then apply your computation via a lambda function. The result is basically a one-liner.
# actual computation for a two-year period
cagr_period = 2
df["cagr"] = df.groupby("company").apply(lambda x, period: ((x.pct_change(period) + 1) ** (1/period)) - 1, cagr_period)
print(df)
values cagr
company year
company1 2015 19 NaN
2016 61 NaN
2017 87 1.139848
2018 55 -0.050453
2019 46 -0.272858
company2 2015 1 NaN
2016 68 NaN
2017 50 6.071068
2018 93 0.169464
2019 84 0.296148
company3 2015 11 NaN
2016 84 NaN
2017 54 1.215647
2018 21 -0.500000
2019 55 0.009217
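For comparison, the same numbers can be had without the lambda by calling pct_change on the grouped column directly (a one-liner sketch):
# pct_change restarts within each company group, so no cross-company leakage
df["cagr"] = ((df.groupby("company")["values"].pct_change(cagr_period) + 1)
              ** (1 / cagr_period)) - 1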
I want to create a dataframe, grouped by region and date, that shows the average age of a region during specific years. So my columns would look something like:
region, year, average age
so far I have:
# specify aggregation functions for column 'age'
ageAverage = {'age':{'average age':'mean'}}
#groupby and apply functions
ageDataFrame = data.groupby(['Region', data.Date.dt.year]).agg(ageAverage)
This works great, but how can I make it so that I only group data from specific years? say for example between 2010 and 2015?
You need to filter first with between:
ageDataFrame = (data[data.Date.dt.year.between(2010, 2015)]
.groupby(['Region', data.Date.dt.year])
.agg(ageAverage))
Also, in the latest version of pandas (0.22.0), this nested dictionary raises:
SpecificationError: cannot perform renaming for age with a nested dictionary
The correct solution is to select the column after groupby and aggregate with a list of tuples, where the first value is the new column name and the second the aggregate function:
import numpy as np
import pandas as pd

np.random.seed(123)
rng = pd.date_range('2009-04-03', periods=10, freq='13M')
data = pd.DataFrame({'Date': rng,
'Region':['reg1'] * 3 + ['reg2'] * 7,
'average age': np.random.randint(20, size=10)})
print (data)
Date Region average age
0 2009-04-30 reg1 13
1 2010-05-31 reg1 2
2 2011-06-30 reg1 2
3 2012-07-31 reg2 6
4 2013-08-31 reg2 17
5 2014-09-30 reg2 19
6 2015-10-31 reg2 10
7 2016-11-30 reg2 1
8 2017-12-31 reg2 0
9 2019-01-31 reg2 17
ageAverage = [('age', 'mean')]
# groupby and apply functions
ageDataFrame = (data[data.Date.dt.year.between(2010, 2015)]
.groupby(['Region', data.Date.dt.year])['average age']
.agg(ageAverage))
print (ageDataFrame)
age
Region Date
reg1 2010 2
2011 2
reg2 2012 6
2013 17
2014 19
2015 10
Two variations using #jezrael's data (thx)
These are very close to what #jezrael has already shown. Only view this as a demonstration of what else can be done. As pointed out in the comments by #jezrael, it is better to pre-filter first as it reduces overall processing.
pandas.IndexSlice
instead of prefiltering with between
data.groupby(
['Region', data.Date.dt.year]
)['average age'].agg(
[('age', 'mean')]
).loc[pd.IndexSlice[:, 2010:2015], :]
age
Region Date
reg1 2010 2
2011 2
reg2 2012 6
2013 17
2014 19
2015 10
between as part of the groupby
data.groupby(
[data.Date.dt.year.between(2010, 2015),
'Region', data.Date.dt.year]
)['average age'].agg(
[('age', 'mean')]
).loc[True]
age
Region Date
reg1 2010 2
2011 2
reg2 2012 6
2013 17
2014 19
2015 10
I'm working in Python and I have a Pandas DataFrame of Uber data from New York City. A part of the DataFrame looks like this:
Year Week_Number Total_Dispatched_Trips
2015 51 1,109
2015 5 54,380
2015 50 8,989
2015 51 1,025
2015 21 10,195
2015 38 51,957
2015 43 266,465
2015 29 66,139
2015 40 74,321
2015 39 3
2015 50 854
As it is right now, the same week appears multiple times for each year. I want to sum the values for "Total_Dispatched_Trips" for every week for each year. I want each week to appear only once per year. (So week 51 can't appear multiple times for year 2015 etc.). How do I do this? My dataset is over 3k rows, so I would prefer not to do this manually.
Thanks in advance.
Okidoki, here it is, borrowing from Convert number strings with commas in pandas DataFrame to float:
import locale
from locale import atof
locale.setlocale(locale.LC_NUMERIC, '')
df['numeric_trip'] = pd.to_numeric(df.Total_Dispatched_Trips.apply(atof), errors='coerce')
df.groupby(['Year', 'Week_Number']).numeric_trip.sum()
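If you'd rather not depend on the system locale, here is a sketch that strips the thousands separators directly instead:
# remove the commas, convert to numbers, then sum per (Year, Week_Number)
df['numeric_trip'] = pd.to_numeric(
    df['Total_Dispatched_Trips'].str.replace(',', ''), errors='coerce')
weekly_totals = df.groupby(['Year', 'Week_Number'], as_index=False)['numeric_trip'].sum()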