I need to find the variability from the long-term mean for monthly data from 1991 to 2021. I have long-term average data that looks like this, with shape (204, 3):
dfavgs =
plant_name month power_kwh
0 ARIZONA I 1 10655.989885
1 ARIZONA I 2 9789.542672
2 ARIZONA I 3 7889.403154
3 ARIZONA I 4 7965.595843
4 ARIZONA I 5 9299.316756
.. ... ... ...
199 SANTANA II 8 16753.999870
200 SANTANA II 9 17767.383616
201 SANTANA II 10 17430.005363
202 SANTANA II 11 16628.784139
203 SANTANA II 12 15167.085560
My larger monthly-by-year df looks like this, with shape (6137, 4):
dfmonthlys:
plant_name year month power_kwh
0 ARIZONA I 1991 1 9256.304704
1 ARIZONA I 1991 2 8851.689732
2 ARIZONA I 1991 3 7649.949328
3 ARIZONA I 1991 4 6728.544028
4 ARIZONA I 1991 5 8601.165457
... ... ... ...
6132 SANTANA II 2020 9 16481.202361
6133 SANTANA II 2020 10 15644.358737
6134 SANTANA II 2020 11 14368.804306
6135 SANTANA II 2020 12 15473.958468
6136 SANTANA II 2021 1 13161.219086
My new df "dfvar" should look like this, showing the monthly deviation from the long-term mean by year (I don't think the values shown below are correct):
plant_name year month Var
0 ARIZONA I 1991 1 -0.250259
1 ARIZONA I 1991 2 -0.283032
2 ARIZONA I 1991 3 -0.380370
3 ARIZONA I 1991 4 -0.455002
4 ARIZONA I 1991 5 -0.303324
I could do this easily in MATLAB, but I'm not sure how to do it in pandas, which I need to learn. Thank you very much. I've tried the line below, which gives me a Series, but there are unexpected NaNs in the final rows:
t = dfmonthlys['power_kwh']/dfavgs.loc[:,'power_kwh'] - 1
the output from above looks like this:
t
Out[159]:
0 -0.131352
1 -0.095802
2 -0.030351
3 -0.155299
4 -0.075076
6132 NaN
6133 NaN
6134 NaN
6135 NaN
6136 NaN
Name: power_kwh, Length: 6137, dtype: float64
Here is example code showing one way to do it. Your division produces NaNs because pandas aligns operands by index, not by month and plant name: dfavgs has only 204 rows, so rows 204-6136 of dfmonthlys have nothing to divide by. Instead, merge dfavgs into the monthly data on month and plant name, then assign the calculation to a new column.
import numpy as np
import pandas as pd
dfavgs = {'plant_name':np.append(np.repeat(["ARIZONA I"], 12) , np.repeat("SANTANA II", 12)),
'month': np.tile(range(1, 13), 2),
'mnth_power_kwh': np.concatenate(([10655, 9789, 7889, 7965, 9299],
range(8000, 1500, -1000), range(12000, 500, -1000)))}
dfavgs=pd.DataFrame(dfavgs)
dfmonthlys = {'plant_name':np.append(np.repeat("ARIZONA I", 24), np.repeat("SANTANA II", 24)),
'year': np.tile(np.repeat([1991, 1992], 12), 2),
'month': np.tile(np.tile(range(1, 13), 2), 2),
'power_kwh': np.concatenate(([9256, 8851, 7649, 6728, 8601],
range(7000, 500, -1000),
range(13000, 1500, -1000),
range(25000, 1500, -1000)))}
dfmonthlys=pd.DataFrame(dfmonthlys)
merg=pd.merge(dfmonthlys, dfavgs, how="left", on=["month", "plant_name"])\
.assign(diff = lambda x: x["power_kwh"]/x["mnth_power_kwh"]-1)
print(merg)
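If the long-term means in dfavgs were themselves computed from dfmonthlys, you can skip the merge entirely: groupby.transform('mean') broadcasts each group's mean back onto the original rows, so the division aligns row-for-row. A minimal sketch with hypothetical toy numbers (two plants, two months, two years):

```python
import pandas as pd

# Toy monthly data (hypothetical values, not the asker's real numbers)
dfmonthlys = pd.DataFrame({
    'plant_name': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
    'year':       [1991, 1991, 1992, 1992, 1991, 1991, 1992, 1992],
    'month':      [1, 2, 1, 2, 1, 2, 1, 2],
    'power_kwh':  [100.0, 200.0, 300.0, 400.0, 10.0, 20.0, 30.0, 40.0],
})

# transform('mean') returns a Series the same length as dfmonthlys,
# holding each row's (plant, month) long-term mean -- no merge needed
longterm = dfmonthlys.groupby(['plant_name', 'month'])['power_kwh'].transform('mean')
dfvar = dfmonthlys.assign(Var=dfmonthlys['power_kwh'] / longterm - 1)
print(dfvar)
```

For plant A, month 1 the long-term mean is (100 + 300) / 2 = 200, so 1991 deviates by -0.5 and 1992 by +0.5.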
My df has USA-state-related information. I want to rank the states based on their contribution.
My code:
df
State Value Year
0 FL 100 2012
1 CA 150 2013
2 MA 25 2014
3 FL 50 2014
4 CA 50 2015
5 MA 75 2016
Expected answer: compute state_capacity by summing each state's values across all years, then rank the states by that capacity.
df
State Value Year State_Capa. Rank
0 FL 100 2012 150 2
1 CA 150 2013 200 1
2 MA 25 2014 100 3
3 FL 150 2014 200 2
4 CA 50 2015 200 1
5 MA 75 2016 100 3
My approach: I am able to compute the state capacity using groupby, but I ran into NaNs when I mapped it back onto the df.
state_capacity = df[['State','Value']].groupby(['State']).sum()
df['State_Capa.'] = df['State'].map(dict(state_capacity))
df
State Value Year State_Capa.
0 FL 100 2012 NaN
1 CA 150 2013 NaN
2 MA 25 2014 NaN
3 FL 50 2014 NaN
4 CA 50 2015 NaN
5 MA 75 2016 NaN
Try with transform, then rank:
df['new'] = df.groupby('State').Value.transform('sum').rank(method='dense',ascending=False)
Out[42]:
0 2.0
1 1.0
2 3.0
3 2.0
4 1.0
5 3.0
Name: Value, dtype: float64
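To also materialize the State_Capa. column from the expected output, the chain can be split in two: the transformed sum becomes the capacity column, and ranking that column gives the rank. A sketch using the question's data (column names as in the question):

```python
import pandas as pd

df = pd.DataFrame({
    'State': ['FL', 'CA', 'MA', 'FL', 'CA', 'MA'],
    'Value': [100, 150, 25, 50, 50, 75],
    'Year':  [2012, 2013, 2014, 2014, 2015, 2016],
})

# Broadcast each state's total back onto its rows...
df['State_Capa.'] = df.groupby('State')['Value'].transform('sum')
# ...then dense-rank the totals so the highest capacity gets rank 1
df['Rank'] = df['State_Capa.'].rank(method='dense', ascending=False).astype(int)
print(df)
```

With this data the capacities are FL 150, CA 200, MA 100, so the ranks come out CA 1, FL 2, MA 3.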
As mentioned in the comments, your question seems inconsistent (row 3 of the expected output shows Value 150, but the input has 50). However, I guess this might be what you want:
df = pd.DataFrame({
'state': ['FL', 'CA', 'MA', 'FL', 'CA', 'MA'],
'value': [100, 150, 25, 50, 50, 75],
'year': [2012, 2013, 2014, 2014, 2015, 2016]
})
returns:
state value year
0 FL 100 2012
1 CA 150 2013
2 MA 25 2014
3 FL 50 2014
4 CA 50 2015
5 MA 75 2016
and
groupby_sum = df.groupby('state')[['value']].sum()
groupby_sum['rank'] = groupby_sum['value'].rank(ascending=False)
groupby_sum.reset_index()
returns:
state value rank
0 CA 200 1.0
1 FL 150 2.0
2 MA 100 3.0
I have a dataframe containing:
State Country Date Cases
0 NaN Afghanistan 2020-01-22 0
271 NaN Afghanistan 2020-01-23 0
... ... ... ... ...
85093 NaN Zimbabwe 2020-11-30 9950
85364 NaN Zimbabwe 2020-12-01 10129
I'm trying to create a new column of cumulative cases but grouped by Country AND State.
State Country Date Cases Total Cases
231 California USA 2020-01-22 5 5
342 California USA 2020-01-23 10 15
233 Texas USA 2020-01-22 4 4
322 Texas USA 2020-01-23 12 16
I have been trying to follow Pandas groupby cumulative sum and have tried things such as:
df['Total'] = df.groupby(['State','Country'])['Cases'].cumsum()
Returns a series of -1's
df['Total'] = df.groupby(['State', 'Country']).sum() \
.groupby(level=0).cumsum().reset_index()
Returns the sum.
df['Total'] = df.groupby(['Country'])['Cases'].apply(lambda x: x.cumsum())
Doesn't separate sums by state.
df_f['Total'] = df_f.groupby(['Region','State'])['Cases'].apply(lambda x: x.cumsum())
This one works, except that when 'State' is NaN, 'Total' is also NaN.
arrays = [['California', 'California', 'Texas', 'Texas'],
['USA', 'USA', 'USA', 'USA'],
['2020-01-22','2020-01-23','2020-01-22','2020-01-23'], [5,10,4,12]]
df = pd.DataFrame(list(zip(*arrays)), columns = ['State', 'Country', 'Date', 'Cases'])
df
State Country Date Cases
0 California USA 2020-01-22 5
1 California USA 2020-01-23 10
2 Texas USA 2020-01-22 4
3 Texas USA 2020-01-23 12
temp = df.set_index(['State', 'Country','Date'], drop=True).sort_index( )
df['Total Cases'] = temp.groupby(['State', 'Country']).cumsum().reset_index()['Cases']
df
State Country Date Cases Total Cases
0 California USA 2020-01-22 5 5
1 California USA 2020-01-23 10 15
2 Texas USA 2020-01-22 4 4
3 Texas USA 2020-01-23 12 16
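The NaN problem in the asker's last attempt comes from groupby dropping NaN keys by default. Since pandas 1.1 groupby has a dropna parameter that keeps NaN as a valid group key; a sketch assuming that version or later, with a few toy rows modeled on the question's data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'State':   ['California', 'California', np.nan, np.nan],
    'Country': ['USA', 'USA', 'Afghanistan', 'Afghanistan'],
    'Date':    ['2020-01-22', '2020-01-23', '2020-01-22', '2020-01-23'],
    'Cases':   [5, 10, 0, 3],
})

# dropna=False treats NaN as its own group, so country-level rows
# with no State still get a running total instead of NaN
df['Total Cases'] = df.groupby(['State', 'Country'], dropna=False)['Cases'].cumsum()
print(df)
```

Here the Afghanistan rows (State NaN) accumulate to 0 and 3 instead of turning into NaN.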
I have two data frames:
First:
Job = {'Name': ["Ron", "Joe", "Dan"],
'Job': [[2000, 2001], 1998, [2000, 1999]]
}
df = pd.DataFrame(Job, columns = ['Name', 'Job'])
Name Job
0 Ron [2000, 2001]
1 Joe 1998
2 Dan [2000, 1999]
Second:
Empty = {'Name': ["Ron", "Ron", "Ron", "Ron", "Joe", "Joe", "Joe", "Joe", "Dan", "Dan", "Dan", "Dan"],
'Year': [1998, 1999, 2000, 2001, 1998, 1999, 2000, 2001, 1998, 1999, 2000, 2001]
}
df2 = pd.DataFrame(Empty, columns = ['Name', 'Year'])
Name Year
0 Ron 1998
1 Ron 1999
2 Ron 2000
3 Ron 2001
4 Joe 1998
5 Joe 1999
6 Joe 2000
7 Joe 2001
8 Dan 1998
9 Dan 1999
10 Dan 2000
11 Dan 2001
I want to add a column to df2 (let's call it 'job_status'), where each year that is associated with a name in df1 will receive 1 in df2 and 0 otherwise. This should be the output:
Name Year job_status
0 Ron 1998 0
1 Ron 1999 0
2 Ron 2000 1
3 Ron 2001 1
4 Joe 1998 1
5 Joe 1999 0
6 Joe 2000 0
7 Joe 2001 0
8 Dan 1998 0
9 Dan 1999 1
10 Dan 2000 1
11 Dan 2001 0
How can I accomplish this?
First explode the dataframe df on Job, casting the exploded column to int so the merge keys share a dtype, then left-merge it with df2; finally, use Series.notna plus an integer cast to assign the 0/1 labels to job_status:
d = df2.merge(df.explode('Job').astype({'Job': int}),
              left_on=['Name', 'Year'], right_on=['Name', 'Job'], how='left')
d['job_status'] = d.pop('Job').notna().astype('i1')
Result:
print(d)
Name Year job_status
0 Ron 1998 0
1 Ron 1999 0
2 Ron 2000 1
3 Ron 2001 1
4 Joe 1998 1
5 Joe 1999 0
6 Joe 2000 0
7 Joe 2001 0
8 Dan 1998 0
9 Dan 1999 1
10 Dan 2000 1
11 Dan 2001 0
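An equivalent sketch that makes the matched/unmatched logic explicit via merge's indicator parameter (same explode idea, names taken from the question):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Ron', 'Joe', 'Dan'],
                   'Job': [[2000, 2001], 1998, [2000, 1999]]})
df2 = pd.DataFrame({'Name': ['Ron'] * 4 + ['Joe'] * 4 + ['Dan'] * 4,
                    'Year': [1998, 1999, 2000, 2001] * 3})

# One row per (Name, job year); scalars like Joe's 1998 pass through explode
jobs = df.explode('Job').astype({'Job': int}).rename(columns={'Job': 'Year'})

# indicator=True adds a _merge column: 'both' means the (Name, Year)
# pair existed in jobs, 'left_only' means it did not
d = df2.merge(jobs, on=['Name', 'Year'], how='left', indicator=True)
df2['job_status'] = (d['_merge'] == 'both').astype(int)
print(df2)
```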
I have data that looks like:
Year Month Region Value
1978 1 South 1
1990 1 North 22
1990 2 South 33
1990 2 Mid W 12
1998 1 South 1
1998 1 North 12
1998 2 South 2
1998 3 South 4
1998 1 Mid W 2
.
.
up to
2010
2010
My end date is 2010, but I want to sum all Values by Region and Month, adding all previous years' values together.
I don't want a regular cumulative sum but a monthly cumulative sum by Region, where Month 1 of Region South accumulates all Month 1 values for South from previous years, and so on.
Desired output is something like:
Month Region Cum_Value
1 South 2
2 South 34
3 South 4
.
.
1 North 34
2 North 10
.
.
1 MidW 2
2 MidW 12
Use pd.DataFrame.groupby with pd.DataFrame.cumsum
df1['cumsum'] = df1.groupby(['Month', 'Region'])['Value'].cumsum()
Result:
Year Month Region Value cumsum
0 1978 1 South 1.0 1.0
1 1990 1 North 22.0 22.0
2 1990 2 South 33.0 33.0
3 1990 2 Mid W 12.0 12.0
4 1998 1 South 1.0 2.0
5 1998 1 North 12.0 34.0
6 1998 2 South 2.0 35.0
7 1998 3 South 4.0 4.0
8 1998 1 Mid W 2.0 2.0
Here's another solution that corresponds more closely with your expected output.
df = pd.DataFrame({'Year': [1978,1990,1990,1990,1998,1998,1998,1998,1998],
'Month': [1,1,2,2,1,1,2,3,1],
'Region': ['South','North','South','Mid West','South','North','South','South','Mid West'],
'Value' : [1,22,33,12,1,12,2,4,2]})
#DataFrame Result
Year Month Region Value
0 1978 1 South 1
1 1990 1 North 22
2 1990 2 South 33
3 1990 2 Mid West 12
4 1998 1 South 1
5 1998 1 North 12
6 1998 2 South 2
7 1998 3 South 4
8 1998 1 Mid West 2
Code to run:
df1 = df.groupby(['Month', 'Region'])[['Value']].sum()
#Final Result
Month Region Value
1 Mid West 2
1 North 34
1 South 2
2 Mid West 12
2 South 35
3 South 4
This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 4 years ago.
I would like your help with a pandas DataFrame merge.
first dataframe is
D = { Year, Age, Location, column1, column2... }
2013, 20 , america, ..., ...
2013, 35, usa, ..., ...
2011, 32, asia, ..., ...
2008, 45, japan, ..., ...
shape is 38654 rows x 14 columns
second dataframe is
D = { Year, Location, column1, column2... }
2008, usa, ..., ...
2008, usa, ..., ...
2009, asia, ..., ...
2009, asia, ..., ...
2010, japna, ..., ...
shape is 96 rows x 7 columns
I want to merge or join the two DataFrames. How can I do it?
Thanks.
IIUC you need merge with how='left' if you want a left join on the Year and Location columns:
print (df1)
Year Age Location column1 column2
0 2013 20 america 7 5
1 2008 35 usa 8 1
2 2011 32 asia 9 3
3 2008 45 japan 7 1
print (df2)
Year Location column1 column2
0 2008 usa 8 9
1 2008 usa 7 2
2 2009 asia 8 2
3 2009 asia 0 1
4 2010 japna 9 3
df = pd.merge(df1,df2, on=['Year','Location'], how='left')
print (df)
Year Age Location column1_x column2_x column1_y column2_y
0 2013 20 america 7 5 NaN NaN
1 2008 35 usa 8 1 8.0 9.0
2 2008 35 usa 8 1 7.0 2.0
3 2011 32 asia 9 3 NaN NaN
4 2008 45 japan 7 1 NaN NaN
You can also check the pandas merge documentation.
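As a side note, merge's indicator=True parameter adds a _merge column that tells you which rows found a match; that makes the NaN rows in the left join above easy to audit. A small sketch with toy frames trimmed from the answer's data:

```python
import pandas as pd

df1 = pd.DataFrame({'Year': [2013, 2008, 2011],
                    'Location': ['america', 'usa', 'asia'],
                    'Age': [20, 35, 32]})
df2 = pd.DataFrame({'Year': [2008, 2009],
                    'Location': ['usa', 'asia'],
                    'column1': [8, 0]})

# indicator=True tags each output row as 'both', 'left_only', or 'right_only'
df = pd.merge(df1, df2, on=['Year', 'Location'], how='left', indicator=True)
print(df[['Year', 'Location', '_merge']])
```

Rows tagged 'left_only' are exactly the ones whose right-hand columns come out NaN.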