I have some data and after using a groupby function I now have a series that looks like this:
year
1997 15
1998 22
1999 24
2000 24
2001 28
2002 11
2003 15
2004 19
2005 10
2006 10
2007 21
2008 26
2009 23
2010 16
2011 33
2012 19
2013 26
2014 25
How can I create a pandas DataFrame from this, with year as one column and the other column named sightings?
I am a pandas novice so don't really know what I am doing. I have tried the reindex and unstack functions but haven't been able to get what I want...
You can use reset_index and rename columns:
print (df.reset_index())
index year
0 1997 15
1 1998 22
2 1999 24
3 2000 24
4 2001 28
5 2002 11
6 2003 15
7 2004 19
8 2005 10
9 2006 10
10 2007 21
11 2008 26
12 2009 23
13 2010 16
14 2011 33
15 2012 19
16 2013 26
17 2014 25
print (df.reset_index().rename(columns=({'index':'year','year':'sightings'})))
year sightings
0 1997 15
1 1998 22
2 1999 24
3 2000 24
4 2001 28
5 2002 11
6 2003 15
7 2004 19
8 2005 10
9 2006 10
10 2007 21
11 2008 26
12 2009 23
13 2010 16
14 2011 33
15 2012 19
16 2013 26
17 2014 25
Another solution is to set the column names from a list:
df1 = df.reset_index()
df1.columns = ['year','sightings']
print (df1)
year sightings
0 1997 15
1 1998 22
2 1999 24
3 2000 24
4 2001 28
5 2002 11
6 2003 15
7 2004 19
8 2005 10
9 2006 10
10 2007 21
11 2008 26
12 2009 23
13 2010 16
14 2011 33
15 2012 19
16 2013 26
17 2014 25
EDIT:
Sometimes it helps to pass as_index=False to groupby so that it returns a DataFrame:
import pandas as pd
df = pd.DataFrame({'A':[1,1,3],
'B':[4,5,6]})
print (df)
A B
0 1 4
1 1 5
2 3 6
print (df.groupby('A')['B'].sum())
A
1 9
3 6
Name: B, dtype: int64
print (df.groupby('A', as_index=False)['B'].sum())
A B
0 1 9
1 3 6
s.rename('sightings').reset_index()
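The one-liner above assumes the grouped result is a Series called `s`. As a minimal sketch (the three sample rows are copied from the question, and `out` and `out_alt` are illustrative names), it could be run like this; `reset_index(name=...)` appears to be an equivalent spelling:

```python
import pandas as pd

# A small Series standing in for the groupby result; the index is named "year".
s = pd.Series([15, 22, 24], index=pd.Index([1997, 1998, 1999], name="year"))

# Name the values, then move the index into a regular column.
out = s.rename("sightings").reset_index()

# Equivalent: let reset_index name the value column directly.
out_alt = s.reset_index(name="sightings")

print(out)
```

Either form avoids the separate rename step, because the value column gets its name before (or while) the index is turned into a column.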
I've also used this method during the groupby stage to put the results straight into a dataframe:
df2 = df1.groupby(['Year']).count()
df3 = pd.DataFrame(df2).reset_index()
If your original dataframe (df1) had "Year" and "Sightings" as its two columns, then df3 should have each year listed under "Year" and the count (or sum, average, whatever) listed under "Sightings".
If not, you can change the column names by doing the following:
df3.columns = ['Year','Sightings']
or
df3 = df3.rename(columns={'oldname_A': 'Year', 'oldname_B': 'Sightings'})
I wanted to know if there's a way to melt a DataFrame with multiple column names.
I have this Pandas Data Frame:
Edad 2000 2001 2002 2003 ... 2017 2018 2019 2020
...
[15-25] 126675 158246 171958 188389 ... 78707 70246 65661 52209
(25-35] 65823 85059 92841 95394 ... 88479 157492 149862 122067
(35-45] 37474 48605 54593 56279 ... 65870 65798 64587 51502
(45-55] 20624 22067 25860 27601 ... 39476 40725 40566 33979
(55-65] 30240 9047 10500 10972 ... 20135 21095 21173 17242
And would like to have something like this:
Edad Year Value
[15-25] 2000 126675
[15-25] 2001 158246
[15-25] 2002 171958
[15-25] 2003 188389
I've used melt before, but I have always named a value column. This time my values are spread across the year columns, and I'm having a very hard time figuring out how to address them.
You can use melt with groupby and sort like this:
df.melt(id_vars='Edad', var_name='Year').groupby(['Edad','Year']).agg({'value':'first'}).reset_index().sort_values(by=['Edad','Year'], ascending=[False,True])
Desired results:
Edad Year value
32 [15-25] 2000 126675
33 [15-25] 2001 158246
34 [15-25] 2002 171958
35 [15-25] 2003 188389
36 [15-25] 2017 78707
37 [15-25] 2018 70246
38 [15-25] 2019 65661
39 [15-25] 2020 52209
24 (55-65] 2000 30240
25 (55-65] 2001 9047
26 (55-65] 2002 10500
27 (55-65] 2003 10972
28 (55-65] 2017 20135
29 (55-65] 2018 21095
30 (55-65] 2019 21173
31 (55-65] 2020 17242
16 (45-55] 2000 20624
17 (45-55] 2001 22067
18 (45-55] 2002 25860
19 (45-55] 2003 27601
20 (45-55] 2017 39476
21 (45-55] 2018 40725
22 (45-55] 2019 40566
23 (45-55] 2020 33979
8 (35-45] 2000 37474
9 (35-45] 2001 48605
10 (35-45] 2002 54593
11 (35-45] 2003 56279
12 (35-45] 2017 65870
13 (35-45] 2018 65798
14 (35-45] 2019 64587
15 (35-45] 2020 51502
0 (25-35] 2000 65823
1 (25-35] 2001 85059
2 (25-35] 2002 92841
3 (25-35] 2003 95394
4 (25-35] 2017 88479
5 (25-35] 2018 157492
6 (25-35] 2019 149862
7 (25-35] 2020 122067
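If each (Edad, Year) pair occurs only once, the groupby step is probably not needed, since melt alone already yields one row per pair. A minimal sketch under that assumption, using two age bands and two year columns copied from the question (`long_df` is an illustrative name):

```python
import pandas as pd

# Two age bands and two year columns copied from the question's table.
df = pd.DataFrame({
    "Edad": ["[15-25]", "(25-35]"],
    "2000": [126675, 65823],
    "2001": [158246, 85059],
})

# Every column other than "Edad" is unpivoted into (Year, Value) pairs.
long_df = (
    df.melt(id_vars="Edad", var_name="Year", value_name="Value")
      .sort_values(["Edad", "Year"], ascending=[False, True])
      .reset_index(drop=True)
)
print(long_df)
```

Leaving out `value_vars` tells melt to unpivot all remaining columns, which is why the year columns do not have to be listed one by one.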
I am trying to extract the last year (YY) of a fiscal date string in the format YYYY-YY. For example, the last year of '1999-00' would be 2000.
Current code seems to cover most cases other than this.
import pandas as pd
import numpy as np
test_df = pd.DataFrame(data={'Season':['1996-97', '1997-98', '1998-99',
'1999-00', '2000-01', '2001-02',
'2002-03','2003-04','2004-05',
'2005-06','2006-07','2007-08',
'2008-09', '2009-10', '2010-11', '2011-12'],
'Height':np.random.randint(20, size=16),
'Weight':np.random.randint(40, size=16)})
I need logic for the case where the season ends at the turn of a century: there, my apply method should add one to the first two digits. I believe this is the only case I am missing.
Current code is as follows:
test_df['Season'] = test_df['Season'].apply(lambda x: x[0:2] + x[5:7])
This should work too:
pd.to_numeric(test_df['Season'].str.split('-').str[0]) + 1
Output:
0 1997
1 1998
2 1999
3 2000
4 2001
5 2002
6 2003
7 2004
8 2005
9 2006
10 2007
11 2008
12 2009
13 2010
14 2011
15 2012
You can use .str.extract to extract the first four digits and then add 1:
test_df['Season'] = test_df['Season'].str.extract(r'^(\d{4})', expand=False).astype(int).add(1)
Season Height Weight
0 1997 4 22
1 1998 18 4
2 1999 19 27
3 2000 7 10
4 2001 19 9
5 2002 18 31
6 2003 19 9
7 2004 18 29
8 2005 13 17
9 2006 13 30
10 2007 5 14
11 2008 15 3
12 2009 13 10
13 2010 15 8
14 2011 0 23
15 2012 2 38
Here you go! Use the following function instead of the lambda:
def get_season(string):
    century = int(string[:2])
    preyear = int(string[2:4])
    postyear = int(string[5:7])
    if postyear < preyear:
        century += 1
    # zfill is so that "1" becomes "01"
    return str(century).zfill(2) + str(postyear).zfill(2)
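To check the century rollover, the function can be applied to a few of the question's seasons. This is only a sketch: the function is repeated so the snippet runs on its own, and `seasons` and `end_years` are illustrative names.

```python
import pandas as pd

def get_season(string):
    century = int(string[:2])
    preyear = int(string[2:4])
    postyear = int(string[5:7])
    # A smaller second year means the season crossed into a new century.
    if postyear < preyear:
        century += 1
    return str(century).zfill(2) + str(postyear).zfill(2)

# Seasons around the 1999-00 boundary, taken from the question's data.
seasons = pd.Series(["1998-99", "1999-00", "2000-01"])

# The function returns strings, so convert to int if numbers are wanted.
end_years = seasons.apply(get_season).astype(int)
print(end_years.tolist())
```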
I use the fiscalyear module.
import numpy as np
import pandas as pd
import fiscalyear as fy
...
test_df['Season'] = test_df['Season'].apply(lambda x : fy.FiscalYear(int(x[0:4]) + 1).fiscal_year)
print(test_df)
ID LIST_OF_TUPLE (2col)
1 [('2012','12'), ('2012','33'), ('2014', '82')]
2 NA
3 [('2012','12')]
4 [('2012','12'), ('2012','33'), ('2014', '82'), ('2022', '67')]
Result:
ID TUP_1 TUP_2(3col)
1 2012 12
1 2012 33
1 2014 82
3 2012 12
4 2012 12
4 2012 33
4 2014 82
4 2022 67
Thanks in advance.
You can explode the column, build a dataframe from the tuples, and then join it back to the IDs:
s = df['LIST_OF_TUPLE'].explode()
out = (df[['ID']].join(pd.DataFrame(s.tolist(),index=s.index)
.add_prefix("TUP_")).reset_index(drop=True)) #you can chain a dropna if reqd
print(out)
ID TUP_0 TUP_1
0 1 2012 12
1 1 2012 33
2 1 2014 82
3 2 NaN None
4 3 2012 12
5 4 2012 12
6 4 2012 33
7 4 2014 82
8 4 2022 67
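If the rows without tuples should be dropped, as in the question's expected result, one possible variation is to call dropna right after explode. The sketch below assumes a missing entry is stored as None and uses a shortened copy of the question's data; the column names TUP_1 and TUP_2 follow the question.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 2, 3],
    "LIST_OF_TUPLE": [
        [("2012", "12"), ("2012", "33")],
        None,                      # missing entry, as for ID 2 in the question
        [("2012", "12")],
    ],
})

# One row per tuple; the original row label is repeated in the index.
s = df["LIST_OF_TUPLE"].explode().dropna()

# Split each tuple into two columns, then attach the matching ID by row label.
out = pd.DataFrame(s.tolist(), columns=["TUP_1", "TUP_2"])
out.insert(0, "ID", df.loc[s.index, "ID"].to_numpy())
print(out)
```

Looking the IDs up through `s.index` works here because explode keeps the original row label on every exploded row.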
I am trying to create a new variable that holds the difference in SALES_AMOUNT between the same month of consecutive years in the following dataframe. I think my code should start from the groupby below, but I don't know how to add the condition [df.Control - df.Control.shift(1) == 12] after the groupby so that the difference is only taken between consecutive years:
df['LY'] = df.groupby(['month']).SALES_AMOUNT.shift(1)
Dataframe:
SALES_AMOUNT Store Control year month
0 16793.14 A 3 2013 3
1 42901.61 A 5 2013 5
2 63059.72 A 6 2013 6
3 168471.43 A 10 2013 10
4 58570.72 A 11 2013 11
5 67526.71 A 12 2013 12
6 50649.07 A 14 2014 2
7 48819.97 A 18 2014 6
8 97100.77 A 19 2014 7
9 67778.40 A 21 2014 9
10 90327.52 A 22 2014 10
11 75703.12 A 23 2014 11
12 26098.50 A 24 2014 12
13 81429.36 A 25 2015 1
14 19539.85 A 26 2015 2
15 71727.66 A 27 2015 3
16 20117.79 A 28 2015 4
17 44252.19 A 29 2015 6
18 68578.82 A 30 2015 7
19 91483.39 A 31 2015 8
20 39220.87 A 32 2015 10
21 12224.11 A 33 2015 11
The result should look like this:
SALES_AMOUNT Store Control year month year_diff
0 16793.14 A 3 2013 3 Nan
1 42901.61 A 5 2013 5 Nan
2 63059.72 A 6 2013 6 Nan
3 168471.43 A 10 2013 10 Nan
4 58570.72 A 11 2013 11 Nan
5 67526.71 A 12 2013 12 Nan
6 50649.07 A 14 2014 2 Nan
7 48819.97 A 18 2014 6 -14239.75
8 97100.77 A 19 2014 7 Nan
9 67778.40 A 21 2014 9 Nan
10 90327.52 A 22 2014 10 -78143.91
11 75703.12 A 23 2014 11 17132.4
12 26098.50 A 24 2014 12 -41428.21
13 81429.36 A 25 2015 1 Nan
14 19539.85 A 26 2015 2 -31109.22
15 71727.66 A 27 2015 3 Nan
16 20117.79 A 28 2015 4 Nan
17 44252.19 A 29 2015 6 -4567.78
18 68578.82 A 30 2015 7 -28521.95
19 91483.39 A 31 2015 8 Nan
20 39220.87 A 32 2015 10 -51106.65
21 12224.11 A 33 2015 11 -63479.01
I think what you're looking for is the below:
df = df.sort_values(by=['month', 'year'])
df['SALES_AMOUNT_shifted'] = df.groupby(['month'])['SALES_AMOUNT'].shift(1).tolist()
df['LY'] = df['SALES_AMOUNT'] - df['SALES_AMOUNT_shifted']
Once you sort by month and year, the month groups will be organized in a consistent way and then the shift makes sense.
-- UPDATE --
After applying the solution above, you could set LY to None wherever the year difference within a month group is not exactly 1.
df['year_diff'] = df['year'] - df.groupby(['month'])['year'].shift()
df['year_diff'] = df['year_diff'].fillna(0)
df.loc[df['year_diff'] != 1, 'LY'] = None
Using this I'm getting the desired output that you added.
Does this work? I would also greatly appreciate a pandas-centric solution, as I spent some time on this and could not come up with one.
df = pd.read_clipboard().set_index('Control')
df['yoy_diff'] = np.nan
for i in df.index:
    for j in df.index:
        if j - i == 12:
            # assign with a single .loc call to avoid chained assignment
            df.loc[j, 'yoy_diff'] = df.loc[j, 'SALES_AMOUNT'] - df.loc[i, 'SALES_AMOUNT']
df
Output:
SALES_AMOUNT Store year month yoy_diff
Control
3 16793.14 A 2013 3 NaN
5 42901.61 A 2013 5 NaN
6 63059.72 A 2013 6 NaN
10 168471.43 A 2013 10 NaN
11 58570.72 A 2013 11 NaN
12 67526.71 A 2013 12 NaN
14 50649.07 A 2014 2 NaN
18 48819.97 A 2014 6 -14239.75
19 97100.77 A 2014 7 NaN
21 67778.40 A 2014 9 NaN
22 90327.52 A 2014 10 -78143.91
23 75703.12 A 2014 11 17132.40
24 26098.50 A 2014 12 -41428.21
25 81429.36 A 2015 1 NaN
26 19539.85 A 2015 2 -31109.22
27 71727.66 A 2015 3 NaN
28 20117.79 A 2015 4 NaN
29 44252.19 A 2015 6 NaN
30 68578.82 A 2015 7 19758.85
31 91483.39 A 2015 8 -5617.38
32 39220.87 A 2015 10 NaN
33 12224.11 A 2015 11 -55554.29
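A loop-free variant of the same idea might look up the value whose Control is exactly 12 lower with Series.map. This is a sketch only: it assumes Control values are unique (a single store), uses six rows copied from the question, and `sales_by_control` and `last_year` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    "SALES_AMOUNT": [63059.72, 168471.43, 58570.72, 48819.97, 90327.52, 75703.12],
    "Control": [6, 10, 11, 18, 22, 23],
})

# Look up sales by Control value (assumes Control is unique).
sales_by_control = df.set_index("Control")["SALES_AMOUNT"]

# Sales 12 periods earlier; NaN where that Control value does not exist.
last_year = (df["Control"] - 12).map(sales_by_control)

df["yoy_diff"] = df["SALES_AMOUNT"] - last_year
print(df)
```

With several stores, the lookup would presumably have to be keyed on both Store and Control, for example through a merge on those two columns.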
I would like to know how I can add a year-over-year growth rate to the following data in pandas.
Date Total Managed Expenditure
0 2001 503.2
1 2002 529.9
2 2003 559.8
3 2004 593.2
4 2005 629.5
5 2006 652.1
6 2007 664.3
7 2008 688.2
8 2009 732.0
9 2010 759.2
10 2011 769.2
11 2012 759.8
12 2013 760.6
13 2014 753.3
14 2015 757.6
15 2016 753.9
Use Series.pct_change():
df['Total Managed Expenditure'].pct_change()
Out:
0 NaN
1 0.053060
2 0.056426
3 0.059664
4 0.061194
5 0.035902
6 0.018709
7 0.035978
8 0.063644
9 0.037158
10 0.013172
11 -0.012220
12 0.001053
13 -0.009598
14 0.005708
15 -0.004884
Name: Total Managed Expenditure, dtype: float64
To assign it back:
df['Growth Rate'] = df['Total Managed Expenditure'].pct_change()
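For reference, pct_change computes (current - previous) / previous for each row, so the first row has no value. A small sketch on the first three rows of the question's data, assuming the rows are already sorted by Date (the "Growth Rate (%)" column name and the `manual` variable are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "Date": [2001, 2002, 2003],
    "Total Managed Expenditure": [503.2, 529.9, 559.8],
})

col = "Total Managed Expenditure"
growth = df[col].pct_change()

# The same quantity written out: (current - previous) / previous.
manual = (df[col] - df[col].shift(1)) / df[col].shift(1)

# Store the rate as a percentage rounded to two decimals.
df["Growth Rate (%)"] = (growth * 100).round(2)
print(df)
```

If the rows were not in chronological order, sorting by Date first would be needed for the rate to be meaningful.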