Pandas Sorting a column after grouping by two columns - python

I have a dataframe df as:
Election Year Votes Vote % Party Region
0 2000 42289 29.40 Janata Dal (United) A
1 2000 27618 19.20 Rashtriya Janata Dal A
2 2000 20886 14.50 Bahujan Samaj Party B
3 2000 17747 12.40 Congress B
4 2000 14047 19.80 Independent C
5 2000 17047 10.80 JLS C
6 2005 8358 15.80 Janvadi Party A
7 2005 4428 13.10 Independent A
8 2005 1647 1.20 Independent B
9 2005 1610 11.10 Independent B
10 2005 1334 15.06 Nationalist C
11 2005 1834 18.06 NJM C
12 2010 21114 20.80 Independent A
13 2010 1042 10.5 Bharatiya Janta Dal A
14 2010 835 0.60 Independent B
15 2010 14305 15.50 Independent B
16 2010 22211 17.70 Congress C
16 2010 2211 17.70 INC C
I have used the following code to sort "Vote %" in descending order after grouping by "Election Year" and "Region", but it gives an error.
df1 = df.groupby(['Election Year','Region'])sort_values('Vote %', ascending = False).reset_index()
How can I correct the error? After sorting, I want to get the top 3 "Party" values of each region in each year.

You can perform the grouping and the in-group sorting through sort_values itself:
df1 = df.sort_values(['Election Year','Region', 'Vote %'], ascending=False)
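If you then want the top 3 "Party" rows of each region in each year, one option (a sketch building on the sorted frame above, not part of the original answer) is to keep the first three rows of each group:
# after the descending sort, the first 3 rows of each (year, region) group are the top 3 by "Vote %"
top3 = df1.groupby(['Election Year', 'Region']).head(3).reset_index(drop=True)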


Replace last value(s) of group with NaN

My goal is to replace the last value (or the last several values) of each id with NaN. My real dataset is quite large and has groups of different sizes.
Example:
import pandas as pd
ids = [1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3]
year = [2000,2001,2002,2003,2004,2005,1990,1991,1992,1993,1994,1995,2010,2011,2012,2013,2014,2015]
percent = [120,70,37,40,50,110,140,100,90,5,52,80,60,40,70,60,50,110]
dictex ={"id":ids,"year":year,"percent [%]": percent}
dfex = pd.DataFrame(dictex)
print(dfex)
id year percent [%]
0 1 2000 120
1 1 2001 70
2 1 2002 37
3 1 2003 40
4 1 2004 50
5 1 2005 110
6 2 1990 140
7 2 1991 100
8 2 1992 90
9 2 1993 5
10 2 1994 52
11 2 1995 80
12 3 2010 60
13 3 2011 40
14 3 2012 70
15 3 2013 60
16 3 2014 50
17 3 2015 110
My goal is to replace the last 1, 2, or 3 values of the "percent [%]" column for each id (group) with NaN.
The result should look like this: (here: replace the last 2 values of each id)
id year percent [%]
0 1 2000 120
1 1 2001 70
2 1 2002 37
3 1 2003 40
4 1 2004 NaN
5 1 2005 NaN
6 2 1990 140
7 2 1991 100
8 2 1992 90
9 2 1993 5
10 2 1994 NaN
11 2 1995 NaN
12 3 2010 60
13 3 2011 40
14 3 2012 70
15 3 2013 60
16 3 2014 NaN
17 3 2015 NaN
I know there should be a relatively easy solution for this, but I'm new to Python and simply haven't been able to figure out an elegant way.
Thanks for the help!
Try using groupby, tail, and index to find the index of the rows that will be modified, then use loc to change the values:
import numpy as np

nrows = 2
idx = dfex.groupby('id').tail(nrows).index
dfex.loc[idx, 'percent [%]'] = np.nan
# output
id year percent [%]
0 1 2000 120.0
1 1 2001 70.0
2 1 2002 37.0
3 1 2003 40.0
4 1 2004 NaN
5 1 2005 NaN
6 2 1990 140.0
7 2 1991 100.0
8 2 1992 90.0
9 2 1993 5.0
10 2 1994 NaN
11 2 1995 NaN
12 3 2010 60.0
13 3 2011 40.0
14 3 2012 70.0
15 3 2013 60.0
16 3 2014 NaN
17 3 2015 NaN
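An alternative sketch (my variation, not part of the original answer) numbers the rows of each group from the end with cumcount and builds a boolean mask directly:
import numpy as np

nrows = 2
# cumcount(ascending=False) numbers rows from the end of each group: last row = 0, second-to-last = 1, ...
mask = dfex.groupby('id').cumcount(ascending=False) < nrows
dfex.loc[mask, 'percent [%]'] = np.nan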

Finding all values in between specific values in data frame

I have this dataframe:
df
name timestamp year
0 A 2004 1995
1 D 2008 2004
2 M 2005 2006
3 T 2003 2007
4 B 1995 2008
5 C 2007 2003
6 D 2005 2001
7 E 2009 2005
8 A 2018 2009
9 L 2016 2018
What I am doing: on the basis of the first two entries in df['timestamp'], I fetch all the values from df['year'] that fall between these two entries, which in this case is 2004-2008.
y1 = df['timestamp'].iloc[0]
y2 = df['timestamp'].iloc[1]
movies = df[df['year'].between(y1, y2,inclusive=True )]
movies
name timestamp year
1 D 2008 2004
2 M 2005 2006
3 T 2003 2007
4 B 1995 2008
7 E 2009 2005
This is working fine for me. But when the greater value is at the first index and the lower one at the second (e.g. 2008-2004), the result is empty.
df
name timestamp year
0 A 2008 1995
1 D 2004 2004
2 M 2005 2006
3 T 2003 2007
4 B 1995 2008
5 C 2007 2003
6 D 2005 2001
7 E 2009 2005
8 A 2018 2009
9 L 2016 2018
In this case I fetch nothing.
Expected Outcome:
What I want is that, whether the first value is greater or smaller, I get the in-between values every time.
You could use Series.head and Series.agg:
y1, y2 = df['timestamp'].head(2).agg(['min', 'max'])
movies = df[df['year'].between(y1, y2,inclusive=True )]
[out]
name timestamp year
1 D 2004 2004
2 M 2005 2006
3 T 2003 2007
4 B 1995 2008
7 E 2009 2005
You can fix that by changing just two lines of code:
y1 = min(df['timestamp'].iloc[0], df['timestamp'].iloc[1])
y2 = max(df['timestamp'].iloc[0], df['timestamp'].iloc[1])
This way y1 is always less than or equal to y2.
However, as @ALollz pointed out, you can save both computation and coding time by using
y1,y2 = np.sort(df['timestamp'].head(2))
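Putting it together, a minimal self-contained sketch (reconstructing the question's second dataframe by hand) would be:
import numpy as np
import pandas as pd

df = pd.DataFrame({'name': list('ADMTBCDEAL'),
                   'timestamp': [2008, 2004, 2005, 2003, 1995, 2007, 2005, 2009, 2018, 2016],
                   'year': [1995, 2004, 2006, 2007, 2008, 2003, 2001, 2005, 2009, 2018]})

# min/max of the first two timestamps, regardless of which one is larger
y1, y2 = np.sort(df['timestamp'].head(2))
movies = df[df['year'].between(y1, y2)]  # between is inclusive on both ends by default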

Comparing daily value in each year in DataFrame to same day-number's value in another specific year

I have a daily time series of closing prices of a financial instrument going back to 1990.
I am trying to compare the daily percentage change for each trading day of the previous years to its respective trading day in 2019. I have 41 trading days of data for 2019 at this time.
I get so far as filtering down and creating a new DataFrame with only the first 41 dates, closing prices, daily percentage changes, and the "trading day of year" ("tdoy") classifier for each day in the set, but am not having luck from there.
I've found other Stack Overflow questions that help people compare datetime days, weeks, years, etc. but I am not able to recreate this because of the arbitrary value each "tdoy" represents.
I won't bother creating a sample DataFrame because of the number of rows, so I've linked the CSV I've come up with to this point: Sample CSV.
I think the easiest approach would be to create a new column that returns the 2019 percentage change for each corresponding "tdoy" (Trading Day of Year) using df.loc. If I could figure this much out, I could then create yet another column with the simple difference between that year/day's percentage change and 2019's respective value. Below is what I've tried (among other variations), to no avail.
df['2019'] = df['perc'].loc[((df.year == 2019) & (df.tdoy == df.tdoy))]
I've tried to search Stack and Google in probably 20 different variations of my problem and can't seem to find an answer that fits my issue of arbitrary "Trading Day of Year" classification.
I'm sure the answer is right in front of my face somewhere but I am still new to data wrangling.
The first step is to import the CSV properly. I'm not sure if you made the adjustment, but your data's date column is a string object.
# import the csv and assign to df. parse dates to datetime
df = pd.read_csv('TimeSeriesEx.csv', parse_dates=['Dates'])
# filter the dataframe so that you only have 2019 and 2018 data
df = df[df['year'] >= 2018]
df.tail()
Unnamed: 0 Dates last perc year tdoy
1225 7601 2019-02-20 29.96 0.007397 2019 37
1226 7602 2019-02-21 30.49 0.017690 2019 38
1227 7603 2019-02-22 30.51 0.000656 2019 39
1228 7604 2019-02-25 30.36 -0.004916 2019 40
1229 7605 2019-02-26 30.03 -0.010870 2019 41
Put the tdoy and year into a multiindex.
# create a multiindex
df.set_index(['tdoy','year'], inplace=True)
df.tail()
Unnamed: 0 Dates last perc
tdoy year
37 2019 7601 2019-02-20 29.96 0.007397
38 2019 7602 2019-02-21 30.49 0.017690
39 2019 7603 2019-02-22 30.51 0.000656
40 2019 7604 2019-02-25 30.36 -0.004916
41 2019 7605 2019-02-26 30.03 -0.010870
Make pivot table
# make a pivot table and assign it to a variable
df1 = df.pivot_table(values='last', index='tdoy', columns='year')
df1.head()
year 2018 2019
tdoy
1 33.08 27.55
2 33.38 27.90
3 33.76 28.18
4 33.74 28.41
5 33.65 28.26
Create calculated column
# create the new column
df1['pct_change'] = (df1[2019]-df1[2018])/df1[2018]
df1
year 2018 2019 pct_change
tdoy
1 33.08 27.55 -0.167170
2 33.38 27.90 -0.164170
3 33.76 28.18 -0.165284
4 33.74 28.41 -0.157973
5 33.65 28.26 -0.160178
6 33.43 28.18 -0.157045
7 33.55 28.32 -0.155887
8 33.29 27.94 -0.160709
9 32.97 28.17 -0.145587
10 32.93 28.11 -0.146371
11 32.93 28.24 -0.142423
12 32.79 28.23 -0.139067
13 32.51 28.77 -0.115042
14 32.23 29.01 -0.099907
15 32.28 29.01 -0.101301
16 32.16 29.06 -0.096393
17 32.52 29.38 -0.096556
18 32.68 29.51 -0.097001
19 32.50 30.03 -0.076000
20 32.79 30.30 -0.075938
21 32.87 30.11 -0.083967
22 33.08 30.42 -0.080411
23 33.07 30.17 -0.087693
24 32.90 29.89 -0.091489
25 32.51 30.13 -0.073208
26 32.50 30.38 -0.065231
27 33.16 30.90 -0.068154
28 32.56 30.81 -0.053747
29 32.21 30.87 -0.041602
30 31.96 30.24 -0.053817
31 31.85 30.33 -0.047724
32 31.57 29.99 -0.050048
33 31.80 29.89 -0.060063
34 31.70 29.95 -0.055205
35 31.54 29.95 -0.050412
36 31.54 29.74 -0.057070
37 31.86 29.96 -0.059636
38 32.07 30.49 -0.049267
39 32.04 30.51 -0.047753
40 32.36 30.36 -0.061805
41 32.62 30.03 -0.079399
Altogether, without comments and data, the code looks like:
df = pd.read_csv('TimeSeriesEx.csv', parse_dates=['Dates'])
df = df[df['year'] >= 2018]
df.set_index(['tdoy','year'], inplace=True)
df1 = df.pivot_table(values='last', index='tdoy', columns='year')
df1['pct_change'] = (df1[2019]-df1[2018])/df1[2018]
[EDIT] The poster asked for all years compared to 2019.
df = pd.read_csv('TimeSeriesEx.csv', parse_dates=['Dates'])
df.set_index(['tdoy','year'], inplace=True)
This time, skip the year filter above and create the pivot table:
df1 = df.pivot_table(values='last', index='tdoy', columns='year')
Create a loop going through the year columns and create a new field for each year compared to 2019.
for y in df1.columns:
    df1[str(y) + '_pct_change'] = (df1[2019]-df1[y])/df1[y]
To view some data...
df1.loc[1:4, "1990_pct_change":"1994_pct_change"]
year 1990_pct_change 1991_pct_change 1992_pct_change 1993_pct_change 1994_pct_change
tdoy
1 0.494845 0.328351 0.489189 0.345872 -0.069257
2 0.496781 0.364971 0.516304 0.361640 -0.045828
3 0.523243 0.382050 0.527371 0.369956 -0.035262
4 0.524960 0.400888 0.531536 0.367838 -0.034659
Final code for all years:
df = pd.read_csv('TimeSeriesEx.csv', parse_dates=['Dates'])
df.set_index(['tdoy','year'], inplace=True)
df1 = df.pivot_table(values='last', index='tdoy', columns='year')
for y in df1.columns:
    df1[str(y) + '_pct_change'] = (df1[2019]-df1[y])/df1[y]
df1
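One small caveat (my observation, not part of the original answer): df1.columns includes 2019 itself, so the loop also produces a 2019_pct_change column that is all zeros. If that bothers you, skip it inside the loop:
for y in df1.columns:
    if y != 2019:
        df1[str(y) + '_pct_change'] = (df1[2019] - df1[y]) / df1[y]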
I also came up with my own answer, more along the lines of what I was originally trying to accomplish. Here is the DataFrame I'll work with for the example, df:
Dates last perc year tdoy
0 2016-01-04 29.93 -0.020295 2016 2
1 2016-01-05 29.63 -0.010023 2016 3
2 2016-01-06 29.59 -0.001350 2016 4
3 2016-01-07 29.44 -0.005069 2016 5
4 2017-01-03 34.57 0.004358 2017 2
5 2017-01-04 34.98 0.011860 2017 3
6 2017-01-05 35.00 0.000572 2017 4
7 2017-01-06 34.77 -0.006571 2017 5
8 2018-01-02 33.38 0.009069 2018 2
9 2018-01-03 33.76 0.011384 2018 3
10 2018-01-04 33.74 -0.000592 2018 4
11 2018-01-05 33.65 -0.002667 2018 5
12 2019-01-02 27.90 0.012704 2019 2
13 2019-01-03 28.18 0.010036 2019 3
14 2019-01-04 28.41 0.008162 2019 4
15 2019-01-07 28.26 -0.005280 2019 5
I created a DataFrame with only the 2019 values for tdoy and perc
df19 = df[['tdoy','perc']].loc[df['year'] == 2019]
and then zipped a dictionary for those values
perc19 = dict(zip(df19.tdoy,df19.perc))
to end up with
perc19=
{2: 0.012704174228675058,
3: 0.010035842293906852,
4: 0.008161816891412365,
5: -0.005279831045406497}
Then map these keys onto the tdoy column in the original DataFrame to create a column titled 2019 that holds the corresponding 2019 percentage-change value for each trading day:
df['2019'] = df['tdoy'].map(perc19)
and then create a vs2019 column where I find the difference of 2019 vs. perc and square it (a sketch of this step follows the table), yielding:
Dates last perc year tdoy 2019 vs2019
0 2016-01-04 29.93 -0.020295 2016 2 0.012704 6.746876
1 2016-01-05 29.63 -0.010023 2016 3 0.010036 3.995038
2 2016-01-06 29.59 -0.001350 2016 4 0.008162 1.358162
3 2016-01-07 29.44 -0.005069 2016 5 -0.005280 0.001590
4 2017-01-03 34.57 0.004358 2017 2 0.012704 0.431608
5 2017-01-04 34.98 0.011860 2017 3 0.010036 0.033038
6 2017-01-05 35.00 0.000572 2017 4 0.008162 0.864802
7 2017-01-06 34.77 -0.006571 2017 5 -0.005280 0.059843
8 2018-01-02 33.38 0.009069 2018 2 0.012704 0.081880
9 2018-01-03 33.76 0.011384 2018 3 0.010036 0.018047
10 2018-01-04 33.74 -0.000592 2018 4 0.008162 1.150436
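The code for the vs2019 column isn't shown in the post; judging from the values above, it appears to be the squared relative difference between perc and the 2019 column, roughly:
# squared relative difference of each day's perc vs. the 2019 value for the same tdoy
df['vs2019'] = ((df['perc'] - df['2019']) / df['2019']) ** 2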
From here I can groupby in various ways and further calculate to find most similar trending percentage changes vs. the year I am comparing against (2019).

How to add a column with the growth rate in a budget table in Pandas?

I would like to know how I can add a year-over-year growth rate to the following data in Pandas.
Date Total Managed Expenditure
0 2001 503.2
1 2002 529.9
2 2003 559.8
3 2004 593.2
4 2005 629.5
5 2006 652.1
6 2007 664.3
7 2008 688.2
8 2009 732.0
9 2010 759.2
10 2011 769.2
11 2012 759.8
12 2013 760.6
13 2014 753.3
14 2015 757.6
15 2016 753.9
Use Series.pct_change():
df['Total Managed Expenditure'].pct_change()
Out:
0 NaN
1 0.053060
2 0.056426
3 0.059664
4 0.061194
5 0.035902
6 0.018709
7 0.035978
8 0.063644
9 0.037158
10 0.013172
11 -0.012220
12 0.001053
13 -0.009598
14 0.005708
15 -0.004884
Name: Total Managed Expenditure, dtype: float64
To assign it back:
df['Growth Rate'] = df['Total Managed Expenditure'].pct_change()
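For reference, pct_change() divides each value by the previous one and subtracts 1; a sketch of the equivalent manual calculation, and of scaling the result to percent (illustrative column name), would be:
# manual equivalent of pct_change() for this column
manual = df['Total Managed Expenditure'] / df['Total Managed Expenditure'].shift(1) - 1
# multiply by 100 if you prefer the growth rate expressed as a percentage
df['Growth Rate (%)'] = df['Total Managed Expenditure'].pct_change() * 100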

Filter rows in a pandas DataFrame based on a value

I have a DataFrame similar to the one below (this is just a sample):
i TIME CITIES_LABEL Value lat_rounded long
2 2005 Tilburg 22 250 52.070498 4.300700
3 2005 Amsterdam 45 825 52.370216 4.895168
4 2005 Rotterdam 27 600 51.924420 4.477733
5 2005 Utrecht 12 915 52.090737 5.121420
6 2005 Eindhoven 9 165 51.441642 5.469722
7 2006 Tilburg 7 800 51.560596 5.091914
8 2005 Groningen 7 620 53.219383 6.566502
9 2005 Enschede 6 250 52.221537 6.893662
10 2005 Arnhem 6 025 51.985103 5.898730
11 2006 Utrecht 3 400 50.888174 5.979499
12 2006 Amsterdam 6 795 52.350785 5.264702
13 2005 Breda 8 565 51.571915 4.768323
14 2010 Groningen 6 325 51.812563 5.837226
15 2005 Apeldoorn 7 005 52.211157 5.969923
16 2007 Utrecht 3 785 53.201233 5.799913
17 2006 Rotterdam 7 130 52.387388 4.646219
18 2005 Zaanstad 6 060 52.457966 4.751042
19 2008 Tilburg 6 945 51.697816 5.303675
20 2007 Amsterdam 5 840 52.156111 5.387827
21 2005 Maastricht 5 220 50.851368 5.690972
Cities are repeated along the CITIES_LABEL field. I would like to keep, for each city, only the row with its highest TIME value. An example of the output I would like is:
i TIME CITIES_LABEL Value lat_rounded long
6 2005 Eindhoven 9 165 51.441642 5.469722
9 2005 Enschede 6 250 52.221537 6.893662
10 2005 Arnhem 6 025 51.985103 5.898730
13 2005 Breda 8 565 51.571915 4.768323
14 2010 Groningen 6 325 51.812563 5.837226
15 2005 Apeldoorn 7 005 52.211157 5.969923
16 2007 Utrecht 3 785 53.201233 5.799913
17 2006 Rotterdam 7 130 52.387388 4.646219
18 2005 Zaanstad 6 060 52.457966 4.751042
19 2008 Tilburg 6 945 51.697816 5.303675
20 2007 Amsterdam 5 840 52.156111 5.387827
21 2005 Maastricht 5 220 50.851368 5.690972
Any thoughts on how best to approach this issue in pandas?
EDIT
My question is different from Python : How can I get Rows which have the max value of the group to which they belong? because I am looking at both TIME and CITIES_LABEL, while that question only filters on the (maximum) value of one field and does not care about duplicates in other fields.
Use groupby and idxmax:
df.loc[df.groupby('CITIES_LABEL').TIME.idxmax()]
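An equivalent alternative (a sketch, not from the original answer) sorts by TIME and keeps the last row per city; on ties it may keep a different row than idxmax, which returns the first occurrence of the maximum:
# sort by TIME, keep each city's highest-TIME row, then restore the original row order
df.sort_values('TIME').drop_duplicates('CITIES_LABEL', keep='last').sort_index()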
