Shifting columns in grouped pandas dataframe - python

I have a dataframe which, after grouping it by country and group looks like this:
               A   B   C   D
country group
1       a1    10  20  30  40
        a2    11  21  31  41
        a3    12  22  32  42
        a4    13  23  33  43

               A   B   C   D
country group
2       a1    50  60  70  80
        a2    51  61  71  81
        a3    52  62  72  82
        a4    53  63  73  83
My goal is to create another column E that holds the column D values shifted up by one row within each country, like so:
               A   B   C   D    E
country group
1       a1    10  20  30  40   41
        a2    11  21  31  41   42
        a3    12  22  32  42   43
        a4    13  23  33  43  NaN

               A   B   C   D    E
country group
2       a1    50  60  70  80   81
        a2    51  61  71  81   82
        a3    52  62  72  82   83
        a4    53  63  73  83  NaN
What I've tried:
df.groupby(['country','group']).sum().apply(lambda x['E']: x['D'].shift(-1))
but I get invalid syntax.
Afterwards I want to delete the bottom row of each group, where NaN is present, like so:
df = df[~df.isin([np.nan]).any(1)] which works.
How can I add a column E to the df which would hold column D values shifted by -1?

Use DataFrameGroupBy.shift, grouping by the first index level:
df = df.groupby(['country','group']).sum()
df['E'] = df.groupby(level=0)['D'].shift(-1)
And then DataFrame.dropna:
df = df.dropna(subset=['E'])
Sample:
print (df)
country group A B C D
0 1 a1 10 20 30 40
1 1 a1 11 21 31 41
2 1 a1 12 22 32 42
3 1 a2 13 23 33 43
4 1 a2 11 21 31 41
5 1 a2 12 22 32 42
6 1 a3 13 23 33 43
7 1 a3 11 21 31 41
8 1 a3 12 22 32 42
9 1 a4 13 23 33 43
10 1 a4 11 21 31 41
11 1 a5 12 22 32 42
12 1 a5 13 23 33 43
13 2 a2 50 60 70 80
14 2 a3 51 61 71 81
15 2 a4 52 62 72 82
16 2 a5 53 63 73 83
df = df.groupby(['country','group']).sum()
print (df)
               A   B   C    D
country group
1       a1    33  63  93  123
        a2    36  66  96  126
        a3    36  66  96  126
        a4    24  44  64   84
        a5    25  45  65   85
2       a2    50  60  70   80
        a3    51  61  71   81
        a4    52  62  72   82
        a5    53  63  73   83
df['E'] = df.groupby(level=0)['D'].shift(-1)
print (df)
               A   B   C    D      E
country group
1       a1    33  63  93  123  126.0
        a2    36  66  96  126  126.0
        a3    36  66  96  126   84.0
        a4    24  44  64   84   85.0
        a5    25  45  65   85    NaN
2       a2    50  60  70   80   81.0
        a3    51  61  71   81   82.0
        a4    52  62  72   82   83.0
        a5    53  63  73   83    NaN
df = df.dropna(subset=['E'])
print (df)
               A   B   C    D      E
country group
1       a1    33  63  93  123  126.0
        a2    36  66  96  126  126.0
        a3    36  66  96  126   84.0
        a4    24  44  64   84   85.0
2       a2    50  60  70   80   81.0
        a3    51  61  71   81   82.0
        a4    52  62  72   82   83.0
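For completeness, here is a minimal, self-contained sketch of the same steps applied directly to the small frame from the question (the values are assumed to be already aggregated, so the sum() step is skipped):

import pandas as pd

# Rebuild the grouped frame from the question.
idx = pd.MultiIndex.from_product([[1, 2], ['a1', 'a2', 'a3', 'a4']],
                                 names=['country', 'group'])
data = [[10, 20, 30, 40], [11, 21, 31, 41], [12, 22, 32, 42], [13, 23, 33, 43],
        [50, 60, 70, 80], [51, 61, 71, 81], [52, 62, 72, 82], [53, 63, 73, 83]]
df = pd.DataFrame(data, index=idx, columns=list('ABCD'))

# Shift D up by one row within each country (level 'country' is level 0),
# then drop the last row of each group, which has no following value.
df['E'] = df.groupby(level='country')['D'].shift(-1)
df = df.dropna(subset=['E'])
print(df)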

Related

Finding the maximum difference for a subset of columns with pandas

I have a dataframe:
A B C D E
0 a 34 55 43 aa
1 b 53 77 65 bb
2 c 23 100 34 cc
3 d 54 43 23 dd
4 e 23 67 54 ee
5 f 43 98 23 ff
I need to get the maximum difference between columns B, C and D and return the value in column A. In the row where A is 'a', the maximum difference between columns is 55 - 34 = 21. The data is in a dataframe.
The expected result is
A B C D E
0 21 34 55 43 aa
1 24 53 77 65 bb
2 77 23 100 34 cc
3 31 54 43 23 dd
4 44 23 67 54 ee
5 75 43 98 23 ff
Use np.ptp (peak-to-peak, i.e. the range max - min along an axis):
# df['A'] = np.ptp(df.loc[:, 'B':'D'], axis=1)
df['A'] = np.ptp(df[['B', 'C', 'D']], axis=1)
df
A B C D E
0 21 34 55 43 aa
1 24 53 77 65 bb
2 77 23 100 34 cc
3 31 54 43 23 dd
4 44 23 67 54 ee
5 75 43 98 23 ff
Or, find the max and min yourself:
df['A'] = df[['B', 'C', 'D']].max(1) - df[['B', 'C', 'D']].min(1)
df
A B C D E
0 21 34 55 43 aa
1 24 53 77 65 bb
2 77 23 100 34 cc
3 31 54 43 23 dd
4 44 23 67 54 ee
5 75 43 98 23 ff
If performance is important, you can do this in NumPy space:
v = df[['B', 'C', 'D']].values
df['A'] = v.max(1) - v.min(1)
df
A B C D E
0 21 34 55 43 aa
1 24 53 77 65 bb
2 77 23 100 34 cc
3 31 54 43 23 dd
4 44 23 67 54 ee
5 75 43 98 23 ff
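A self-contained sketch of the NumPy-space variant, rebuilding the example data (column A initially holds labels and is overwritten, as in the expected output):

import pandas as pd

df = pd.DataFrame({'A': list('abcdef'),
                   'B': [34, 53, 23, 54, 23, 43],
                   'C': [55, 77, 100, 43, 67, 98],
                   'D': [43, 65, 34, 23, 54, 23],
                   'E': ['aa', 'bb', 'cc', 'dd', 'ee', 'ff']})

# Per-row range (max minus min) of B, C and D, written into A.
v = df[['B', 'C', 'D']].to_numpy()
df['A'] = v.max(axis=1) - v.min(axis=1)
print(df)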

Pandas DataFrame query

I would like to retrieve data based on a column name and its minimum and maximum values, but I can't figure out how to get that result. I am able to select data based on the column name but don't understand how to apply the limits.
The column names and their corresponding min and max values are given as a list of tuples.
import pandas as pd
import numpy as np
def c_cutoff(data_frame, column_cutoff):
    selected_data = data_frame.loc[:, [X[0] for X in column_cutoff]]
    return selected_data

np.random.seed(5)
df = pd.DataFrame(np.random.randint(100, size=(100, 6)),
                  columns=list('ABCDEF'),
                  index=['R{}'.format(i) for i in range(100)])
column_cutoffdata = [('B',27,78),('E',44,73)]
newdata_cutoff = c_cutoff(df,column_cutoffdata)
print(df.head())
print(newdata_cutoff)
result
B E
R0 78 73
R1 27 7
R2 53 44
R3 65 84
R4 9 1
..
.
Expected output
For column B, all values less than 27 or greater than 78 should be discarded; the same applies to E with its bounds (44, 73).
You can be rather explicit and do the following:
limiters = [('B',27,78),('E',44,73)]
for lim in limiters:
    df = df[(df[lim[0]] >= lim[1]) & (df[lim[0]] <= lim[2])]
Yields:
A B C D E F
R0 99 78 61 16 73 8
R2 15 53 80 27 44 77
R8 30 62 11 67 65 55
R11 90 31 9 38 47 16
R15 16 64 8 90 44 37
R16 94 75 5 22 52 69
R46 11 30 26 8 51 61
R48 39 59 22 80 58 44
R66 55 38 5 49 58 15
R70 36 78 5 13 73 69
R72 70 58 52 99 67 11
R75 20 59 57 33 53 96
R77 32 31 89 49 69 41
R79 43 28 17 16 73 54
R80 45 34 90 67 69 70
R87 9 50 16 61 65 30
R90 43 56 76 7 47 62
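The same filter can also be built as a single combined boolean mask instead of filtering the frame repeatedly (a sketch, reusing the setup from the question):

import numpy as np
import pandas as pd

np.random.seed(5)
df = pd.DataFrame(np.random.randint(100, size=(100, 6)),
                  columns=list('ABCDEF'),
                  index=['R{}'.format(i) for i in range(100)])

limiters = [('B', 27, 78), ('E', 44, 73)]
mask = np.logical_and.reduce([df[col].between(lo, hi) for col, lo, hi in limiters])
df = df[mask]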
pipe + where + between
You can't discard individual values from a column; that would change its length, and a dataframe's columns must all have the same size.
But you can iterate and use pd.Series.where to replace out-of-range values with NaN. Note the Pandas way to feed a dataframe through a function is via pipe:
import pandas as pd
import numpy as np
def c_cutoff(data_frame, column_cutoff):
    for col, min_val, max_val in column_cutoff:
        data_frame[col] = data_frame[col].where(data_frame[col].between(min_val, max_val))
    return data_frame
np.random.seed(5)
df = pd.DataFrame(np.random.randint(100, size=(100, 6)),
                  columns=list('ABCDEF'),
                  index=['R{}'.format(i) for i in range(100)])
column_cutoffdata = [('B',27,78),('E',44,73)]
print(df.head())
# A B C D E F
# R0 99 78 61 16 73 8
# R1 62 27 30 80 7 76
# R2 15 53 80 27 44 77
# R3 75 65 47 30 84 86
# R4 18 9 41 62 1 82
newdata_cutoff = df.pipe(c_cutoff, column_cutoffdata)
print(newdata_cutoff.head())
# A B C D E F
# R0 99 78.0 61 16 73.0 8
# R1 62 27.0 30 80 NaN 76
# R2 15 53.0 80 27 44.0 77
# R3 75 65.0 47 30 NaN 86
# R4 18 NaN 41 62 NaN 82
If you want to drop rows with any NaN values, you can then use dropna:
newdata_cutoff = newdata_cutoff.dropna()

Rolling average across several columns and rows

import random
import pandas as pd

df = pd.DataFrame({"A": random.sample(range(1, 100), 10),
                   "B": random.sample(range(1, 100), 10),
                   "C": random.sample(range(1, 100), 10)})
df["D"] = "need_to_calc"
df
I need the value of Column D, Row 9 to equal the average of the block of cells from rows 6 through 8 across columns A through C. I want to do this for all rows.
I am not sure how to do this in a single pythonic action. Instead I have hacky temporary columns and ugly nonsense.
Is there a cleaner way to define this column without temporary tables?
You can do it like this:
means = df.rolling(3).mean().shift(1)
df['D'] = (means['A'] + means['B'] + means['C'])/3
Output:
A B C D
0 43 57 15 NaN
1 86 34 68 NaN
2 40 12 78 NaN
3 97 24 54 48.111111
4 90 42 10 54.777778
5 34 54 98 49.666667
6 98 36 31 55.888889
7 16 5 24 54.777778
8 35 53 67 44.000000
9 80 66 37 40.555556
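The same result can be computed without the intermediate means frame, since the mean of each 3x3 block equals the rolling mean of the per-row means (a compact sketch, assuming A, B and C are the only value columns):

df['D'] = df[['A', 'B', 'C']].mean(axis=1).rolling(3).mean().shift(1)

As a check against the output above, row 3 is (43 + 86 + 40 + 57 + 34 + 12 + 15 + 68 + 78) / 9 = 433 / 9 ≈ 48.111.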
You can also do it like this:
df["D"]= (df.sum(axis=1).rolling(window=3, min_periods=3).sum()/9).shift(1)
Example:
A B C D
0 62 89 12 need_to_calc
1 44 13 63 need_to_calc
2 28 21 54 need_to_calc
3 93 93 4 need_to_calc
4 95 84 42 need_to_calc
5 68 68 35 need_to_calc
6 3 92 56 need_to_calc
7 13 88 83 need_to_calc
8 22 37 23 need_to_calc
9 64 58 5 need_to_calc
Output:
A B C D
0 62 89 12 NaN
1 44 13 63 NaN
2 28 21 54 NaN
3 93 93 4 42.888889
4 95 84 42 45.888889
5 68 68 35 57.111111
6 3 92 56 64.666667
7 13 88 83 60.333333
8 22 37 23 56.222222
9 64 58 5 46.333333

Split a Pandas Dataframe into multiple Dataframes based on Triangular Number Series

I have a DataFrame (df) and I need to split it into n DataFrames based on column positions, following the triangular number series pattern:
df1 = df[[0]]
df2 = df[[1,2]]
df3 = df[[3,4,5]]
df4 = df[[6,7,8,9]]
etc.
Consider the dataframe df
df = pd.DataFrame(
    np.arange(100).reshape(10, 10),
    columns=list('ABCDEFGHIJ')
)
df
A B C D E F G H I J
0 0 1 2 3 4 5 6 7 8 9
1 10 11 12 13 14 15 16 17 18 19
2 20 21 22 23 24 25 26 27 28 29
3 30 31 32 33 34 35 36 37 38 39
4 40 41 42 43 44 45 46 47 48 49
5 50 51 52 53 54 55 56 57 58 59
6 60 61 62 63 64 65 66 67 68 69
7 70 71 72 73 74 75 76 77 78 79
8 80 81 82 83 84 85 86 87 88 89
9 90 91 92 93 94 95 96 97 98 99
i_s, j_s = np.arange(4).cumsum(), np.arange(1, 5).cumsum()
df1, df2, df3, df4 = [
df.iloc[:, i:j] for i, j in zip(i_s, j_s)
]
Verify
pd.concat(dict(enumerate([df.iloc[:, i:j] for i, j in zip(i_s, j_s)])), axis=1)
0 1 2 3
A B C D E F G H I J
0 0 1 2 3 4 5 6 7 8 9
1 10 11 12 13 14 15 16 17 18 19
2 20 21 22 23 24 25 26 27 28 29
3 30 31 32 33 34 35 36 37 38 39
4 40 41 42 43 44 45 46 47 48 49
5 50 51 52 53 54 55 56 57 58 59
6 60 61 62 63 64 65 66 67 68 69
7 70 71 72 73 74 75 76 77 78 79
8 80 81 82 83 84 85 86 87 88 89
9 90 91 92 93 94 95 96 97 98 99
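The same slicing generalizes to any number of triangular chunks (a sketch; k = 4 matches the example, and df is the frame defined above):

import numpy as np

k = 4  # number of chunks; assumes df has at least k*(k+1)//2 columns
bounds = np.concatenate(([0], np.arange(1, k + 1).cumsum()))  # [0, 1, 3, 6, 10]
chunks = [df.iloc[:, i:j] for i, j in zip(bounds[:-1], bounds[1:])]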
First generate the triangular number series, then apply it to the dataframe:
n = len(df.columns)
res = []
i, end = 1, 0
while end < n:
    begin = end
    end = i * (i + 1) // 2
    res.append((begin, end))
    i += 1
dfs = [df.iloc[:, begin:end] for begin, end in res]
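As a quick check, the resulting chunks line up with df1 to df4 from the first answer:

for part in dfs:
    print(list(part.columns))
# ['A']
# ['B', 'C']
# ['D', 'E', 'F']
# ['G', 'H', 'I', 'J']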

Issue with merging time series variables to create new DataFrame with arbitrary index

So I am trying to merge the following columns of data, which are currently indexed as daily entries (but only have points once per week). I have separated the columns into year variables but am having trouble getting them into a combined dataframe that disregards the date index, so that I can build min/max columns by week across the years. I am not sure how to get the merge/join functions to do this.
#Create year variables, append to new dataframe with new index
I have the following:
def minmaxdata():
    Totrigs = dataforgraphs()
    tr = Totrigs
    yrs = [tr['2007'],tr['2008'],tr['2009'],tr['2010'],tr['2011'],tr['2012'],tr['2013'],tr['2014']]
    yrlist = ['tr07','tr08','tr09','tr10','tr11','tr12','tr13','tr14']
    dic = dict(zip(yrlist,yrs))
    yr07,yr08,yr09,yr10,yr11,yr12,yr13,yr14 = dic['tr07'],dic['tr08'],dic['tr09'],dic['tr10'],dic['tr11'],dic['tr12'],dic['tr13'],dic['tr14']
    minmax = yr07.append([yr08,yr09,yr10,yr11,yr12,yr13,yr14],ignore_index=True)
I would like a Dataframe like the following:
2007 2008 2009 2010 2011 2012 2013 2014 min max
1 10 13 10 12 34 23 22 14 10 34
2 25 ...
3 22
4 ...
5
.
.
. ...
52
I'm not sure what your original data look like, but I don't think it's a good idea to hard-code all the years; you lose re-usability. I'll set up a sequence of random integers indexed by date, with one value per week.
In [65]: idx = pd.date_range ('2007-1-1','2014-12-31',freq='W')
In [66]: df = pd.DataFrame(np.random.randint(100, size=len(idx)), index=idx, columns=['value'])
In [67]: df.head()
Out[67]:
value
2007-01-07 7
2007-01-14 2
2007-01-21 85
2007-01-28 55
2007-02-04 36
In [68]: df.tail()
Out[68]:
value
2014-11-30 76
2014-12-07 34
2014-12-14 43
2014-12-21 26
2014-12-28 17
Then get the year, and the week number within each year:
In [69]: df['year'] = df.index.year
In [70]: df['week'] = df.groupby('year').cumcount()+1
(You may try df.index.week for week# but I've seen weird behavior like starting from week #53 in Jan.)
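For reference, newer pandas (1.1+) also exposes ISO week numbers directly, though it carries the same week-53 caveat mentioned above (a sketch):

# ISO week numbers; Jan 1 can fall in ISO week 52/53 of the previous year,
# which is why the cumcount-per-year approach above is used instead.
df['iso_week'] = df.index.isocalendar().week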
Finally, do a pivot table to transform and get row-wise max/min:
In [71]: df2 = df.pivot_table(index='week', columns='year', values='value')
In [72]: df2['max'] = df2.max(axis=1)
In [73]: df2['min'] = df2.min(axis=1)
And now our dataframe df2 looks like this and should be what you need:
In [74]: df2
Out[74]:
year 2007 2008 2009 2010 2011 2012 2013 2014 max min
week
1 7 82 13 32 24 58 18 10 82 7
2 2 5 29 0 2 97 59 83 97 0
3 85 89 8 83 63 73 47 49 89 8
4 55 5 1 44 78 10 13 87 87 1
5 36 41 48 98 98 24 24 69 98 24
6 51 43 62 60 44 57 34 33 62 33
7 37 66 72 46 28 11 73 36 73 11
8 30 13 86 93 46 67 95 15 95 13
9 78 84 16 21 70 39 43 90 90 16
10 9 2 88 15 39 81 44 96 96 2
11 34 76 16 44 44 26 30 77 77 16
12 2 24 23 13 25 69 25 74 74 2
13 66 91 67 77 18 47 95 66 95 18
14 59 52 22 42 40 99 88 21 99 21
15 76 17 31 57 43 31 91 67 91 17
16 76 38 53 43 84 45 78 9 84 9
17 88 53 34 22 99 93 61 42 99 22
18 78 19 82 19 5 80 55 69 82 5
19 54 92 56 6 2 85 7 67 92 2
20 8 56 86 41 60 76 31 81 86 8
21 64 76 11 38 41 98 39 72 98 11
22 21 86 34 1 15 27 26 95 95 1
23 82 90 3 17 62 18 93 20 93 3
24 47 42 32 27 83 8 22 14 83 8
25 15 66 70 16 4 22 26 14 70 4
26 12 68 21 7 86 2 27 10 86 2
27 85 85 9 39 17 94 67 42 94 9
28 73 80 96 49 46 23 69 84 96 23
29 57 74 6 71 79 31 79 7 79 6
30 18 84 85 34 71 69 0 62 85 0
31 24 40 93 53 72 46 44 71 93 24
32 95 4 58 57 68 27 95 71 95 4
33 65 84 87 41 38 45 71 33 87 33
34 62 14 41 83 79 63 44 13 83 13
35 49 96 50 62 25 45 69 63 96 25
36 6 38 86 34 98 60 67 80 98 6
37 99 44 26 19 19 20 57 17 99 17
38 2 40 7 65 68 58 68 13 68 2
39 72 31 83 65 69 39 10 76 83 10
40 90 31 42 20 7 8 62 79 90 7
41 10 46 82 96 30 43 12 84 96 10
42 79 38 28 78 25 9 80 2 80 2
43 64 83 63 40 29 86 10 15 86 10
44 89 91 62 48 53 69 16 0 91 0
45 99 26 85 45 26 53 79 86 99 26
46 35 14 46 25 74 6 68 44 74 6
47 17 9 84 88 29 83 85 1 88 1
48 18 69 55 16 77 35 16 76 77 16
49 60 4 36 50 81 28 50 34 81 4
50 36 29 38 28 81 86 71 43 86 28
51 41 82 95 27 95 77 74 26 95 26
52 2 81 89 82 28 2 11 17 89 2
53 NaN NaN NaN NaN NaN 0 NaN NaN 0 0
EDIT:
If you need the max/min over only certain columns, just list them. In this case (2007-2013) they are consecutive, so you can do the following:
df2['max_2007to2013'] = df2[list(range(2007,2014))].max(axis=1)
If not, simply list them like: df2[[2007,2010,2012,2013]].max(axis=1)
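For re-use, the steps above can be wrapped in a small helper (a sketch; the function name weekly_min_max is illustrative, and it assumes one weekly observation per date):

def weekly_min_max(series):
    # series: weekly values indexed by date
    frame = series.to_frame('value')
    frame['year'] = frame.index.year
    frame['week'] = frame.groupby('year').cumcount() + 1
    out = frame.pivot_table(index='week', columns='year', values='value')
    out['max'] = out.max(axis=1)
    out['min'] = out.min(axis=1)
    return out

result = weekly_min_max(df['value'])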
