I recently tried to strip "\t" characters from the values of a dataframe.
def FillMissing(dataFrame):
    dataFrame.replace("\?", np.nan, inplace=True, regex=True)
    dataFrame[:] = dataFrame.apply(lambda x: x.str.strip("\t")
                                   if x.dtype == "object" else x)
    print(id(dataFrame))
.......
(test_df, obj, int_, float_) = FillMissing(test_df)
print(id(test_df))
Output:
ID(dataFrame) - 4647827832
htn-->['yes' 'no']
dm-->['yes' 'no' ' yes']
cad-->['no' 'yes']
appet-->['good' 'poor']
pe-->['no' 'yes']
ane-->['no' 'yes']
ID(test_df) - 4647827832
By slightly modifying the method, the output changes:
def FillMissing(dataFrame):
    dataFrame.replace("\?", np.nan, inplace=True, regex=True)
    dataFrame = dataFrame.apply(lambda x: x.str.strip("\t")      # <-- changed line: plain assignment instead of dataFrame[:]
                                if x.dtype == "object" else x)
    print(id(dataFrame))
.......
(test_df, obj, int_, float_) = FillMissing(test_df)
print(id(test_df))
Output:
ID(dataFrame) - 4647827832
htn-->['yes' 'no' nan]
dm-->['yes' 'no' ' yes' nan]
cad-->['no' 'yes' nan]
appet-->['good' 'poor' nan]
pe-->['no' 'yes' nan]
ID(test_df) - 4517467528
I'm unable to understand the difference between the two statements.
Also, when executing the strip outside the function on test_df, it works completely fine.
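For reference, a minimal sketch (with a made-up two-column frame, not the real data) of the two assignment styles inside a function:

import pandas as pd

def strip_inplace(df):
    # df[:] = ... writes the result back into the existing object; id() stays the same
    df[:] = df.apply(lambda x: x.str.strip("\t") if x.dtype == "object" else x)
    print(id(df))

def strip_rebind(df):
    # df = ... rebinds the local name to a brand-new DataFrame; the caller's object is untouched
    df = df.apply(lambda x: x.str.strip("\t") if x.dtype == "object" else x)
    print(id(df))

toy = pd.DataFrame({"a": ["yes\t", "no"], "b": [1, 2]})
print(id(toy))
strip_inplace(toy)   # prints the same id as toy
strip_rebind(toy)    # prints a different id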
Snapshot of the data used:
id age bp sg al su rbc pc pcc \
0 0 48.0 80.0 1.020 1.0 0.0 normal normal notpresent
1 1 7.0 50.0 1.020 4.0 0.0 normal normal notpresent
2 2 62.0 80.0 1.010 2.0 3.0 normal normal notpresent
3 3 48.0 70.0 1.005 4.0 0.0 normal abnormal present
4 4 51.0 80.0 1.010 2.0 0.0 normal normal notpresent
5 5 60.0 90.0 1.015 3.0 0.0 normal normal notpresent
6 6 68.0 70.0 1.010 0.0 0.0 normal normal notpresent
7 7 24.0 80.0 1.015 2.0 4.0 normal abnormal notpresent
8 8 52.0 100.0 1.015 3.0 0.0 normal abnormal present
9 9 53.0 90.0 1.020 2.0 0.0 abnormal abnormal present
10 10 50.0 60.0 1.010 2.0 4.0 abnormal abnormal present
11 11 63.0 70.0 1.010 3.0 0.0 abnormal abnormal present
12 12 68.0 70.0 1.015 3.0 1.0 normal normal present
13 13 68.0 70.0 1.020 0.0 0.0 normal abnormal notpresent
14 14 68.0 80.0 1.010 3.0 2.0 normal abnormal present
15 15 40.0 80.0 1.015 3.0 0.0 abnormal normal notpresent
16 16 47.0 70.0 1.015 2.0 0.0 abnormal normal notpresent
17 17 47.0 80.0 1.020 0.0 0.0 abnormal normal notpresent
18 18 60.0 100.0 1.025 0.0 3.0 abnormal normal notpresent
19 19 62.0 60.0 1.015 1.0 0.0 abnormal abnormal present
20 20 61.0 80.0 1.015 2.0 0.0 abnormal abnormal notpresent
21 21 60.0 90.0 1.020 0.0 0.0 normal abnormal notpresent
22 22 48.0 80.0 1.025 4.0 0.0 normal abnormal notpresent
23 23 21.0 70.0 1.010 0.0 0.0 normal normal notpresent
24 24 42.0 100.0 1.015 4.0 0.0 normal abnormal notpresent
25 25 61.0 60.0 1.025 0.0 0.0 normal normal notpresent
26 26 75.0 80.0 1.015 0.0 0.0 normal normal notpresent
27 27 69.0 70.0 1.010 3.0 4.0 normal abnormal notpresent
28 28 75.0 70.0 1.020 1.0 3.0 abnormal abnormal notpresent
29 29 68.0 70.0 1.005 1.0 0.0 abnormal abnormal present
I am trying to replicate COUNTIFS in Excel to get a rank between two unique values that are listed in my dataframe. I have attached the expected output, calculated in Excel using COUNTIF and LET/RANK functions.
I am trying to generate an "average rank of gas and coal plants" column that takes the number from the "Average Rank" column and then ranks the two unique types from Technology (CCGT or COAL) into two new ranks (Gas or Coal), so that I can get the relevant quantiles for this. In case you are wondering why I would need this when there are only two coal plants: when I run this model on a larger dataset it will be useful to know how to do it in code rather than manually.
Ideally the output will return two ranks, 1-47 for all units with Technology == CCGT and 1-2 for all units with Technology == COAL.
This is the column I am looking to create:
Unit ID | Technology | 03/01/2022 | 04/01/2022 | 05/01/2022 | 06/01/2022 | 07/01/2022 | 08/01/2022 | Average Rank | Unit Rank | Avg Rank of Gas & Coal plants | Gas Quintiles | Coal Quintiles | Quintiles
(Units with fewer than six daily values had no rank on the missing days; only the days with a value are shown, in order.)
FAWN-1 | CCGT | 1.0 | 5.0 | 1.0 | 5.0 | 2.0 | 1.0 | 2.5 | 1 | 1 | 1 | 0 | Gas_1
GRAI-6 | CCGT | 4.0 | 18.0 | 2.0 | 4.0 | 3.0 | 3.0 | 5.7 | 2 | 2 | 1 | 0 | Gas_1
EECL-1 | CCGT | 5.0 | 29.0 | 4.0 | 1.0 | 1.0 | 2.0 | 7.0 | 3 | 3 | 1 | 0 | Gas_1
PEMB-21 | CCGT | 7.0 | 1.0 | 6.0 | 13.0 | 8.0 | 8.0 | 7.2 | 4 | 4 | 1 | 0 | Gas_1
PEMB-51 | CCGT | 3.0 | 3.0 | 3.0 | 11.0 | 16.0 | 7.2 | 5 | 5 | 1 | 0 | Gas_1
PEMB-41 | CCGT | 9.0 | 4.0 | 7.0 | 7.0 | 10.0 | 13.0 | 8.3 | 6 | 6 | 1 | 0 | Gas_1
WBURB-1 | CCGT | 6.0 | 9.0 | 22.0 | 2.0 | 7.0 | 5.0 | 8.5 | 7 | 7 | 1 | 0 | Gas_1
PEMB-31 | CCGT | 14.0 | 6.0 | 13.0 | 6.0 | 4.0 | 9.0 | 8.7 | 8 | 8 | 1 | 0 | Gas_1
GRMO-1 | CCGT | 2.0 | 7.0 | 10.0 | 24.0 | 11.0 | 6.0 | 10.0 | 9 | 9 | 1 | 0 | Gas_1
PEMB-11 | CCGT | 21.0 | 2.0 | 9.0 | 10.0 | 9.0 | 14.0 | 10.8 | 10 | 10 | 2 | 0 | Gas_2
STAY-1 | CCGT | 19.0 | 12.0 | 5.0 | 23.0 | 6.0 | 7.0 | 12.0 | 11 | 11 | 2 | 0 | Gas_2
GRAI-7 | CCGT | 10.0 | 27.0 | 15.0 | 9.0 | 15.0 | 11.0 | 14.5 | 12 | 12 | 2 | 0 | Gas_2
DIDCB6 | CCGT | 28.0 | 11.0 | 11.0 | 8.0 | 19.0 | 15.0 | 15.3 | 13 | 13 | 2 | 0 | Gas_2
SCCL-3 | CCGT | 17.0 | 16.0 | 31.0 | 3.0 | 18.0 | 10.0 | 15.8 | 14 | 14 | 2 | 0 | Gas_2
STAY-4 | CCGT | 12.0 | 8.0 | 20.0 | 18.0 | 14.0 | 23.0 | 15.8 | 14 | 14 | 2 | 0 | Gas_2
CDCL-1 | CCGT | 13.0 | 22.0 | 8.0 | 25.0 | 12.0 | 16.0 | 16.0 | 16 | 16 | 2 | 0 | Gas_2
STAY-3 | CCGT | 8.0 | 17.0 | 17.0 | 20.0 | 13.0 | 22.0 | 16.2 | 17 | 17 | 2 | 0 | Gas_2
MRWD-1 | CCGT | 19.0 | 26.0 | 5.0 | 19.0 | 17.3 | 18 | 18 | 2 | 0 | Gas_2
WBURB-3 | CCGT | 24.0 | 14.0 | 17.0 | 17.0 | 18.0 | 19 | 19 | 3 | 0 | Gas_3
WBURB-2 | CCGT | 14.0 | 21.0 | 12.0 | 31.0 | 18.0 | 19.2 | 20 | 20 | 3 | 0 | Gas_3
GYAR-1 | CCGT | 26.0 | 14.0 | 17.0 | 20.0 | 21.0 | 19.6 | 21 | 21 | 3 | 0 | Gas_3
STAY-2 | CCGT | 18.0 | 20.0 | 18.0 | 21.0 | 24.0 | 20.0 | 20.2 | 22 | 22 | 3 | 0 | Gas_3
KLYN-A-1 | CCGT | 24.0 | 12.0 | 19.0 | 27.0 | 20.5 | 23 | 23 | 3 | 0 | Gas_3
SHOS-1 | CCGT | 16.0 | 15.0 | 28.0 | 15.0 | 29.0 | 27.0 | 21.7 | 24 | 24 | 3 | 0 | Gas_3
DIDCB5 | CCGT | 10.0 | 35.0 | 22.0 | 22.3 | 25 | 25 | 3 | 0 | Gas_3
CARR-1 | CCGT | 33.0 | 26.0 | 27.0 | 22.0 | 4.0 | 22.4 | 26 | 26 | 3 | 0 | Gas_3
LAGA-1 | CCGT | 15.0 | 13.0 | 29.0 | 32.0 | 23.0 | 24.0 | 22.7 | 27 | 27 | 3 | 0 | Gas_3
CARR-2 | CCGT | 24.0 | 25.0 | 27.0 | 29.0 | 21.0 | 12.0 | 23.0 | 28 | 28 | 3 | 0 | Gas_3
GRAI-8 | CCGT | 11.0 | 28.0 | 36.0 | 16.0 | 26.0 | 25.0 | 23.7 | 29 | 29 | 4 | 0 | Gas_4
SCCL-2 | CCGT | 29.0 | 16.0 | 28.0 | 25.0 | 24.5 | 30 | 30 | 4 | 0 | Gas_4
LBAR-1 | CCGT | 19.0 | 25.0 | 31.0 | 28.0 | 25.8 | 31 | 31 | 4 | 0 | Gas_4
CNQPS-2 | CCGT | 20.0 | 32.0 | 32.0 | 26.0 | 27.5 | 32 | 32 | 4 | 0 | Gas_4
SPLN-1 | CCGT | 23.0 | 30.0 | 30.0 | 27.7 | 33 | 33 | 4 | 0 | Gas_4
DAMC-1 | CCGT | 23.0 | 21.0 | 38.0 | 34.0 | 29.0 | 34 | 34 | 4 | 0 | Gas_4
KEAD-2 | CCGT | 30.0 | 30.0 | 35 | 35 | 4 | 0 | Gas_4
SHBA-1 | CCGT | 26.0 | 23.0 | 35.0 | 37.0 | 30.3 | 36 | 36 | 4 | 0 | Gas_4
HUMR-1 | CCGT | 22.0 | 30.0 | 37.0 | 37.0 | 33.0 | 28.0 | 31.2 | 37 | 37 | 4 | 0 | Gas_4
CNQPS-4 | CCGT | 27.0 | 33.0 | 35.0 | 30.0 | 31.3 | 38 | 38 | 5 | 0 | Gas_5
CNQPS-1 | CCGT | 25.0 | 40.0 | 33.0 | 32.7 | 39 | 39 | 5 | 0 | Gas_5
SEAB-1 | CCGT | 32.0 | 34.0 | 36.0 | 29.0 | 32.8 | 40 | 40 | 5 | 0 | Gas_5
PETEM1 | CCGT | 35.0 | 35.0 | 41 | 41 | 5 | 0 | Gas_5
ROCK-1 | CCGT | 31.0 | 34.0 | 38.0 | 38.0 | 35.3 | 42 | 42 | 5 | 0 | Gas_5
SEAB-2 | CCGT | 31.0 | 39.0 | 39.0 | 34.0 | 35.8 | 43 | 43 | 5 | 0 | Gas_5
WBURB-43 | COAL | 32.0 | 37.0 | 40.0 | 39.0 | 31.0 | 35.8 | 44 | 1 | 0 | 1 | Coal_1
FDUNT-1 | CCGT | 36.0 | 36.0 | 45 | 44 | 5 | 0 | Gas_5
COSO-1 | CCGT | 30.0 | 42.0 | 36.0 | 36.0 | 45 | 44 | 5 | 0 | Gas_5
WBURB-41 | COAL | 33.0 | 38.0 | 41.0 | 40.0 | 32.0 | 36.8 | 47 | 2 | 0 | 1 | Coal_1
FELL-1 | CCGT | 34.0 | 39.0 | 43.0 | 41.0 | 33.0 | 38.0 | 48 | 46 | 5 | 0 | Gas_5
KEAD-1 | CCGT | 43.0 | 43.0 | 49 | 47 | 5 | 0 | Gas_5
I have tried to do it the same way I got Average Rank (which is a rank of the average of the inputs in the dataframe), but it doesn't seem to work with additional conditions.
Thank you!!
import pandas as pd
df = pd.read_csv("gas.csv")
display(df['Technology'].value_counts())
print('------')
display(df['Technology'].value_counts()[0]) # This is how you access count of CCGT
display(df['Technology'].value_counts()[1])
Output:
CCGT 47
COAL 2
Name: Technology, dtype: int64
------
47
2
By the way: pd.cut or pd.qcut can be used to calculate quantiles. You don't have to manually define what a quantile is.
Refer to the documentation and other websites:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.qcut.html
https://www.geeksforgeeks.org/how-to-use-pandas-cut-and-qcut/
There are many methods you can pass to rank. Refer to the documentation:
https://pandas.pydata.org/docs/reference/api/pandas.Series.rank.html
df['rank'] = df.groupby("Technology")["Average Rank"].rank(method = "dense", ascending = True)
df
method{‘average’, ‘min’, ‘max’, ‘first’, ‘dense’}, default ‘average’
How to rank the group of records that have the same value (i.e. ties):
average: average rank of the group
min: lowest rank in the group
max: highest rank in the group
first: ranks assigned in order they appear in the array
dense: like ‘min’, but rank always increases by 1 between groups.
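For the quintile labels themselves, here is a rough sketch of how pd.qcut could be combined with the group rank above. The 'Gas'/'Coal' prefix mapping and the helper column name are assumptions based on the table in the question, not tested against your data:

# Sketch: bucket each technology's 'Average Rank' into five quantile bins (1-5),
# then build a label like 'Gas_1' or 'Coal_1'.
# duplicates='drop' guards against groups whose quantile edges collide; note that a
# group with only two plants (COAL here) will still be spread across bins, so you
# may want a special case for very small groups.
df['quintile'] = (
    df.groupby('Technology')['Average Rank']
      .transform(lambda s: pd.qcut(s, 5, labels=False, duplicates='drop') + 1)
)
prefix = df['Technology'].map({'CCGT': 'Gas', 'COAL': 'Coal'})
df['Quintiles'] = prefix + '_' + df['quintile'].astype(int).astype(str)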
I have a randomly generated 10*10 dataset and I need to replace 10% of dataset randomly with NaN.
import pandas as pd
import numpy as np
Dataset = pd.DataFrame(np.random.randint(0, 100, size=(10, 10)))
Try the following method. I used this when I was setting up a hackathon and needed to inject missing data for the competition.
You can use np.random.choice to create a boolean mask of the same shape as the dataframe. Set the probabilities p for True and False, where True marks the values that will be replaced by NaN.
Then simply apply the mask using df.mask.
import pandas as pd
import numpy as np
p = 0.1  # fraction of values to replace with NaN
df = pd.DataFrame(np.random.randint(0,100,size=(10,10)))
mask = np.random.choice([True, False], size=df.shape, p=[p,1-p])
new_df = df.mask(mask)
print(new_df)
0 1 2 3 4 5 6 7 8 9
0 50.0 87 NaN 14 78.0 44.0 19.0 94 28 28.0
1 NaN 58 3.0 75 90.0 NaN 29.0 11 47 NaN
2 91.0 30 98.0 77 3.0 72.0 74.0 42 69 75.0
3 68.0 92 90.0 90 NaN 60.0 74.0 72 58 NaN
4 39.0 51 NaN 81 67.0 43.0 33.0 37 13 40.0
5 73.0 0 59.0 77 NaN NaN 21.0 74 55 98.0
6 33.0 64 0.0 59 27.0 32.0 17.0 3 31 43.0
7 75.0 56 21.0 9 81.0 92.0 89.0 82 89 NaN
8 53.0 44 49.0 31 76.0 64.0 NaN 23 37 NaN
9 65.0 15 31.0 21 84.0 7.0 24.0 3 76 34.0
EDIT:
Updated my answer for the exact 10% of values that you are looking for. It uses itertools.product and random.sample to pick a set of indexes to mask, and then sets them to NaN, so the count is exactly what you expect.
from itertools import product
from random import sample
p = 0.1
n = int(df.shape[0]*df.shape[1]*p) #Calculate count of nans
#Sample exactly n indexes
ids = sample(list(product(range(df.shape[0]), range(df.shape[1]))), n)
idx, idy = list(zip(*ids))
data = df.to_numpy().astype(float) #Get data as numpy
data[idx, idy]=np.nan #Update numpy view with np.nan
#Assign to new dataframe
new_df = pd.DataFrame(data, columns=df.columns, index=df.index)
print(new_df)
0 1 2 3 4 5 6 7 8 9
0 52.0 50.0 24.0 81.0 10.0 NaN NaN 75.0 14.0 81.0
1 45.0 3.0 61.0 67.0 93.0 NaN 90.0 34.0 39.0 4.0
2 1.0 NaN NaN 71.0 57.0 88.0 8.0 9.0 62.0 20.0
3 78.0 3.0 82.0 1.0 75.0 50.0 33.0 66.0 52.0 8.0
4 11.0 46.0 58.0 23.0 NaN 64.0 47.0 27.0 NaN 21.0
5 70.0 35.0 54.0 NaN 70.0 82.0 69.0 94.0 20.0 NaN
6 54.0 84.0 16.0 76.0 77.0 50.0 82.0 31.0 NaN 31.0
7 71.0 79.0 93.0 11.0 46.0 27.0 19.0 84.0 67.0 30.0
8 91.0 85.0 63.0 1.0 91.0 79.0 80.0 14.0 75.0 1.0
9 50.0 34.0 8.0 8.0 10.0 56.0 49.0 45.0 39.0 13.0
I'm using pandas in Python, and I have performed some crosstab calculations and concatenations, and at the end up with a data frame that looks like this:
ID 5 6 7 8 9 10 11 12 13
Total 87.0 3.0 9.0 6.0 92.0 7.0 3.0 3.0 20.0
Regular 72.0 2.0 8.0 5.0 81.0 7.0 3.0 3.0 18.0
CR 22.0 0.0 0.0 0.0 17.0 0.0 0.0 0.0 3.0
HDG 20.0 0.0 0.0 0.0 24.0 4.0 0.0 0.0 1.0
PPG 30.0 2.0 8.0 5.0 40.0 3.0 3.0 3.0 14.0
Superior 15.0 1.0 1.0 1.0 11.0 0.0 0.0 0.0 2.0
CR 3.0 0.0 0.0 0.0 2.0 0.0 0.0 0.0 0.0
HDG 5.0 1.0 1.0 1.0 4.0 0.0 0.0 0.0 0.0
PPG 7.0 0.0 0.0 0.0 5.0 0.0 0.0 0.0 2.0
The problem is that I want the last 4 rows, the ones that start with Superior, to be placed right after the Total row, i.e. before the Regular block. So, simply, I want to switch the positions of the last 4 rows with the 4 rows that start with Regular. How can I achieve this in pandas? So that I get this:
ID 5 6 7 8 9 10 11 12 13
Total 87.0 3.0 9.0 6.0 92.0 7.0 3.0 3.0 20.0
Superior 15.0 1.0 1.0 1.0 11.0 0.0 0.0 0.0 2.0
CR 3.0 0.0 0.0 0.0 2.0 0.0 0.0 0.0 0.0
HDG 5.0 1.0 1.0 1.0 4.0 0.0 0.0 0.0 0.0
PPG 7.0 0.0 0.0 0.0 5.0 0.0 0.0 0.0 2.0
Regular 72.0 2.0 8.0 5.0 81.0 7.0 3.0 3.0 18.0
CR 22.0 0.0 0.0 0.0 17.0 0.0 0.0 0.0 3.0
HDG 20.0 0.0 0.0 0.0 24.0 4.0 0.0 0.0 1.0
PPG 30.0 2.0 8.0 5.0 40.0 3.0 3.0 3.0 14.0
A more generalized solution uses Categorical and argsort. I know this df is ordered, so ffill is safe here:
s=df.ID
s=s.where(s.isin(['Total','Regular','Superior'])).ffill()
s=pd.Categorical(s,['Total','Superior','Regular'],ordered=True)
df=df.iloc[np.argsort(s)]
df
Out[188]:
ID 5 6 7 8 9 10 11 12 13
0 Total 87.0 3.0 9.0 6.0 92.0 7.0 3.0 3.0 20.0
5 Superior 15.0 1.0 1.0 1.0 11.0 0.0 0.0 0.0 2.0
6 CR 3.0 0.0 0.0 0.0 2.0 0.0 0.0 0.0 0.0
7 HDG 5.0 1.0 1.0 1.0 4.0 0.0 0.0 0.0 0.0
8 PPG 7.0 0.0 0.0 0.0 5.0 0.0 0.0 0.0 2.0
1 Regular 72.0 2.0 8.0 5.0 81.0 7.0 3.0 3.0 18.0
2 CR 22.0 0.0 0.0 0.0 17.0 0.0 0.0 0.0 3.0
3 HDG 20.0 0.0 0.0 0.0 24.0 4.0 0.0 0.0 1.0
4 PPG 30.0 2.0 8.0 5.0 40.0 3.0 3.0 3.0 14.0
Here's one way:
import numpy as np
df.iloc[1:,:] = np.roll(df.iloc[1:,:].values, 4, axis=0)
ID 5 6 7 8 9 10 11 12 13
0 Total 87.0 3.0 9.0 6.0 92.0 7.0 3.0 3.0 20.0
1 Superior 15.0 1.0 1.0 1.0 11.0 0.0 0.0 0.0 2.0
2 CR 3.0 0.0 0.0 0.0 2.0 0.0 0.0 0.0 0.0
3 HDG 5.0 1.0 1.0 1.0 4.0 0.0 0.0 0.0 0.0
4 PPG 7.0 0.0 0.0 0.0 5.0 0.0 0.0 0.0 2.0
5 Regular 72.0 2.0 8.0 5.0 81.0 7.0 3.0 3.0 18.0
6 CR 22.0 0.0 0.0 0.0 17.0 0.0 0.0 0.0 3.0
7 HDG 20.0 0.0 0.0 0.0 24.0 4.0 0.0 0.0 1.0
8 PPG 30.0 2.0 8.0 5.0 40.0 3.0 3.0 3.0 14.0
For a specific answer to this question, just use iloc
df.iloc[[0,5,6,7,8,1,2,3,4],:]
For a more generalized solution,
m = (df.ID.eq('Superior') | df.ID.eq('Regular')).cumsum()
pd.concat([df[m==0], df[m==2], df[m==1]])
or
order = (2,1)
pd.concat([df[m==0], *[df[m==c] for c in order]])
where order defines the mapping from previous ordering to new ordering.
According to Enders' Applied Econometric Time Series, the second difference of a variable y is defined as Δ²yₜ = (yₜ − yₜ₋₁) − (yₜ₋₁ − yₜ₋₂) = yₜ − 2yₜ₋₁ + yₜ₋₂.
Pandas provides the diff function, which takes periods as an argument. Nevertheless, df.diff(2) gives a different result from df.diff().diff().
Code excerpt showing the above:
In [8]: df
Out[8]:
C.1 C.2 C.3 C.4 C.5 C.6
C.0
1990 16.0 6.0 256.0 216.0 65536 4352
1991 17.0 7.0 289.0 343.0 131072 5202
1992 6.0 -4.0 36.0 -64.0 64 252
1993 7.0 -3.0 49.0 -27.0 128 392
1994 8.0 -2.0 64.0 -8.0 256 576
1995 13.0 3.0 169.0 27.0 8192 2366
1996 10.0 0.5 100.0 0.5 1024 1100
1997 11.0 1.0 121.0 1.0 2048 1452
1998 4.0 -6.0 16.0 -216.0 16 80
1999 5.0 -5.0 25.0 -125.0 32 150
2000 18.0 8.0 324.0 512.0 262144 6156
2001 3.0 -7.0 9.0 -343.0 8 36
2002 0.5 -10.0 0.5 -1000.0 48 20
2003 1.0 -9.0 1.0 -729.0 2 2
2004 14.0 4.0 196.0 64.0 16384 2940
2005 15.0 5.0 225.0 125.0 32768 3600
2006 12.0 2.0 144.0 8.0 4096 1872
2007 9.0 -1.0 81.0 -1.0 512 810
2008 2.0 -8.0 4.0 -512.0 4 12
2009 19.0 9.0 361.0 729.0 524288 7220
In [9]: df.diff(2)
Out[9]:
C.1 C.2 C.3 C.4 C.5 C.6
C.0
1990 NaN NaN NaN NaN NaN NaN
1991 NaN NaN NaN NaN NaN NaN
1992 -10.0 -10.0 -220.0 -280.0 -65472.0 -4100.0
1993 -10.0 -10.0 -240.0 -370.0 -130944.0 -4810.0
1994 2.0 2.0 28.0 56.0 192.0 324.0
1995 6.0 6.0 120.0 54.0 8064.0 1974.0
1996 2.0 2.5 36.0 8.5 768.0 524.0
1997 -2.0 -2.0 -48.0 -26.0 -6144.0 -914.0
1998 -6.0 -6.5 -84.0 -216.5 -1008.0 -1020.0
1999 -6.0 -6.0 -96.0 -126.0 -2016.0 -1302.0
2000 14.0 14.0 308.0 728.0 262128.0 6076.0
2001 -2.0 -2.0 -16.0 -218.0 -24.0 -114.0
2002 -17.5 -18.0 -323.5 -1512.0 -262096.0 -6136.0
2003 -2.0 -2.0 -8.0 -386.0 -6.0 -34.0
2004 13.5 14.0 195.5 1064.0 16336.0 2920.0
2005 14.0 14.0 224.0 854.0 32766.0 3598.0
2006 -2.0 -2.0 -52.0 -56.0 -12288.0 -1068.0
2007 -6.0 -6.0 -144.0 -126.0 -32256.0 -2790.0
2008 -10.0 -10.0 -140.0 -520.0 -4092.0 -1860.0
2009 10.0 10.0 280.0 730.0 523776.0 6410.0
In [10]: df.diff().diff()
Out[10]:
C.1 C.2 C.3 C.4 C.5 C.6
C.0
1990 NaN NaN NaN NaN NaN NaN
1991 NaN NaN NaN NaN NaN NaN
1992 -12.0 -12.0 -286.0 -534.0 -196544.0 -5800.0
1993 12.0 12.0 266.0 444.0 131072.0 5090.0
1994 0.0 0.0 2.0 -18.0 64.0 44.0
1995 4.0 4.0 90.0 16.0 7808.0 1606.0
1996 -8.0 -7.5 -174.0 -61.5 -15104.0 -3056.0
1997 4.0 3.0 90.0 27.0 8192.0 1618.0
1998 -8.0 -7.5 -126.0 -217.5 -3056.0 -1724.0
1999 8.0 8.0 114.0 308.0 2048.0 1442.0
2000 12.0 12.0 290.0 546.0 262096.0 5936.0
2001 -28.0 -28.0 -614.0 -1492.0 -524248.0 -12126.0
2002 12.5 12.0 306.5 198.0 262176.0 6104.0
2003 3.0 4.0 9.0 928.0 -86.0 -2.0
2004 12.5 12.0 194.5 522.0 16428.0 2956.0
2005 -12.0 -12.0 -166.0 -732.0 2.0 -2278.0
2006 -4.0 -4.0 -110.0 -178.0 -45056.0 -2388.0
2007 0.0 0.0 18.0 108.0 25088.0 666.0
2008 -4.0 -4.0 -14.0 -502.0 3076.0 264.0
2009 24.0 24.0 434.0 1752.0 524792.0 8006.0
In [11]: df.diff(2) - df.diff().diff()
Out[11]:
C.1 C.2 C.3 C.4 C.5 C.6
C.0
1990 NaN NaN NaN NaN NaN NaN
1991 NaN NaN NaN NaN NaN NaN
1992 2.0 2.0 66.0 254.0 131072.0 1700.0
1993 -22.0 -22.0 -506.0 -814.0 -262016.0 -9900.0
1994 2.0 2.0 26.0 74.0 128.0 280.0
1995 2.0 2.0 30.0 38.0 256.0 368.0
1996 10.0 10.0 210.0 70.0 15872.0 3580.0
1997 -6.0 -5.0 -138.0 -53.0 -14336.0 -2532.0
1998 2.0 1.0 42.0 1.0 2048.0 704.0
1999 -14.0 -14.0 -210.0 -434.0 -4064.0 -2744.0
2000 2.0 2.0 18.0 182.0 32.0 140.0
2001 26.0 26.0 598.0 1274.0 524224.0 12012.0
2002 -30.0 -30.0 -630.0 -1710.0 -524272.0 -12240.0
2003 -5.0 -6.0 -17.0 -1314.0 80.0 -32.0
2004 1.0 2.0 1.0 542.0 -92.0 -36.0
2005 26.0 26.0 390.0 1586.0 32764.0 5876.0
2006 2.0 2.0 58.0 122.0 32768.0 1320.0
2007 -6.0 -6.0 -162.0 -234.0 -57344.0 -3456.0
2008 -6.0 -6.0 -126.0 -18.0 -7168.0 -2124.0
2009 -14.0 -14.0 -154.0 -1022.0 -1016.0 -1596.0
Why are they different? Which one corresponds to the one defined in Enders' book?
This is precisely because
Δ²yₜ = yₜ − 2yₜ₋₁ + yₜ₋₂ ≠ yₜ − yₜ₋₂.
The left-hand side is df.diff().diff(), whereas the right-hand side is df.diff(2). For the second difference (the difference of the differences), you want the left-hand side.
Consider a DataFrame df with a single column of values:
a
b
c
d
df.diff() is
NaN
b - a
c - b
d - c
df.diff(2) is
NaN
NaN
c - a
d - b
df.diff().diff() is
NaN
NaN
(c - b) - (b - a) = c - 2b + a
(d - c) - (c - b) = d - 2c + b
They're not the same, mathematically.
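A quick numeric check, as a sketch (using the first few values of column C.1 above), that the shifted formula matches df.diff().diff() rather than df.diff(2):

import pandas as pd

# y_t - 2*y_{t-1} + y_{t-2} is the second difference
s = pd.Series([16.0, 17.0, 6.0, 7.0, 8.0])     # first values of C.1
second_diff = s - 2 * s.shift(1) + s.shift(2)

print(second_diff.equals(s.diff().diff()))     # True  -> same as the diff of the diff
print(second_diff.equals(s.diff(2)))           # False -> diff(2) is y_t - y_{t-2}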
I have a Pandas data frame which you might describe as “normalized”. For display purposes, I want to “de-normalize” the data: that is, I want to take data that is spread across multiple key values and put it on the same row in the output records. Some records need to be summed as they are combined. (Aside: if anyone has a better term for this than “denormalization”, please make an edit to this question, or say so in the comments.)
I am working with a pandas data frame with many columns, so I will show you a simplified version below.
The following code sets up a (nearly) normalized source data frame. (Note that I am looking for advice on the second code block, and this code block is just to provide some context.) Similar to my actual data, there are some duplications in the identifying data, and some numbers to be summed:
import numpy as np
import pandas as pd
dates = pd.date_range('20170701', periods=21)
datesA1 = pd.date_range('20170701', periods=11)
datesB1 = pd.date_range('20170705', periods=9)
datesA2 = pd.date_range('20170708', periods=10)
datesB2 = pd.date_range('20170710', periods=11)
datesC1 = pd.date_range('20170701', periods=5)
datesC2 = pd.date_range('20170709', periods=9)
cols=['Date','Type','Count']
df_A1 = pd.DataFrame({'Date': datesA1,
                      'Type': 'Apples',
                      'Count': np.random.randint(30, size=11)})
df_A2 = pd.DataFrame({'Date': datesA2,
                      'Type': 'Apples',
                      'Count': np.random.randint(30, size=10)})
df_B1 = pd.DataFrame({'Date': datesB1,
                      'Type': 'Berries',
                      'Count': np.random.randint(30, size=9)})
df_B2 = pd.DataFrame({'Date': datesB2,
                      'Type': 'Berries',
                      'Count': np.random.randint(30, size=11)})
df_C1 = pd.DataFrame({'Date': datesC1,
                      'Type': 'Canteloupes',
                      'Count': np.random.randint(30, size=5)})
df_C2 = pd.DataFrame({'Date': datesC2,
                      'Type': 'Canteloupes',
                      'Count': np.random.randint(30, size=9)})
frames = [df_A1, df_A2, df_B1, df_B2, df_C1, df_C2]
dat_fra_source = pd.concat(frames)
Further, the following code achieves my intention. The source data frame has multiple rows per date and type of fruit (A, B, and C). The destination data has a single row per day, with a sum of A, B, and C.
dat_fra_dest = pd.DataFrame(0, index=dates, columns=['Apples','Berries','Canteloupes'])
for index, row in dat_fra_source.iterrows():
    dat_fra_dest.at[row['Date'], row['Type']] += row['Count']
My question is whether there is a cleaner way to do this: a way that doesn't require the zero-initialization and/or a way that operates on the entire data frame instead of row by row. I am also skeptical that my implementation is efficient. I'll also note that while I am only dealing with “Count” in this simplified example, I have additional columns in my real-world data: think that for A, B, and C there is not only a count, but also a weight and a volume.
Option 1
dat_fra_source.groupby(['Date','Type']).sum().unstack().fillna(0)
Out[63]:
Count
Type Apples Berries Canteloupes
Date
2017-07-01 13.0 0.0 24.0
2017-07-02 18.0 0.0 16.0
2017-07-03 11.0 0.0 29.0
2017-07-04 13.0 0.0 7.0
2017-07-05 24.0 11.0 23.0
2017-07-06 6.0 4.0 0.0
2017-07-07 29.0 26.0 0.0
2017-07-08 31.0 19.0 0.0
2017-07-09 38.0 17.0 26.0
2017-07-10 57.0 54.0 1.0
2017-07-11 4.0 41.0 10.0
2017-07-12 16.0 28.0 23.0
2017-07-13 25.0 20.0 20.0
2017-07-14 19.0 6.0 15.0
2017-07-15 6.0 22.0 7.0
2017-07-16 16.0 0.0 5.0
2017-07-17 29.0 7.0 4.0
2017-07-18 0.0 21.0 0.0
2017-07-19 0.0 19.0 0.0
2017-07-20 0.0 8.0 0.0
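A small variant of Option 1, as a sketch: selecting the Count column before aggregating avoids the extra 'Count' level in the result's columns, and unstack can fill the missing Date/Type pairs directly:

# Same idea as Option 1, but without the extra column level and without fillna
dat_fra_dest = (dat_fra_source
                .groupby(['Date', 'Type'])['Count']
                .sum()
                .unstack(fill_value=0))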
Option 2
pd.pivot_table(dat_fra_source,index=['Date'],columns=['Type'],values='Count',aggfunc=sum).fillna(0)
Out[75]:
Type Apples Berries Canteloupes
Date
2017-07-01 13.0 0.0 24.0
2017-07-02 18.0 0.0 16.0
2017-07-03 11.0 0.0 29.0
2017-07-04 13.0 0.0 7.0
2017-07-05 24.0 11.0 23.0
2017-07-06 6.0 4.0 0.0
2017-07-07 29.0 26.0 0.0
2017-07-08 31.0 19.0 0.0
2017-07-09 38.0 17.0 26.0
2017-07-10 57.0 54.0 1.0
2017-07-11 4.0 41.0 10.0
2017-07-12 16.0 28.0 23.0
2017-07-13 25.0 20.0 20.0
2017-07-14 19.0 6.0 15.0
2017-07-15 6.0 22.0 7.0
2017-07-16 16.0 0.0 5.0
2017-07-17 29.0 7.0 4.0
2017-07-18 0.0 21.0 0.0
2017-07-19 0.0 19.0 0.0
2017-07-20 0.0 8.0 0.0
And assuming you have columns vol and weight
dat_fra_source['vol']=2
dat_fra_source['weight']=2
dat_fra_source.groupby(['Date','Type']).apply(lambda x: sum(x['vol']*x['weight']*x['Count'])).unstack().fillna(0)
Out[88]:
Type Apples Berries Canteloupes
Date
2017-07-01 52.0 0.0 96.0
2017-07-02 72.0 0.0 64.0
2017-07-03 44.0 0.0 116.0
2017-07-04 52.0 0.0 28.0
2017-07-05 96.0 44.0 92.0
2017-07-06 24.0 16.0 0.0
2017-07-07 116.0 104.0 0.0
2017-07-08 124.0 76.0 0.0
2017-07-09 152.0 68.0 104.0
2017-07-10 228.0 216.0 4.0
2017-07-11 16.0 164.0 40.0
2017-07-12 64.0 112.0 92.0
2017-07-13 100.0 80.0 80.0
2017-07-14 76.0 24.0 60.0
2017-07-15 24.0 88.0 28.0
2017-07-16 64.0 0.0 20.0
2017-07-17 116.0 28.0 16.0
2017-07-18 0.0 84.0 0.0
2017-07-19 0.0 76.0 0.0
2017-07-20 0.0 32.0 0.0
Use pd.crosstab:
pd.crosstab(dat_fra_source['Date'],
            dat_fra_source['Type'],
            dat_fra_source['Count'],
            aggfunc='sum',
            dropna=False).fillna(0)
Output:
Type Apples Berries Canteloupes
Date
2017-07-01 19.0 0.0 4.0
2017-07-02 25.0 0.0 4.0
2017-07-03 11.0 0.0 26.0
2017-07-04 27.0 0.0 8.0
2017-07-05 8.0 18.0 12.0
2017-07-06 10.0 11.0 0.0
2017-07-07 6.0 17.0 0.0
2017-07-08 10.0 5.0 0.0
2017-07-09 51.0 25.0 16.0
2017-07-10 31.0 23.0 21.0
2017-07-11 35.0 40.0 10.0
2017-07-12 16.0 30.0 9.0
2017-07-13 13.0 23.0 20.0
2017-07-14 21.0 26.0 27.0
2017-07-15 20.0 17.0 19.0
2017-07-16 12.0 4.0 2.0
2017-07-17 27.0 0.0 5.0
2017-07-18 0.0 5.0 0.0
2017-07-19 0.0 26.0 0.0
2017-07-20 0.0 6.0 0.0