Reshaping a multiindex pandas dataframe - python

I have a pandas dataframe with multiindex columns that looks like this:
ID I II III
METRIC a b c d a b c d a b c d
2015-08-01 0 1 2 3 20 21 22 23 40 41 42 43
2015-08-02 4 5 6 7 24 25 26 27 44 45 46 47
2015-08-03 8 9 10 11 28 29 30 31 48 49 50 51
where it is indexed by the dates (2015-08-01, 2015-08-02, 2015-08-03, etc.), the first-level columns (I, II, III) are IDs and the second-level columns are corresponding METRICs (a, b, c, d). I would like to reshape it to the following
METRIC a b c d
ID
I 2015-08-01 0 1 2 3
2015-08-02 4 5 6 7
2015-08-03 8 9 10 11
II 2015-08-01 20 21 22 23
2015-08-02 24 25 26 27
2015-08-03 28 29 30 31
III 2015-08-01 40 41 42 43
2015-08-02 44 45 46 47
2015-08-03 48 49 50 51
I have (unsuccessfully) looked into using .pivot, .stack, and .melt, but they don't give me what I am looking for. I currently loop over IDs and build a list of dataframes and concat them together as a new dataframe to get what I want.
Any suggestions would be greatly appreciated.

Let's use stack, swaplevel and sort_index:
df.stack(0).swaplevel(0,1).sort_index()
Output:
METRIC a b c d
ID
I 2015-08-01 0 1 2 3
2015-08-02 4 5 6 7
2015-08-03 8 9 10 11
II 2015-08-01 20 21 22 23
2015-08-02 24 25 26 27
2015-08-03 28 29 30 31
III 2015-08-01 40 41 42 43
2015-08-02 44 45 46 47
2015-08-03 48 49 50 51

You can let transpose or T do some of the work for you.
df.T.stack().unstack(1)
METRIC a b c d
ID
I 2015-08-01 0 1 2 3
2015-08-02 4 5 6 7
2015-08-03 8 9 10 11
II 2015-08-01 20 21 22 23
2015-08-02 24 25 26 27
2015-08-03 28 29 30 31
III 2015-08-01 40 41 42 43
2015-08-02 44 45 46 47
2015-08-03 48 49 50 51

Using @piRSquared's method, we can skip the transpose and just use df.unstack().unstack(1)
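As a rough check that the three approaches above agree, here is a minimal sketch that rebuilds the example frame and applies each of them. The way df is constructed (the date range, the value offsets, the variable names) is an assumption made for illustration and is not taken from the original post.

```python
import numpy as np
import pandas as pd

# Rebuild the example: dates as the index, (ID, METRIC) as two column levels.
dates = pd.date_range("2015-08-01", periods=3)
columns = pd.MultiIndex.from_product(
    [["I", "II", "III"], list("abcd")], names=["ID", "METRIC"]
)
values = np.hstack([np.arange(12).reshape(3, 4) + offset for offset in (0, 20, 40)])
df = pd.DataFrame(values, index=dates, columns=columns)

# 1) Move the ID level into the index, put it first, then sort.
stacked = df.stack(0).swaplevel(0, 1).sort_index()

# 2) Transpose, stack the dates, then spread METRIC back out as columns.
via_transpose = df.T.stack().unstack(1)

# 3) Same idea without the transpose.
via_unstack = df.unstack().unstack(1)

print(stacked)
```

All three results are indexed by (ID, date) with METRIC as the columns, so which one to use is mostly a matter of taste.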

Related

how to split an integer value from one column to two columns in text file using pandas or numpy (python)

I have a text file which has a number of integer values like this.
20180701 20180707 52 11 1 2 4 1 0 0 10 7 1 3 1 0 4 5 2
20180708 20180714 266 8 19 3 2 9 7 25 20 17 12 9 9 27 34 54 11
20180715 20180721 654 52 34 31 20 16 12 25 84 31 38 37 38 69 66 87 14
20180722 201807281017 110 72 46 52 29 29 22 204 41 46 51 57 67 82 92 17
20180729 201808041106 276 37 11 87 20 10 8 284 54 54 72 38 49 41 53 12
20180805 20180811 624 78 19 15 55 16 8 9 172 15 31 35 38 47 29 36 21
20180812 20180818 488 63 17 7 26 10 9 7 116 17 14 39 31 34 27 64 7
20180819 20180825 91 4 7 0 4 5 1 3 16 3 4 5 10 10 7 11 1
20180826 20180901 49 2 2 1 0 4 0 1 2 0 1 4 8 2 6 6 10
I have to build a file by merging several files like this, but you can see a problem with the data.
In lines 4 and 5, the first values, 1017 and 1106, run directly into the period index next to them.
When I try to read these two lines, I always get the result below.
The first value in the first column next to the index columns is not recognized as a separate value.
In [14]: fw.iloc[80,:]
Out[14]:
3 72.0
4 46.0
5 52.0
6 29.0
7 29.0
8 22.0
9 204.0
10 41.0
11 46.0
12 51.0
13 57.0
14 67.0
15 82.0
16 92.0
17 17.0
18 NaN
Name: (20180722, 201807281017), dtype: float64
I tried to correct it with indexing but failed.
The desired result is:
In [14]: fw.iloc[80,:]
Out[14]:
2 1017.0
3 110.0
4 72.0
5 46.0
6 52.0
7 29.0
8 29.0
9 22.0
10 204.0
11 41.0
12 46.0
13 51.0
14 57.0
15 67.0
16 82.0
17 92.0
18 17.0
Name: (20180722, 201807281017), dtype: float64
How can I solve this problem?
Edit: I used this code to read the file.
fw = pd.read_csv('warm_patient.txt', index_col=[0,1], header=None, delim_whitespace=True)
A better fit for this would be pandas.read_fwf. For your example:
df = pd.read_fwf(filename, index_col=[0,1], header=None, widths=2*[10]+17*[4])
I don't know if the column widths can be inferred for all your data or need to be hardcoded.
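The fixed-width idea can be sketched on a small in-memory sample. The layout used here (two 8-character dates separated by two spaces, followed by right-aligned 4-character fields) is an assumption, since the original spacing of the file is not visible in the post. The sample rows are shortened and the colspecs would need adjusting to the real file.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample with an assumed layout: two 8-character dates separated
# by two spaces, then right-aligned 4-character value fields.
rows = [
    (20180715, 20180721, [654, 52, 34]),
    (20180722, 20180728, [1017, 110, 72]),  # 1017 touches the second date
]
text = "\n".join(
    f"{start}  {end}" + "".join(f"{v:>4}" for v in vals) for start, end, vals in rows
)

# Explicit (start, stop) character positions for every field.
colspecs = [(0, 8), (10, 18), (18, 22), (22, 26), (26, 30)]
fw = pd.read_fwf(StringIO(text), colspecs=colspecs, header=None, index_col=[0, 1])

print(fw)
```

Because the fields are cut by position rather than by whitespace, a 4-digit value that touches the preceding date is still read as its own column.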
One possibility would be to construct the dataframe manually. This way we can parse the text by splitting the values every 4 characters.
from textwrap import wrap
import pandas as pd
def read_file(f_name):
    data = []
    with open(f_name) as f:
        for line in f.readlines():
            idx1 = line[0:8]
            idx2 = line[10:18]
            points = map(lambda x: int(x.replace(" ", "")), wrap(line.rstrip()[18:], 4))
            data.append([idx1, idx2, *points])
    return pd.DataFrame(data).set_index([0, 1])
It could be made somewhat more efficient (in particular if this is a particularly long text file), but here's one solution.
fw = pd.read_csv('test.txt', header=None, delim_whitespace=True)
for i in fw[pd.isna(fw.iloc[:,-1])].index:
    num_str = str(fw.iat[i,1])
    a,b = map(int,[num_str[:-4],num_str[-4:]])
    fw.iloc[i,3:] = fw.iloc[i,2:-1]
    fw.iloc[i,:3] = [fw.iat[i,0],a,b]
fw = fw.set_index([0,1])
The result of print(fw) from there is
2 3 4 5 6 7 8 9 10 11 12 13 14 15 \
0 1
20180701 20180707 52 11 1 2 4 1 0 0 10 7 1 3 1 0
20180708 20180714 266 8 19 3 2 9 7 25 20 17 12 9 9 27
20180715 20180721 654 52 34 31 20 16 12 25 84 31 38 37 38 69
20180722 20180728 1017 110 72 46 52 29 29 22 204 41 46 51 57 67
20180729 20180804 1106 276 37 11 87 20 10 8 284 54 54 72 38 49
20180805 20180811 624 78 19 15 55 16 8 9 172 15 31 35 38 47
20180812 20180818 488 63 17 7 26 10 9 7 116 17 14 39 31 34
20180819 20180825 91 4 7 0 4 5 1 3 16 3 4 5 10 10
20180826 20180901 49 2 2 1 0 4 0 1 2 0 1 4 8 2
16 17 18
0 1
20180701 20180707 4 5 2.0
20180708 20180714 34 54 11.0
20180715 20180721 66 87 14.0
20180722 20180728 82 92 17.0
20180729 20180804 41 53 12.0
20180805 20180811 29 36 21.0
20180812 20180818 27 64 7.0
20180819 20180825 7 11 1.0
20180826 20180901 6 6 10.0
Here's the result of the print after applying your initial solution of fw = pd.read_csv('test.txt', index_col=[0,1], header=None, delim_whitespace=True) for comparison.
2 3 4 5 6 7 8 9 10 11 12 13 14 \
0 1
20180701 20180707 52 11 1 2 4 1 0 0 10 7 1 3 1
20180708 20180714 266 8 19 3 2 9 7 25 20 17 12 9 9
20180715 20180721 654 52 34 31 20 16 12 25 84 31 38 37 38
20180722 201807281017 110 72 46 52 29 29 22 204 41 46 51 57 67
20180729 201808041106 276 37 11 87 20 10 8 284 54 54 72 38 49
20180805 20180811 624 78 19 15 55 16 8 9 172 15 31 35 38
20180812 20180818 488 63 17 7 26 10 9 7 116 17 14 39 31
20180819 20180825 91 4 7 0 4 5 1 3 16 3 4 5 10
20180826 20180901 49 2 2 1 0 4 0 1 2 0 1 4 8
15 16 17 18
0 1
20180701 20180707 0 4 5 2.0
20180708 20180714 27 34 54 11.0
20180715 20180721 69 66 87 14.0
20180722 201807281017 82 92 17 NaN
20180729 201808041106 41 53 12 NaN
20180805 20180811 47 29 36 21.0
20180812 20180818 34 27 64 7.0
20180819 20180825 10 7 11 1.0
20180826 20180901 2 6 6 10.0

Create bi-weekly and monthly labels with week numbers in pandas

I have a dataframe with profit values, IDs, and week values. It looks a little like this
ID Week Profit
A 1 2
A 2 2
A 3 0
A 4 0
I want to create two new columns called "Bi-Weekly" and "Monthly". Week 1 and week 2 would both get the label 2, week 3 and week 4 would both get the label 4, and all four would be labeled month 1, so I could group by weekly, bi-weekly, or monthly profit as needed. Right now I've created two functions that work, but the weeks will go up to a full year (52 weeks), so I was wondering if there's a more efficient way. My bi-weekly function is below.
def biweek(prof_calc):
    if (prof_calc['week']==2):
        return 2
    elif (prof_calc['week']==3):
        return 2
    elif (prof_calc['week']==4):
        return 4
    elif (prof_calc['week']==5):
        return 4
    elif (prof_calc['week']==6):
        return 6
    elif (prof_calc['week']==7):
        return 6
    elif (prof_calc['week']==8):
        return 8
    elif (prof_calc['week']==9):
        return 8
    elif (prof_calc['week']==10):
        return 10
    elif (prof_calc['week']==11):
        return 10
prof_calc['BiWeek'] = prof_calc.apply(biweek, axis=1)
IIUC, you could try:
df["Biweekly"] = (df["Week"]-1)//2+1
df["Monthly"] = (df["Week"]-1)//4+1
>>> df
ID Week Profit Biweekly Monthly
0 A 1 42 1 1
1 A 2 69 1 1
2 A 3 53 2 1
3 A 4 63 2 1
4 A 5 56 3 2
5 A 6 57 3 2
6 A 7 86 4 2
7 A 8 23 4 2
8 A 9 35 5 3
9 A 10 10 5 3
10 A 11 25 6 3
11 A 12 21 6 3
12 A 13 39 7 4
13 A 14 82 7 4
14 A 15 76 8 4
15 A 16 20 8 4
16 A 17 97 9 5
17 A 18 67 9 5
18 A 19 21 10 5
19 A 20 22 10 5
20 A 21 88 11 6
21 A 22 67 11 6
22 A 23 33 12 6
23 A 24 38 12 6
24 A 25 8 13 7
25 A 26 67 13 7
26 A 27 16 14 7
27 A 28 49 14 7
28 A 29 3 15 8
29 A 30 17 15 8
30 A 31 79 16 8
31 A 32 19 16 8
32 A 33 21 17 9
33 A 34 9 17 9
34 A 35 56 18 9
35 A 36 83 18 9
36 A 37 1 19 10
37 A 38 53 19 10
38 A 39 66 20 10
39 A 40 55 20 10
40 A 41 85 21 11
41 A 42 90 21 11
42 A 43 34 22 11
43 A 44 3 22 11
44 A 45 9 23 12
45 A 46 28 23 12
46 A 47 58 24 12
47 A 48 14 24 12
48 A 49 42 25 13
49 A 50 69 25 13
50 A 51 76 26 13
51 A 52 49 26 13
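The answer above numbers the two-week periods 1, 2, 3, ..., whereas the question labels each period by its last week (2, 2, 4, 4, ...). A minimal sketch of how both labelings and the grouped sums could be produced follows. The Profit values and the BiweekEnd column name are hypothetical, chosen only for illustration.

```python
import pandas as pd

# Hypothetical sample: eight weeks of profit for a single ID.
df = pd.DataFrame(
    {"ID": ["A"] * 8, "Week": range(1, 9), "Profit": [2, 2, 0, 0, 1, 3, 5, 7]}
)

# Consecutive period numbers, as in the answer above.
df["Biweekly"] = (df["Week"] - 1) // 2 + 1
df["Monthly"] = (df["Week"] - 1) // 4 + 1

# Label each two-week period by its last week (2, 2, 4, 4, ...), as in the question.
df["BiweekEnd"] = df["Biweekly"] * 2

# Profit per ID at each granularity.
biweekly_profit = df.groupby(["ID", "Biweekly"])["Profit"].sum()
monthly_profit = df.groupby(["ID", "Monthly"])["Profit"].sum()

print(df)
print(monthly_profit)
```

Note that "Monthly" here means a block of four weeks, so 52 weeks yield 13 such blocks rather than 12 calendar months.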

Categorise hour into four different slots of 15 mins

I am working on a dataframe and I want to group the data for each hour into four slots of 15 minutes:
0-15 - 1st slot
15-30 - 2nd slot
30-45 - 3rd slot
45-00(or 60) - 4th slot
I cannot work out how to go forward with this.
I tried extracting hours, minutes and seconds from the time, but what should I do next?
Use integer division by 15 and then add 1:
df = pd.DataFrame({'M': range(60)})
df['slot'] = df['M'] // 15 + 1
print (df)
M slot
0 0 1
1 1 1
2 2 1
3 3 1
4 4 1
5 5 1
6 6 1
7 7 1
8 8 1
9 9 1
10 10 1
11 11 1
12 12 1
13 13 1
14 14 1
15 15 2
16 16 2
17 17 2
18 18 2
19 19 2
20 20 2
21 21 2
22 22 2
23 23 2
24 24 2
25 25 2
26 26 2
27 27 2
28 28 2
29 29 2
30 30 3
31 31 3
32 32 3
33 33 3
34 34 3
35 35 3
36 36 3
37 37 3
38 38 3
39 39 3
40 40 3
41 41 3
42 42 3
43 43 3
44 44 3
45 45 4
46 46 4
47 47 4
48 48 4
49 49 4
50 50 4
51 51 4
52 52 4
53 53 4
54 54 4
55 55 4
56 56 4
57 57 4
58 58 4
59 59 4
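If the data holds full timestamps rather than a column of minutes, the same integer division can be applied through the .dt accessor. This is a minimal sketch. The column names time and slot_start and the sample timestamps are assumptions for illustration.

```python
import pandas as pd

# Hypothetical timestamps within one hour.
df = pd.DataFrame(
    {
        "time": pd.to_datetime(
            [
                "2021-01-01 10:00:00",
                "2021-01-01 10:14:59",
                "2021-01-01 10:15:00",
                "2021-01-01 10:44:10",
                "2021-01-01 10:59:59",
            ]
        )
    }
)

# Slot number within the hour: minutes 0-14 -> 1, 15-29 -> 2, 30-44 -> 3, 45-59 -> 4.
df["slot"] = df["time"].dt.minute // 15 + 1

# Start of the 15-minute window, useful as a groupby key across many hours.
df["slot_start"] = df["time"].dt.floor("15min")

print(df)
```

Grouping by slot_start keeps windows from different hours apart, while grouping by slot alone pools the same quarter of every hour.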

Place data from a Pandas DF into a Grid or Template

I have a process whose end product is a pandas DataFrame. The output varies in both data and length, and is structured like this example:
9 80340796
10 80340797
11 80340798
12 80340799
13 80340800
14 80340801
15 80340802
16 80340803
17 80340804
18 80340805
19 80340806
20 80340807
21 80340808
22 80340809
23 80340810
24 80340811
25 80340812
26 80340813
27 80340814
28 80340815
29 80340816
30 80340817
31 80340818
32 80340819
33 80340820
34 80340821
35 80340822
36 80340823
37 80340824
38 80340825
39 80340826
40 80340827
41 80340828
42 80340829
43 80340830
44 80340831
45 80340832
46 80340833
I need to get the numbers in the second column above, into the following grid format based on the numbers in the first column above.
1 2 3 4 5 6 7 8 9 10 11 12
A 1 9 17 25 33 41 49 57 65 73 81 89
B 2 10 18 26 34 42 50 58 66 74 82 90
C 3 11 19 27 35 43 51 59 67 75 83 91
D 4 12 20 28 36 44 52 60 68 76 84 92
E 5 13 21 29 37 45 53 61 69 77 85 93
F 6 14 22 30 38 46 54 62 70 78 86 94
G 7 15 23 31 39 47 55 63 71 79 87 95
H 8 16 24 32 40 48 56 64 72 80 88 96
So the end result in this example would be
Any advice on how to go about this would be much appreciated. I've been asked for this by a colleague, so the data is easy to read for their team (as it matches the layout of a physical test) but I have no idea how to produce it.
A pandas pivot table can do what you want, but first you have to create two auxiliary columns: one determining which column each value goes in, and another determining which row. You can get them as shown in the following example:
import numpy as np
import pandas as pd
df = pd.DataFrame({'num': list(range(9, 28)), 'val': list(range(80001, 80020))})
max_rows = 8
df['row'] = (df['num']-1)%max_rows
df['col'] = np.ceil(df['num']/max_rows).astype(int)
df.pivot_table(values=['val'], columns=['col'], index=['row'])
val
col 2 3 4
row
0 80001.0 80009.0 80017.0
1 80002.0 80010.0 80018.0
2 80003.0 80011.0 80019.0
3 80004.0 80012.0 NaN
4 80005.0 80013.0 NaN
5 80006.0 80014.0 NaN
6 80007.0 80015.0 NaN
7 80008.0 80016.0 NaN
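To get closer to the grid in the question, the rows can be relabeled A to H and the result reindexed to all 12 columns, so that empty positions show up as NaN. This is a sketch that assumes every position number appears at most once. The names num, val, n_rows, n_cols and grid are illustrative, and the data is the example output from the question.

```python
import pandas as pd

# The example output from the question: positions 9..46 and their values.
df = pd.DataFrame({"num": range(9, 47), "val": range(80340796, 80340834)})

n_rows, n_cols = 8, 12
row_labels = list("ABCDEFGH")

# Positions fill the grid column by column: 1-8 in column 1, 9-16 in column 2, ...
df["row"] = ((df["num"] - 1) % n_rows).map(dict(enumerate(row_labels)))
df["col"] = (df["num"] - 1) // n_rows + 1

# Pivot into the grid, then force the full 8 x 12 shape.
grid = df.pivot(index="row", columns="col", values="val").reindex(
    index=row_labels, columns=range(1, n_cols + 1)
)

print(grid)
```

pivot raises an error if a position occurs twice, which is a useful safeguard here; pivot_table would silently average the duplicates instead.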

performing differences between rows in pandas based on columns values

I have this dataframe, and I'm trying to create a new column that stores the difference in products sold, based on code and date.
For example, this is the starting dataframe:
date code sold
0 20150521 0 47
1 20150521 12 39
2 20150521 16 39
3 20150521 20 38
4 20150521 24 38
5 20150521 28 37
6 20150521 32 36
7 20150521 4 43
8 20150521 8 43
9 20150522 0 47
10 20150522 12 37
11 20150522 16 36
12 20150522 20 36
13 20150522 24 36
14 20150522 28 35
15 20150522 32 31
16 20150522 4 42
17 20150522 8 41
18 20150523 0 50
19 20150523 12 48
20 20150523 16 46
21 20150523 20 46
22 20150523 24 46
23 20150523 28 45
24 20150523 32 42
25 20150523 4 49
26 20150523 8 49
27 20150524 0 39
28 20150524 12 33
29 20150524 16 30
... ... ... ...
150 20150606 32 22
151 20150606 4 34
152 20150606 8 33
153 20150607 0 31
154 20150607 12 30
155 20150607 16 30
156 20150607 20 29
157 20150607 24 28
158 20150607 28 26
159 20150607 32 24
160 20150607 4 30
161 20150607 8 30
162 20150608 0 47
I think this could be a solution...
full_df1=full_df[full_df.date == '20150609'].reset_index(drop=True)
full_df1['code'] = full_df1['code'].astype(float)
full_df1 = full_df1.sort_values(['code'], ascending=[False])
code date sold
8 32 20150609 33
7 28 20150609 36
6 24 20150609 37
5 20 20150609 39
4 16 20150609 42
3 12 20150609 46
2 8 20150609 49
1 4 20150609 49
0 0 20150609 50
full_df1.set_index('code')['sold'].diff().reset_index()
That gives me back this output for the single date 20150609:
code difference
0 32 NaN
1 28 3
2 24 1
3 20 2
4 16 3
5 12 4
6 8 3
7 4 0
8 0 1
Is there a better, more Pythonic way to get the same result?
I would like to create a new column [difference] and store the data there, so that the result has 4 columns: [date, code, sold, difference].
This is exactly the kind of thing that pandas' groupby functionality is built for, and I highly recommend reading and working through the pandas groupby documentation.
This code replicates what you are asking for, but for every date.
df = pd.DataFrame({'date':['Mon','Mon','Mon','Tue','Tue','Tue'],'code':[10,21,30,10,21,30], 'sold':[12,13,34,10,15,20]})
df['difference'] = df.groupby('date')['sold'].diff()
df
code date sold difference
0 10 Mon 12 NaN
1 21 Mon 13 1
2 30 Mon 34 21
3 10 Tue 10 NaN
4 21 Tue 15 5
5 30 Tue 20 5
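The question computes the difference with code sorted in descending order within a date. A minimal sketch of combining that ordering with the groupby approach follows, using a small subset of the rows from the question (codes 0, 4 and 8 for the first two dates). Treating the descending order as a requirement is an assumption based on the asker's own attempt.

```python
import pandas as pd

# Subset of the question's data: codes 0, 4 and 8 for the first two dates.
df = pd.DataFrame(
    {
        "date": ["20150521"] * 3 + ["20150522"] * 3,
        "code": [0, 4, 8, 0, 4, 8],
        "sold": [47, 43, 43, 47, 42, 41],
    }
)

# Order rows by date, then by code descending, as in the question's attempt.
df = df.sort_values(["date", "code"], ascending=[True, False])

# diff() restarts within each date, so the first row of every date is NaN.
df["difference"] = df.groupby("date")["sold"].diff()

print(df)
```

The result keeps all four columns (date, code, sold, difference) in a single frame, with no per-date filtering or loop.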
