Reshape dataframe with multiindexed column headers from wide to long - python

I'd like to reshape a pandas dataframe from wide to long. The challenge is that the columns have multiindexed headers. The dataframe looks like this:
category price1           price2
year       2011 2012 2013   2011 2012 2013
1            33   22   48    135  144  149
2            22   26   37    136  127  129
3            39   30   47    123  148  148
4            45   42   21    140  126  121
5            20   37   35    141  142  147
6            29   20   34    122  121  132
7            20   35   45    128  123  130
8            39   34   49    125  120  131
9            24   20   36    122  146  130
10           24   37   43    142  133  138
11           23   22   40    124  135  131
12           27   22   40    147  149  132
Below is a snippet that produces this exact dataframe. You will also see that I've built it by concatenating two other dataframes.
Here's the snippet:
import pandas as pd
import numpy as np

# Make dataframe df1 with 12 observations over 3 years
# with multiindexed column headers
np.random.seed(123)
df1 = pd.DataFrame(np.random.randint(20, 50, size=(12, 3)), columns=[2011, 2012, 2013])
df1.index = np.arange(1, len(df1) + 1)
header1 = pd.MultiIndex.from_product([['price1'], df1.columns], names=['category', 'year'])
df1.columns = header1

# Make dataframe df2 with 12 observations over 3 years
# with multiindexed column headers
df2 = pd.DataFrame(np.random.randint(120, 150, size=(12, 3)), columns=[2011, 2012, 2013])
df2.index = np.arange(1, len(df2) + 1)
header2 = pd.MultiIndex.from_product([['price2'], df2.columns], names=['category', 'year'])
df2.columns = header2

df3 = pd.concat([df1, df2], axis=1)
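As an aside, a more compact way to build the same multiindexed frame is to let pd.concat create the column MultiIndex via keys. A minimal sketch under the same setup (a1, a2, and df3_alt are illustrative names, not from the original):

# build the two flat frames first (plain year columns)
a1 = pd.DataFrame(np.random.randint(20, 50, size=(12, 3)),
                  columns=[2011, 2012, 2013], index=np.arange(1, 13))
a2 = pd.DataFrame(np.random.randint(120, 150, size=(12, 3)),
                  columns=[2011, 2012, 2013], index=np.arange(1, 13))
# concat builds the ('category', 'year') column MultiIndex directly
df3_alt = pd.concat([a1, a2], axis=1, keys=['price1', 'price2'],
                    names=['category', 'year'])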
And here is the desired output:
          price1  price2
1  2011       33     135
2  2011       22     136
3  2011       39     123
4  2011       45     140
5  2011       20     141
6  2011       29     122
7  2011       20     128
8  2011       39     125
9  2011       24     122
10 2011       24     142
11 2011       23     124
12 2011       27     147
1  2012       22     144
2  2012       26     127
3  2012       30     148
4  2012       42     126
5  2012       37     142
6  2012       20     121
7  2012       35     123
8  2012       34     120
9  2012       20     146
10 2012       37     133
11 2012       22     135
12 2012       22     149
1  2013       48     149
2  2013       37     129
3  2013       47     148
4  2013       21     121
5  2013       35     147
6  2013       34     132
7  2013       45     130
8  2013       49     131
9  2013       36     130
10 2013       43     138
11 2013       40     131
12 2013       40     132
I've tried different solutions based on suggestions involving reshape and pandas.wide_to_long, but I'm struggling with the multiindexed column headers. So why not just remove them? Mostly because this is what my real-world problem will look like, and also because I refuse to believe that it can't be done.
Thank you for any suggestions!

Use stack on the last level and sort_index, then add rename_axis and reset_index to get flat columns:
df3 = (df3.stack()
          .sort_index(level=[1, 0])
          .rename_axis(['months', 'year'])
          .reset_index()
          .rename_axis(None, axis=1))
print(df3.head(15))
    months  year  price1  price2
0        1  2011      33     135
1        2  2011      22     136
2        3  2011      39     123
3        4  2011      45     140
4        5  2011      20     141
5        6  2011      29     122
6        7  2011      20     128
7        8  2011      39     125
8        9  2011      24     122
9       10  2011      24     142
10      11  2011      23     124
11      12  2011      27     147
12       1  2012      22     144
13       2  2012      26     127
14       3  2012      30     148
If you need a MultiIndex:
df3 = df3.stack().sort_index(level=[1, 0])
print(df3.head(15))
category  price1  price2
   year
1  2011       33     135
2  2011       22     136
3  2011       39     123
4  2011       45     140
5  2011       20     141
6  2011       29     122
7  2011       20     128
8  2011       39     125
9  2011       24     122
10 2011       24     142
11 2011       23     124
12 2011       27     147
1  2012       22     144
2  2012       26     127
3  2012       30     148
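Since the column levels are named, the stacked level can also be referenced by name, which reads more clearly. A small sketch, equivalent to the default level=-1 call above (starting again from the original wide df3; df3_long is an illustrative name):

# 'year' is the name of the innermost column level, so stacking it
# by name is equivalent to the default stack() used above
df3_long = df3.stack(level='year').sort_index(level=[1, 0])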

Related

Cannot read PDF Data into Sheets with Gspread-DataFrame

I want to read data from a PDF I downloaded, using Tabula, into Google Sheets, but when I transfer the data as it was read, I get an error. I know the data I downloaded is dirty, but I wanted to clean it up in Google Sheets.
Downloading data from the PDF (portion of the full code):
import tabula
import pandas as pd
file_path = 'TnPresidentbyCountyNov2016.pdf'
df = tabula.read_pdf(file_path, pages='all', multiple_tables='FALSE', stream='TRUE')
print (df)
[ Anderson 19,212 9,013 74 1,034 42 174 189 28 0 0.1
0 Bedford 11,486 3,395 25 306 8 47 75 5 0 0
1 Benton 4,716 1,474 12 83 13 11 14 2 0 0
2 Bledsoe 3,622 897 7 95 4 9 18 2 0 0
3 Blount 37,443 12,100 83 1,666 72 250 313 51 1 1
4 Bradley 29,768 7,070 66 1,098 44 143 210 29 1 1
5 Campbell 9,870 2,248 32 251 25 43 45 5 0 0
6 Cannon 4,007 1,127 8 106 7 18 29 3 0 0
7 Carroll 7,756 2,327 22 181 20 18 39 2 0 0
8 Carter 16,898 3,453 30 409 20 54 130 26 0 0
9 Cheatham 11,297 3,878 26 463 13 50 99 8 0 0
10 Chester 5,081 1,243 5 115 4 12 10 4 0 0
11 Claiborne 8,602 1,832 16 192 24 27 29 2 0 0
12 Clay 2,141 707 2 47 2 10 11 0 0 0
13 Cocke 9,791 1,981 21 211 19 27 59 2 0 2
14 Coffee 14,417 4,743 32 517 23 62 113 9 0 1
15 Crockett 3,982 1,303 7 76 3 8 13 1 0 0
16 Cumberland 20,413 5,202 37 471 26 53 99 17 0 1
17 Davidson 84,550 148,864 412 9,603 304 619 2,459 106 0 6
18 Decatur 3,588 894 5 70 4 8 16 2 0 0
19 DeKalb 5,171 1,569 10 117 6 29 49 0 0 0
20 Dickson 13,233 4,722 32 489 18 58 94 9 0 3
21 Dyer 10,180 2,816 19 193 13 27 48 3 0 0
22 Fayette 13,055 5,874 19 261 16 37 62 21 0 0
23 Fentress 6,038 1,100 10 107 14 11 37 1 0 0
24 Franklin 11,532 4,374 28 319 16 36 66 7 0 0
25 Gibson 13,786 5,258 26 305 18 36 66 8 0 0
26 Giles 7,970 2,917 16 162 11 11 41 1 0 0
27 Grainger 6,626 1,154 17 130 12 28 26 4 0 0
28 Greene 18,562 4,216 28 481 29 56 152 14 0 0
29 Grundy 3,636 999 11 80 3 13 19 0 0 0
30 Hamblen 15,857 4,075 30 443 27 73 93 8 0 0
31 Hamilton 78,733 55,316 147 5,443 138 349 1,098 121 0 0
32 Hancock 1,843 322 4 42 1 5 13 0 0 0
33 Hardeman 4,919 4,185 18 84 11 13 30 9 0 0
34 Hardin 8,012 1,622 15 134 22 48 96 0 0 0
35 Hawkins 16,648 3,507 31 397 12 52 91 7 0 3
36 Haywood 3,013 3,711 11 60 10 10 19 0 0 0
37 Henderson 8,138 1,800 13 172 9 27 39 1 0 0
38 Henry 9,508 3,063 18 223 15 27 60 4 0 0
39 Hickman 5,695 1,824 20 161 19 15 39 18 0 0
40 Houston 2,182 866 9 88 4 7 12 0 0 0
41 Humphreys 4,930 1,967 17 166 12 23 26 5 0 0
42 Jackson 3,236 1,129 2 62 1 7 17 1 0 0
43 Jefferson 14,776 3,494 34 497 22 76 115 8 0 1
44 Johnson 5,410 988 11 102 7 9 39 6 0 0
45 Knox 105,767 62,878 382 7,458 227 986 1,634 122 0 9
46 Lake 1,357 577 5 18 1 6 6 0 0 0, Lauderdale 4,884 3,056 14 87 13 10 14.1 \
0 Lawrence 12,420 2,821 21 271 13 36 77
1 Lewis 3,585 890 14 59 8 9 42
2 Lincoln 10,398 2,554 19 231 13 39 46
3 Loudon 17,610 4,919 41 573 22 77 87
Just a sample of the data I pulled. Again, not what I completely envisioned, but as a beginner coder I wanted to clean it up in Sheets.
HERE is an image of the PDF I was downloading data from.
Here is the link to download the PDF I am downloading data from.
Now I want to import gspread and gspread_dataframe to upload into a Google Sheet tab, and here is where I am having problems.
EDIT: Previously neither section included all of my code; the top and bottom portions now contain everything I have written so far.
from oauth2client.service_account import ServiceAccountCredentials
import json
import gspread
SHEET_ID = '18xad0TbNGMPh8gUSIsEr6wNsFzcpKGbyUIQ-A4GQ1bo'
SHEET_NAME = '2016'
gc = gspread.service_account('waynetennesseedems.json')
spreadsheet = gc.open_by_key(SHEET_ID)
worksheet = spreadsheet.worksheet(SHEET_NAME)
from gspread_dataframe import set_with_dataframe
set_with_dataframe(worksheet, df, include_column_header='False')
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
/var/folders/zc/x2w76_4121g3gzfxybkz2q480000gn/T/ipykernel_44678/2784595029.py in <module>
----> 1 set_with_dataframe(worksheet, df, include_column_header='False')
/opt/anaconda3/lib/python3.9/site-packages/gspread_dataframe.py in set_with_dataframe(worksheet, dataframe, row, col, include_index, include_column_header, resize, allow_formulas, string_escaping)
260 # If header-related params are True, the values are adjusted
261 # to allow space for the headers.
--> 262 y, x = dataframe.shape
263 index_col_size = 0
264 column_header_size = 0
AttributeError: 'list' object has no attribute 'shape'
Does it have to do with how my Data was pulled from my PDF?
It seems that df is a list. First, make sure you have installed the tabula-py module; second, try passing the parameter output_format='dataframe' to the tabula.read_pdf() function, like so:
import pandas as pd
import json
import gspread
from tabula.io import read_pdf
from oauth2client.service_account import ServiceAccountCredentials
from gspread_dataframe import set_with_dataframe
file_path = 'TnPresidentbyCountyNov2016.pdf'
df = read_pdf(file_path, output_format='dataframe', pages='all', multiple_tables=False, stream=True)  # booleans: the strings 'FALSE'/'TRUE' are truthy
# print (df)
SHEET_ID = '18xad0TbNGMPh8gUSIsEr6wNsFzcpKGbyUIQ-A4GQ1bo'
SHEET_NAME = '2016'
gc = gspread.service_account('waynetennesseedems.json')
spreadsheet = gc.open_by_key(SHEET_ID)
worksheet = spreadsheet.worksheet(SHEET_NAME)
set_with_dataframe(worksheet, df, include_column_header=False)  # boolean, not the string 'False'
Moreover, I suggest you take a look at the PEP 8 style guide to get a better idea of how to write a well-formatted script.
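Alternatively, if you prefer to keep read_pdf's default list output (one DataFrame per detected table), the list can be combined into a single frame before uploading. A sketch, assuming the per-page tables share compatible columns:

import pandas as pd
from tabula.io import read_pdf

# read_pdf returns a list of DataFrames by default (one per table)
tables = read_pdf('TnPresidentbyCountyNov2016.pdf', pages='all', stream=True)

# combine them into one frame; mismatched columns would be NaN-filled
df = pd.concat(tables, ignore_index=True)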

How to group and sum rows by ID and subtract from group of rows with same ID? [python]

I have the following dataframe:
ID_A ID_B ID_C Value ID_B_Value_Sum
-----------------------------------------------
0 22 5 1 54 208
1 23 5 2 34 208
2 24 6 1 44 268
3 25 6 1 64 268
4 26 5 2 35 208
5 27 7 3 45 229
6 28 7 2 66 229
7 29 8 1 76 161
8 30 8 2 25 161
9 31 6 2 27 268
10 32 5 3 14 208
11 33 5 3 17 208
12 34 6 2 43 268
13 35 6 2 53 268
14 36 8 1 22 161
15 37 7 3 65 229
16 38 7 1 53 229
17 39 8 2 23 161
18 40 8 3 15 161
19 41 6 3 37 268
20 42 5 2 54 208
Each row contains a unique "ID_A", while different rows can have the same "ID_B" and "ID_C". Each row corresponds to its own "Value", where this "Value" can be the same between different rows. The "ID_B_Value_Sum" column contains the sums of all values from the "Value" column for all rows containing the same "ID_B". Calculating this sum is straightforward with python and pandas.
What I want to do is, for each row, take the "ID_B_Value_Sum" value but subtract the "Value" entries of all other rows that share both the same "ID_B" and the same "ID_C", exclusive of the target row. For example, take "ID_B" = 6: the sum of all "Value" entries in this group is 268, as shown in the corresponding rows of the "ID_B_Value_Sum" column. Two rows in this group have "ID_C" = 1, three have "ID_C" = 2, and one has "ID_C" = 3. Starting with row 2, which has "ID_C" = 1, I take its "ID_B_Value_Sum" and subtract the "Value" entries of all other rows with both "ID_B" = 6 and "ID_C" = 1, so for row 2 I get 268 - 64 = 204. Likewise, for row 4 this means 208 - 34 - 54 = 120, and for row 7 it means 161 - 22 = 139. These new values go in a new "Value_Sum_New" column for each row.
And so I want to produce the following output dataframe:
ID_A ID_B ID_C Value ID_B_Value_Sum Value_Sum_New
---------------------------------------------------------------
0 22 5 1 54 208 XX
1 23 5 2 34 208 XX
2 24 6 1 44 268 204
3 25 6 1 64 268 XX
4 26 5 2 35 208 120
5 27 7 3 45 229 XX
6 28 7 2 66 229 XX
7 29 8 1 76 161 139
8 30 8 2 25 161 XX
9 31 6 2 27 268 XX
10 32 5 3 14 208 XX
11 33 5 3 17 208 XX
12 34 6 2 43 268 XX
13 35 6 2 53 268 XX
14 36 8 1 22 161 XX
15 37 7 3 65 229 XX
16 38 7 1 53 229 XX
17 39 8 2 23 161 XX
18 40 8 3 15 161 XX
19 41 6 3 37 268 XX
20 42 5 2 54 208 XX
What I am having trouble conceptualizing is how to, for each row, group all rows with the same "ID_B", then within that group sub-group the rows with the same "ID_C", and subtract the sub-group's sum from "ID_B_Value_Sum" while still keeping the target row's own "Value", to produce the final "Value_Sum_New". It feels like many nested actions and sub-actions, and I am unsure how to organize the workflow in a simple, streamlined way. How might I approach calculating this sum in python?
IIUC, you need:
df['Value_Sum_New'] = (df['ID_B_Value_Sum']
                       - df.groupby(['ID_B', 'ID_C'])['Value'].transform('sum')
                       + df['Value']
                       )
output:
ID_A ID_B ID_C Value ID_B_Value_Sum Value_Sum_New
0 22 5 1 54 208 208
1 23 5 2 34 208 119
2 24 6 1 44 268 204
3 25 6 1 64 268 224
4 26 5 2 35 208 120
5 27 7 3 45 229 164
6 28 7 2 66 229 229
7 29 8 1 76 161 139
8 30 8 2 25 161 138
9 31 6 2 27 268 172
10 32 5 3 14 208 191
11 33 5 3 17 208 194
12 34 6 2 43 268 188
13 35 6 2 53 268 198
14 36 8 1 22 161 85
15 37 7 3 65 229 184
16 38 7 1 53 229 229
17 39 8 2 23 161 136
18 40 8 3 15 161 161
19 41 6 3 37 268 268
20 42 5 2 54 208 139
Explanation:
As you said, computing a sum per group is easy in pandas. You can actually compute ID_B_Value_Sum with:
df['ID_B_Value_Sum'] = df.groupby('ID_B')['Value'].transform('sum')
Now we do the same for groups of ID_B + ID_C and subtract it from ID_B_Value_Sum; since we want to exclude only the other rows in each group, we add the row's own Value back.
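For reference, here is a small self-contained check of the idea on a toy frame (illustrative data, not the question's):

import pandas as pd

df = pd.DataFrame({'ID_B': [6, 6, 6, 5, 5],
                   'ID_C': [1, 1, 2, 2, 2],
                   'Value': [44, 64, 27, 34, 35]})

# total per ID_B group
df['ID_B_Value_Sum'] = df.groupby('ID_B')['Value'].transform('sum')

# subtract the (ID_B, ID_C) subgroup total, then add the row's own
# Value back so that only the *other* matching rows are excluded
df['Value_Sum_New'] = (df['ID_B_Value_Sum']
                       - df.groupby(['ID_B', 'ID_C'])['Value'].transform('sum')
                       + df['Value'])
print(df)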

How to plot multiple chart on one figure and combine with another?

# Create an axes object
axes = plt.gca()
# pass the axes object to the plot function
df.plot(kind='line', x='鄉鎮別', y='男', ax=axes, figsize=(10, 8))
df.plot(kind='line', x='鄉鎮別', y='女', ax=axes, figsize=(10, 8))
df.plot(kind='line', x='鄉鎮別', y='合計(男+女)', ax=axes, figsize=(10, 8),
        title='hihii', xlabel='鄉鎮別', ylabel='人數')
Here's my data:
鄉鎮別 鄰數 戶數 男 女 合計(男+女) 遷入 遷出 出生 死亡 結婚 離婚
0 苗栗市 715 32517 42956 43362 86318 212 458 33 65 28 13
1 苑裡鎮 362 15204 22979 21040 44019 118 154 17 24 9 7
2 通霄鎮 394 11557 17034 15178 32212 73 113 5 33 3 3
3 竹南鎮 518 32061 44069 43275 87344 410 392 31 59 35 11
4 頭份市 567 38231 52858 52089 104947 363 404 39 69 31 19
5 後龍鎮 367 12147 18244 16274 34518 93 144 12 41 2 7
6 卓蘭鎮 176 5861 8206 7504 15710 29 51 1 11 2 0
7 大湖鄉 180 5206 7142 6238 13380 31 59 5 21 3 2
8 公館鄉 281 10842 16486 15159 31645 89 169 12 32 5 3
9 銅鑼鄉 218 6106 8887 7890 16777 57 62 7 13 4 1
10 南庄鄉 184 3846 5066 4136 9202 22 48 1 10 0 2
11 頭屋鄉 120 3596 5289 4672 9961 59 53 2 11 4 4
12 三義鄉 161 5625 8097 7205 15302 47 63 3 12 3 5
13 西湖鄉 108 2617 3653 2866 6519 38 20 1 17 3 0
14 造橋鄉 115 4144 6276 5545 11821 44 64 3 11 3 2
15 三灣鄉 93 2331 3395 2832 6227 27 18 2 9 0 2
16 獅潭鄉 98 1723 2300 1851 4151 28 10 1 4 0 0
17 泰安鄉 64 1994 3085 2642 5727 36 26 2 8 4 1
18 總計 4721 195608 276022 259758 535780 1776 2308 177 450 139 82
This is the output of df.plot.
First question: how do I display the Chinese labels? (One common approach is sketched below.)
Second: can I plot a line chart without using df.plot?
Last question: there should be four graphs (using subplots): line graphs of the male, female, and total population (男、女、合計(男+女)) in each township; line graphs of in-migration and out-migration (遷入和遷出); a bar graph of household numbers (戶數); and line graphs of births and deaths (出生和死亡).
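For the first sub-question (displaying the Chinese labels), one common approach is to point matplotlib at a font that contains CJK glyphs. A minimal sketch, assuming a font such as 'Microsoft JhengHei' is installed; substitute any CJK-capable font available on your system:

import matplotlib.pyplot as plt

# use a CJK-capable font so labels like 鄉鎮別 render instead of empty boxes
# ('Microsoft JhengHei' is an assumption -- any installed CJK font works)
plt.rcParams['font.sans-serif'] = ['Microsoft JhengHei']
# keep the minus sign rendering correctly under a non-default font
plt.rcParams['axes.unicode_minus'] = False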

Is there a way to add condition to cumsum without "cutting" my table?

I found the code to calculate a YTD (year-to-date) value (basically a cumulative sum applied to a groupby on "year").
But now I want this cumulative sum only where the "Type" column is "Actual", not "Budget". I'd like either empty cells where Type = "Budget", or ideally 332 (the last YTD value) displayed for all rows where Type = "Budget".
Initial table :
Value Type year month
0 100 Actual 2018 1
1 50 Actual 2018 2
2 20 Actual 2018 3
3 123 Actual 2018 4
4 56 Actual 2018 5
5 76 Actual 2018 6
6 98 Actual 2018 7
7 126 Actual 2018 8
8 90 Actual 2018 9
9 80 Actual 2018 10
10 67 Actual 2018 11
11 87 Actual 2018 12
12 101 Actual 2019 1
13 98 Actual 2019 2
14 76 Actual 2019 3
15 57 Actual 2019 4
16 98 Budget 2019 5
17 109 Budget 2019 6
18 123 Budget 2019 7
19 67 Budget 2019 8
20 98 Budget 2019 9
21 67 Budget 2019 10
22 98 Budget 2019 11
23 123 Budget 2019 12
This is the code that produced my actual table
df['YTD'] = df.groupby('year')['Value'].cumsum()
Value Type year month YTD
0 100 Actual 2018 1 100
1 50 Actual 2018 2 150
2 20 Actual 2018 3 170
3 123 Actual 2018 4 293
4 56 Actual 2018 5 349
5 76 Actual 2018 6 425
6 98 Actual 2018 7 523
7 126 Actual 2018 8 649
8 90 Actual 2018 9 739
9 80 Actual 2018 10 819
10 67 Actual 2018 11 886
11 87 Actual 2018 12 973
12 101 Actual 2019 1 101
13 98 Actual 2019 2 199
14 76 Actual 2019 3 275
15 57 Actual 2019 4 332
16 98 Budget 2019 5 430
17 109 Budget 2019 6 539
18 123 Budget 2019 7 662
19 67 Budget 2019 8 729
20 98 Budget 2019 9 827
21 67 Budget 2019 10 894
22 98 Budget 2019 11 992
23 123 Budget 2019 12 1115
Desired table :
Value Type year month YTD
0 100 Actual 2018 1 100
1 50 Actual 2018 2 150
2 20 Actual 2018 3 170
3 123 Actual 2018 4 293
4 56 Actual 2018 5 349
5 76 Actual 2018 6 425
6 98 Actual 2018 7 523
7 126 Actual 2018 8 649
8 90 Actual 2018 9 739
9 80 Actual 2018 10 819
10 67 Actual 2018 11 886
11 87 Actual 2018 12 973
12 101 Actual 2019 1 101
13 98 Actual 2019 2 199
14 76 Actual 2019 3 275
15 57 Actual 2019 4 332
16 98 Budget 2019 5 332
17 109 Budget 2019 6 332
18 123 Budget 2019 7 332
19 67 Budget 2019 8 332
20 98 Budget 2019 9 332
21 67 Budget 2019 10 332
22 98 Budget 2019 11 332
23 123 Budget 2019 12 332
A solution I found was simply to filter on a condition (where Type = "Actual"), but then the whole table isn't displayed, whereas I need to display it entirely...
Do you have an idea for overcoming this partial-selection problem?
Thank you
Alex
With DataFrame.loc we select only the rows where Type = "Actual" and assign the cumsum to the new YTD column.
Then we fill the resulting NaN gaps with GroupBy.ffill:
m = df['Type'].eq('Actual')
df.loc[m, 'YTD'] = df.loc[m].groupby('year')['Value'].cumsum()
df['YTD'] = df.groupby('year')['YTD'].ffill()
Value Type year month YTD
0 100 Actual 2018 1 100.0
1 50 Actual 2018 2 150.0
2 20 Actual 2018 3 170.0
3 123 Actual 2018 4 293.0
4 56 Actual 2018 5 349.0
5 76 Actual 2018 6 425.0
6 98 Actual 2018 7 523.0
7 126 Actual 2018 8 649.0
8 90 Actual 2018 9 739.0
9 80 Actual 2018 10 819.0
10 67 Actual 2018 11 886.0
11 87 Actual 2018 12 973.0
12 101 Actual 2019 1 101.0
13 98 Actual 2019 2 199.0
14 76 Actual 2019 3 275.0
15 57 Actual 2019 4 332.0
16 98 Budget 2019 5 332.0
17 109 Budget 2019 6 332.0
18 123 Budget 2019 7 332.0
19 67 Budget 2019 8 332.0
20 98 Budget 2019 9 332.0
21 67 Budget 2019 10 332.0
22 98 Budget 2019 11 332.0
23 123 Budget 2019 12 332.0
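One side note: the intermediate NaNs make YTD a float column. If integers are preferred, a cast afterwards restores them; a small sketch, safe here because ffill leaves no NaN behind:

# cast back to integer once every row has been filled
df['YTD'] = df['YTD'].astype(int)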

performing differences between rows in pandas based on column values

I have this dataframe, and I'm trying to create a new column storing the difference in products sold based on code and date.
For example, this is the starting dataframe:
date code sold
0 20150521 0 47
1 20150521 12 39
2 20150521 16 39
3 20150521 20 38
4 20150521 24 38
5 20150521 28 37
6 20150521 32 36
7 20150521 4 43
8 20150521 8 43
9 20150522 0 47
10 20150522 12 37
11 20150522 16 36
12 20150522 20 36
13 20150522 24 36
14 20150522 28 35
15 20150522 32 31
16 20150522 4 42
17 20150522 8 41
18 20150523 0 50
19 20150523 12 48
20 20150523 16 46
21 20150523 20 46
22 20150523 24 46
23 20150523 28 45
24 20150523 32 42
25 20150523 4 49
26 20150523 8 49
27 20150524 0 39
28 20150524 12 33
29 20150524 16 30
... ... ... ...
150 20150606 32 22
151 20150606 4 34
152 20150606 8 33
153 20150607 0 31
154 20150607 12 30
155 20150607 16 30
156 20150607 20 29
157 20150607 24 28
158 20150607 28 26
159 20150607 32 24
160 20150607 4 30
161 20150607 8 30
162 20150608 0 47
I think this could be a solution...
full_df1 = full_df[full_df.date == '20150609'].reset_index(drop=True)
full_df1['code'] = full_df1['code'].astype(float)
full_df1 = full_df1.sort_values('code', ascending=False)
code date sold
8 32 20150609 33
7 28 20150609 36
6 24 20150609 37
5 20 20150609 39
4 16 20150609 42
3 12 20150609 46
2 8 20150609 49
1 4 20150609 49
0 0 20150609 50
full_df1.set_index('code')['sold'].diff().reset_index()
that gives me back this output for a single date 20150609 :
code difference
0 32 NaN
1 28 3
2 24 1
3 20 2
4 16 3
5 12 4
6 8 3
7 4 0
8 0 1
Is there a better, more pythonic way to get the same result?
I would like to create a new column [difference] and store the data there, ending up with 4 columns [date, code, sold, difference].
This is exactly the kind of thing that pandas' groupby functionality is built for, and I highly recommend reading and working through the pandas groupby documentation.
This code replicates what you are asking for, but for every date.
df = pd.DataFrame({'date':['Mon','Mon','Mon','Tue','Tue','Tue'],'code':[10,21,30,10,21,30], 'sold':[12,13,34,10,15,20]})
df['difference'] = df.groupby('date')['sold'].diff()
df
code date sold difference
0 10 Mon 12 NaN
1 21 Mon 13 1
2 30 Mon 34 21
3 10 Tue 10 NaN
4 21 Tue 15 5
5 30 Tue 20 5
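Applied to the question's frame, the only extra step is ordering each date's rows by descending code first, so the per-date diff matches the manual single-date example above. A sketch, assuming full_df holds the original data:

# if code is stored as strings, cast it first as in the question:
# full_df['code'] = full_df['code'].astype(int)
full_df = full_df.sort_values(['date', 'code'], ascending=[True, False])
# per-date difference between consecutive (descending) codes
full_df['difference'] = full_df.groupby('date')['sold'].diff()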
