I want to read data from a downloaded PDF into Google Sheets using Tabula, but when I transfer the data as it was read into Google Sheets, I get an error. I know the data I downloaded is dirty, but I wanted to clean it up in Google Sheets.
Downloading data from the PDF (portion of my full code)
import tabula
import pandas as pd
file_path = 'TnPresidentbyCountyNov2016.pdf'
df = tabula.read_pdf(file_path, pages='all', multiple_tables='FALSE', stream='TRUE')
print (df)
[ Anderson 19,212 9,013 74 1,034 42 174 189 28 0 0.1
0 Bedford 11,486 3,395 25 306 8 47 75 5 0 0
1 Benton 4,716 1,474 12 83 13 11 14 2 0 0
2 Bledsoe 3,622 897 7 95 4 9 18 2 0 0
3 Blount 37,443 12,100 83 1,666 72 250 313 51 1 1
4 Bradley 29,768 7,070 66 1,098 44 143 210 29 1 1
5 Campbell 9,870 2,248 32 251 25 43 45 5 0 0
6 Cannon 4,007 1,127 8 106 7 18 29 3 0 0
7 Carroll 7,756 2,327 22 181 20 18 39 2 0 0
8 Carter 16,898 3,453 30 409 20 54 130 26 0 0
9 Cheatham 11,297 3,878 26 463 13 50 99 8 0 0
10 Chester 5,081 1,243 5 115 4 12 10 4 0 0
11 Claiborne 8,602 1,832 16 192 24 27 29 2 0 0
12 Clay 2,141 707 2 47 2 10 11 0 0 0
13 Cocke 9,791 1,981 21 211 19 27 59 2 0 2
14 Coffee 14,417 4,743 32 517 23 62 113 9 0 1
15 Crockett 3,982 1,303 7 76 3 8 13 1 0 0
16 Cumberland 20,413 5,202 37 471 26 53 99 17 0 1
17 Davidson 84,550 148,864 412 9,603 304 619 2,459 106 0 6
18 Decatur 3,588 894 5 70 4 8 16 2 0 0
19 DeKalb 5,171 1,569 10 117 6 29 49 0 0 0
20 Dickson 13,233 4,722 32 489 18 58 94 9 0 3
21 Dyer 10,180 2,816 19 193 13 27 48 3 0 0
22 Fayette 13,055 5,874 19 261 16 37 62 21 0 0
23 Fentress 6,038 1,100 10 107 14 11 37 1 0 0
24 Franklin 11,532 4,374 28 319 16 36 66 7 0 0
25 Gibson 13,786 5,258 26 305 18 36 66 8 0 0
26 Giles 7,970 2,917 16 162 11 11 41 1 0 0
27 Grainger 6,626 1,154 17 130 12 28 26 4 0 0
28 Greene 18,562 4,216 28 481 29 56 152 14 0 0
29 Grundy 3,636 999 11 80 3 13 19 0 0 0
30 Hamblen 15,857 4,075 30 443 27 73 93 8 0 0
31 Hamilton 78,733 55,316 147 5,443 138 349 1,098 121 0 0
32 Hancock 1,843 322 4 42 1 5 13 0 0 0
33 Hardeman 4,919 4,185 18 84 11 13 30 9 0 0
34 Hardin 8,012 1,622 15 134 22 48 96 0 0 0
35 Hawkins 16,648 3,507 31 397 12 52 91 7 0 3
36 Haywood 3,013 3,711 11 60 10 10 19 0 0 0
37 Henderson 8,138 1,800 13 172 9 27 39 1 0 0
38 Henry 9,508 3,063 18 223 15 27 60 4 0 0
39 Hickman 5,695 1,824 20 161 19 15 39 18 0 0
40 Houston 2,182 866 9 88 4 7 12 0 0 0
41 Humphreys 4,930 1,967 17 166 12 23 26 5 0 0
42 Jackson 3,236 1,129 2 62 1 7 17 1 0 0
43 Jefferson 14,776 3,494 34 497 22 76 115 8 0 1
44 Johnson 5,410 988 11 102 7 9 39 6 0 0
45 Knox 105,767 62,878 382 7,458 227 986 1,634 122 0 9
46 Lake 1,357 577 5 18 1 6 6 0 0 0, Lauderdale 4,884 3,056 14 87 13 10 14.1 \
0 Lawrence 12,420 2,821 21 271 13 36 77
1 Lewis 3,585 890 14 59 8 9 42
2 Lincoln 10,398 2,554 19 231 13 39 46
3 Loudon 17,610 4,919 41 573 22 77 87
This is just a sample of the data I pulled. Again, it is not exactly what I envisioned, but as a beginner coder I wanted to clean it up in Sheets.
HERE is an image of the PDF I was downloading data from.
Here is the link to download the PDF I am downloading data from.
Now I want to import gspread and gspread_dataframe to upload the data into a Google Sheets tab, and this is where I am having problems.
EDIT: Previously neither section included all of my code; the top and bottom portions now include all of the code I have written so far.
from oauth2client.service_account import ServiceAccountCredentials
import json
import gspread
SHEET_ID = '18xad0TbNGMPh8gUSIsEr6wNsFzcpKGbyUIQ-A4GQ1bo'
SHEET_NAME = '2016'
gc = gspread.service_account('waynetennesseedems.json')
spreadsheet = gc.open_by_key(SHEET_ID)
worksheet = spreadsheet.worksheet(SHEET_NAME)
from gspread_dataframe import set_with_dataframe
set_with_dataframe(worksheet, df, include_column_header='False')
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
/var/folders/zc/x2w76_4121g3gzfxybkz2q480000gn/T/ipykernel_44678/2784595029.py in <module>
----> 1 set_with_dataframe(worksheet, df, include_column_header='False')
/opt/anaconda3/lib/python3.9/site-packages/gspread_dataframe.py in set_with_dataframe(worksheet, dataframe, row, col, include_index, include_column_header, resize, allow_formulas, string_escaping)
260 # If header-related params are True, the values are adjusted
261 # to allow space for the headers.
--> 262 y, x = dataframe.shape
263 index_col_size = 0
264 column_header_size = 0
AttributeError: 'list' object has no attribute 'shape'
Does it have to do with how my data was pulled from my PDF?
It seems that df is a list. First, make sure you have installed the tabula-py module; second, try passing the parameter output_format='dataframe' to the tabula.read_pdf() function, like so:
import pandas as pd
import json
import gspread
from tabula.io import read_pdf
from oauth2client.service_account import ServiceAccountCredentials
from gspread_dataframe import set_with_dataframe
file_path = 'TnPresidentbyCountyNov2016.pdf'
df = read_pdf(file_path, output_format='dataframe', pages='all', multiple_tables=False, stream=True)  # pass booleans, not the strings 'FALSE'/'TRUE'
# print (df)
SHEET_ID = '18xad0TbNGMPh8gUSIsEr6wNsFzcpKGbyUIQ-A4GQ1bo'
SHEET_NAME = '2016'
gc = gspread.service_account('waynetennesseedems.json')
spreadsheet = gc.open_by_key(SHEET_ID)
worksheet = spreadsheet.worksheet(SHEET_NAME)
set_with_dataframe(worksheet, df, include_column_header=False)  # boolean False, not the string 'False'
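If your version of tabula-py still hands back a list of DataFrames at this point (the return type has varied between versions and parameter combinations), a minimal fallback sketch is to concatenate the pieces before uploading, assuming the extracted tables share the same columns:
dfs = read_pdf(file_path, pages='all', stream=True)  # may return a list of DataFrames
df = pd.concat(dfs, ignore_index=True)               # stitch them into one DataFrame with a fresh index
set_with_dataframe(worksheet, df, include_column_header=False)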
Moreover, I suggest you take a look at the PEP 8 style guide to get a better idea of how to write a well-formatted script.
I have the following dataframe:
ID_A ID_B ID_C Value ID_B_Value_Sum
-----------------------------------------------
0 22 5 1 54 208
1 23 5 2 34 208
2 24 6 1 44 268
3 25 6 1 64 268
4 26 5 2 35 208
5 27 7 3 45 229
6 28 7 2 66 229
7 29 8 1 76 161
8 30 8 2 25 161
9 31 6 2 27 268
10 32 5 3 14 208
11 33 5 3 17 208
12 34 6 2 43 268
13 35 6 2 53 268
14 36 8 1 22 161
15 37 7 3 65 229
16 38 7 1 53 229
17 39 8 2 23 161
18 40 8 3 15 161
19 41 6 3 37 268
20 42 5 2 54 208
Each row contains a unique "ID_A", while different rows can have the same "ID_B" and "ID_C". Each row corresponds to its own "Value", where this "Value" can be the same between different rows. The "ID_B_Value_Sum" column contains the sums of all values from the "Value" column for all rows containing the same "ID_B". Calculating this sum is straightforward with python and pandas.
What I want to do is, for each row, take the "ID_B_Value_Sum" value but subtract all values corresponding to rows containing the same "ID_C", exclusive of the target row. For example, taking "ID_B" = 6, the sum of all the "Value" values for this group is 268, as shown in the corresponding rows of the "ID_B_Value_Sum" column. Now, two of the rows in this group contain "ID_C" = 1, three rows contain "ID_C" = 2, and one row contains "ID_C" = 3. Starting with row 2, with "ID_C" = 1, this means taking the corresponding "ID_B_Value_Sum" value and subtracting the "Value" values from all other rows containing both "ID_B" = 6 and "ID_C" = 1, exclusive of the target row; so for row 2 I take 268 - 64 = 204. For another example, row 4 (in the "ID_B" = 5, "ID_C" = 2 group) gives 208 - 34 - 54 = 120. And for row 7 (in the "ID_B" = 8, "ID_C" = 1 group), 161 - 22 = 139. These new values will go in a new "Value_Sum_New" column for each row.
And so I want to produce the following output dataframe:
ID_A ID_B ID_C Value ID_B_Value_Sum Value_Sum_New
---------------------------------------------------------------
0 22 5 1 54 208 XX
1 23 5 2 34 208 XX
2 24 6 1 44 268 204
3 25 6 1 64 268 XX
4 26 5 2 35 208 120
5 27 7 3 45 229 XX
6 28 7 2 66 229 XX
7 29 8 1 76 161 139
8 30 8 2 25 161 XX
9 31 6 2 27 268 XX
10 32 5 3 14 208 XX
11 33 5 3 17 208 XX
12 34 6 2 43 268 XX
13 35 6 2 53 268 XX
14 36 8 1 22 161 XX
15 37 7 3 65 229 XX
16 38 7 1 53 229 XX
17 39 8 2 23 161 XX
18 40 8 3 15 161 XX
19 41 6 3 37 268 XX
20 42 5 2 54 208 XX
What I am having trouble conceptualizing is how to, for each row, group together all rows with the same "ID_B", then sub-group the rows with the same "ID_C", and subtract that sub-group's sum from the target row's "ID_B_Value_Sum" while still keeping the target row's own "Value", to produce the final "Value_Sum_New". It feels like many actions and sub-actions, and I am unsure how to organize the workflow in a simple, streamlined way. How might I approach calculating this sum in python?
IIUC, you need:
df['Value_Sum_New'] = (df['ID_B_Value_Sum']
- df.groupby(['ID_B', 'ID_C'])['Value'].transform('sum')
+ df['Value']
)
output:
ID_A ID_B ID_C Value ID_B_Value_Sum Value_Sum_New
0 22 5 1 54 208 208
1 23 5 2 34 208 119
2 24 6 1 44 268 204
3 25 6 1 64 268 224
4 26 5 2 35 208 120
5 27 7 3 45 229 164
6 28 7 2 66 229 229
7 29 8 1 76 161 139
8 30 8 2 25 161 138
9 31 6 2 27 268 172
10 32 5 3 14 208 191
11 33 5 3 17 208 194
12 34 6 2 43 268 188
13 35 6 2 53 268 198
14 36 8 1 22 161 85
15 37 7 3 65 229 184
16 38 7 1 53 229 229
17 39 8 2 23 161 136
18 40 8 3 15 161 161
19 41 6 3 37 268 268
20 42 5 2 54 208 139
Explanation:
As you said, computing a sum per group is easy in pandas. You can actually compute ID_B_Value_Sum with:
df['ID_B_Value_Sum'] = df.groupby('ID_B')['Value'].transform('sum')
Now we do the same for groups of ID_B + ID_C and subtract it from ID_B_Value_Sum; since we only want to exclude the other rows in the group, we add the row's own Value back.
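Putting the two transforms together, here is a minimal self-contained sketch; it uses only the first five rows of the question's table, so the group sums are smaller than in the full example, but the logic is the same:
import pandas as pd

# First five rows of the question's table
df = pd.DataFrame({
    'ID_A': [22, 23, 24, 25, 26],
    'ID_B': [5, 5, 6, 6, 5],
    'ID_C': [1, 2, 1, 1, 2],
    'Value': [54, 34, 44, 64, 35],
})

# Sum of Value over all rows sharing the same ID_B
df['ID_B_Value_Sum'] = df.groupby('ID_B')['Value'].transform('sum')

# Subtract the (ID_B, ID_C) group sum, then add the row's own Value back
df['Value_Sum_New'] = (df['ID_B_Value_Sum']
                       - df.groupby(['ID_B', 'ID_C'])['Value'].transform('sum')
                       + df['Value'])
print(df)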
import matplotlib.pyplot as plt
import pandas as pd

# Create an axes object
axes = plt.gca()
# Pass the axes object to each plot call so all lines share one chart
df.plot(kind='line', x='鄉鎮別', y='男', ax=axes, figsize=(10, 8))
df.plot(kind='line', x='鄉鎮別', y='女', ax=axes, figsize=(10, 8))
df.plot(kind='line', x='鄉鎮別', y='合計(男+女)', ax=axes, figsize=(10, 8),
        title='hihii', xlabel='鄉鎮別', ylabel='人數')
This is my data:
鄉鎮別 鄰數 戶數 男 女 合計(男+女) 遷入 遷出 出生 死亡 結婚 離婚
0 苗栗市 715 32517 42956 43362 86318 212 458 33 65 28 13
1 苑裡鎮 362 15204 22979 21040 44019 118 154 17 24 9 7
2 通霄鎮 394 11557 17034 15178 32212 73 113 5 33 3 3
3 竹南鎮 518 32061 44069 43275 87344 410 392 31 59 35 11
4 頭份市 567 38231 52858 52089 104947 363 404 39 69 31 19
5 後龍鎮 367 12147 18244 16274 34518 93 144 12 41 2 7
6 卓蘭鎮 176 5861 8206 7504 15710 29 51 1 11 2 0
7 大湖鄉 180 5206 7142 6238 13380 31 59 5 21 3 2
8 公館鄉 281 10842 16486 15159 31645 89 169 12 32 5 3
9 銅鑼鄉 218 6106 8887 7890 16777 57 62 7 13 4 1
10 南庄鄉 184 3846 5066 4136 9202 22 48 1 10 0 2
11 頭屋鄉 120 3596 5289 4672 9961 59 53 2 11 4 4
12 三義鄉 161 5625 8097 7205 15302 47 63 3 12 3 5
13 西湖鄉 108 2617 3653 2866 6519 38 20 1 17 3 0
14 造橋鄉 115 4144 6276 5545 11821 44 64 3 11 3 2
15 三灣鄉 93 2331 3395 2832 6227 27 18 2 9 0 2
16 獅潭鄉 98 1723 2300 1851 4151 28 10 1 4 0 0
17 泰安鄉 64 1994 3085 2642 5727 36 26 2 8 4 1
18 總計 4721 195608 276022 259758 535780 1776 2308 177 450 139 82
This is my output from df.plot.
My first question is: how do I display the Chinese characters correctly?
Second: can I plot the line chart without using df.plot?
Last question: I need four graphs (using subplots): line graphs of the male, female, and total population (男、女、合計(男+女)) in each township; line graphs of in-migration and out-migration (遷入和遷出); a bar graph of household number (戶數); and line graphs of births and deaths (出生和死亡).
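Here is a minimal sketch covering all three points. It assumes df is the table above and that a CJK-capable font such as 'Microsoft JhengHei' or 'Noto Sans CJK TC' is installed; the exact font name depends on your system:
import matplotlib.pyplot as plt
import pandas as pd

# Tell matplotlib to use a font that contains CJK glyphs (assumed to be installed)
plt.rcParams['font.sans-serif'] = ['Microsoft JhengHei', 'Noto Sans CJK TC']
plt.rcParams['axes.unicode_minus'] = False  # keep the minus sign rendering correctly

# Plotting without df.plot: call the Axes plotting methods directly
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Line graphs of male, female, and total population per township
axes[0, 0].plot(df['鄉鎮別'], df['男'], label='男')
axes[0, 0].plot(df['鄉鎮別'], df['女'], label='女')
axes[0, 0].plot(df['鄉鎮別'], df['合計(男+女)'], label='合計(男+女)')
axes[0, 0].legend()

# Line graphs of in-migration and out-migration
axes[0, 1].plot(df['鄉鎮別'], df['遷入'], label='遷入')
axes[0, 1].plot(df['鄉鎮別'], df['遷出'], label='遷出')
axes[0, 1].legend()

# Bar graph of household number
axes[1, 0].bar(df['鄉鎮別'], df['戶數'])
axes[1, 0].set_ylabel('戶數')

# Line graphs of births and deaths
axes[1, 1].plot(df['鄉鎮別'], df['出生'], label='出生')
axes[1, 1].plot(df['鄉鎮別'], df['死亡'], label='死亡')
axes[1, 1].legend()

# Township names are long, so rotate the x tick labels on every panel
for ax in axes.flat:
    ax.tick_params(axis='x', labelrotation=90)

plt.tight_layout()
plt.show()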
I found the code to calculate a YTD (year-to-date) value (basically a cumulative sum applied within a groupby on "year").
But now I want this cumulative sum only when the "Type" column is "Actual" and not "Budget". I'd like to have either empty cells where Type = "Budget", or ideally, I'd like to have 332 (the last value of the YTD) displayed for all the rows where Type = "Budget".
Initial table :
Value Type year month
0 100 Actual 2018 1
1 50 Actual 2018 2
2 20 Actual 2018 3
3 123 Actual 2018 4
4 56 Actual 2018 5
5 76 Actual 2018 6
6 98 Actual 2018 7
7 126 Actual 2018 8
8 90 Actual 2018 9
9 80 Actual 2018 10
10 67 Actual 2018 11
11 87 Actual 2018 12
12 101 Actual 2019 1
13 98 Actual 2019 2
14 76 Actual 2019 3
15 57 Actual 2019 4
16 98 Budget 2019 5
17 109 Budget 2019 6
18 123 Budget 2019 7
19 67 Budget 2019 8
20 98 Budget 2019 9
21 67 Budget 2019 10
22 98 Budget 2019 11
23 123 Budget 2019 12
This is the code that produced my current table:
df['YTD'] = df.groupby('year')['Value'].cumsum()
Value Type year month YTD
0 100 Actual 2018 1 100
1 50 Actual 2018 2 150
2 20 Actual 2018 3 170
3 123 Actual 2018 4 293
4 56 Actual 2018 5 349
5 76 Actual 2018 6 425
6 98 Actual 2018 7 523
7 126 Actual 2018 8 649
8 90 Actual 2018 9 739
9 80 Actual 2018 10 819
10 67 Actual 2018 11 886
11 87 Actual 2018 12 973
12 101 Actual 2019 1 101
13 98 Actual 2019 2 199
14 76 Actual 2019 3 275
15 57 Actual 2019 4 332
16 98 Budget 2019 5 430
17 109 Budget 2019 6 539
18 123 Budget 2019 7 662
19 67 Budget 2019 8 729
20 98 Budget 2019 9 827
21 67 Budget 2019 10 894
22 98 Budget 2019 11 992
23 123 Budget 2019 12 1115
Desired table :
Value Type year month YTD
0 100 Actual 2018 1 100
1 50 Actual 2018 2 150
2 20 Actual 2018 3 170
3 123 Actual 2018 4 293
4 56 Actual 2018 5 349
5 76 Actual 2018 6 425
6 98 Actual 2018 7 523
7 126 Actual 2018 8 649
8 90 Actual 2018 9 739
9 80 Actual 2018 10 819
10 67 Actual 2018 11 886
11 87 Actual 2018 12 973
12 101 Actual 2019 1 101
13 98 Actual 2019 2 199
14 76 Actual 2019 3 275
15 57 Actual 2019 4 332
16 98 Budget 2019 5 332
17 109 Budget 2019 6 332
18 123 Budget 2019 7 332
19 67 Budget 2019 8 332
20 98 Budget 2019 9 332
21 67 Budget 2019 10 332
22 98 Budget 2019 11 332
23 123 Budget 2019 12 332
A solution I found was simply to filter on a condition (where Type = "Actual"), but then the whole table isn't displayed, whereas I need to display it entirely...
Do you have an idea to overcome this partial-selection problem?
Thank you
Alex
With DataFrame.loc we select only the rows where Type = "Actual" and assign our cumsum to the new YTD column.
Then we fill the resulting NaN gaps with GroupBy.ffill:
# Mask of the rows whose cumulative sum we want to compute
m = df['Type'].eq('Actual')
# Cumulative sum per year, computed on the Actual rows only
df.loc[m, 'YTD'] = df.loc[m].groupby('year')['Value'].cumsum()
# Forward-fill within each year so Budget rows carry the last Actual YTD
df['YTD'] = df.groupby('year')['YTD'].ffill()
Value Type year month YTD
0 100 Actual 2018 1 100.0
1 50 Actual 2018 2 150.0
2 20 Actual 2018 3 170.0
3 123 Actual 2018 4 293.0
4 56 Actual 2018 5 349.0
5 76 Actual 2018 6 425.0
6 98 Actual 2018 7 523.0
7 126 Actual 2018 8 649.0
8 90 Actual 2018 9 739.0
9 80 Actual 2018 10 819.0
10 67 Actual 2018 11 886.0
11 87 Actual 2018 12 973.0
12 101 Actual 2019 1 101.0
13 98 Actual 2019 2 199.0
14 76 Actual 2019 3 275.0
15 57 Actual 2019 4 332.0
16 98 Budget 2019 5 332.0
17 109 Budget 2019 6 332.0
18 123 Budget 2019 7 332.0
19 67 Budget 2019 8 332.0
20 98 Budget 2019 9 332.0
21 67 Budget 2019 10 332.0
22 98 Budget 2019 11 332.0
23 123 Budget 2019 12 332.0
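If you prefer the other option mentioned in the question, empty cells for the Budget rows instead of carrying 332 forward, a minimal sketch is to simply skip the ffill step; the Budget rows then stay NaN, which appear as blanks when exported:
m = df['Type'].eq('Actual')
df['YTD'] = df.loc[m].groupby('year')['Value'].cumsum()  # non-Actual rows are left as NaN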
I have this dataframe. I'm trying to create a new column to store the difference in products sold, based on code and date.
For example, this is the starting dataframe:
date code sold
0 20150521 0 47
1 20150521 12 39
2 20150521 16 39
3 20150521 20 38
4 20150521 24 38
5 20150521 28 37
6 20150521 32 36
7 20150521 4 43
8 20150521 8 43
9 20150522 0 47
10 20150522 12 37
11 20150522 16 36
12 20150522 20 36
13 20150522 24 36
14 20150522 28 35
15 20150522 32 31
16 20150522 4 42
17 20150522 8 41
18 20150523 0 50
19 20150523 12 48
20 20150523 16 46
21 20150523 20 46
22 20150523 24 46
23 20150523 28 45
24 20150523 32 42
25 20150523 4 49
26 20150523 8 49
27 20150524 0 39
28 20150524 12 33
29 20150524 16 30
... ... ... ...
150 20150606 32 22
151 20150606 4 34
152 20150606 8 33
153 20150607 0 31
154 20150607 12 30
155 20150607 16 30
156 20150607 20 29
157 20150607 24 28
158 20150607 28 26
159 20150607 32 24
160 20150607 4 30
161 20150607 8 30
162 20150608 0 47
I think this could be a solution...
full_df1 = full_df[full_df.date == '20150609'].reset_index(drop=True)
full_df1['code'] = full_df1['code'].astype(float)
full_df1 = full_df1.sort_values('code', ascending=False)  # DataFrame.sort() has been removed; use sort_values
code date sold
8 32 20150609 33
7 28 20150609 36
6 24 20150609 37
5 20 20150609 39
4 16 20150609 42
3 12 20150609 46
2 8 20150609 49
1 4 20150609 49
0 0 20150609 50
full_df1.set_index('code')['sold'].diff().reset_index()
That gives me back this output for a single date, 20150609:
code difference
0 32 NaN
1 28 3
2 24 1
3 20 2
4 16 3
5 12 4
6 8 3
7 4 0
8 0 1
Is there a better solution that achieves the same result in a more Pythonic way? I would like to create a new column, difference, and store the data there, so the result has four columns: date, code, sold, difference.
This is exactly the kind of thing that pandas' groupby functionality is built for, and I highly recommend reading and working through the pandas groupby documentation.
This code replicates what you are asking for, but for every date.
df = pd.DataFrame({'date': ['Mon', 'Mon', 'Mon', 'Tue', 'Tue', 'Tue'],
                   'code': [10, 21, 30, 10, 21, 30],
                   'sold': [12, 13, 34, 10, 15, 20]})
df['difference'] = df.groupby('date')['sold'].diff()
df
code date sold difference
0 10 Mon 12 NaN
1 21 Mon 13 1
2 30 Mon 34 21
3 10 Tue 10 NaN
4 21 Tue 15 5
5 30 Tue 20 5
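Applied to the question's data, one detail to settle first is the row order within each date: the question sorted by code (descending) before taking the diff. A sketch of that, assuming the question's full_df:
# Sort within each date by code, descending, to mirror the order used for 20150609
full_df['code'] = full_df['code'].astype(float)
full_df = full_df.sort_values(['date', 'code'], ascending=[True, False])
# Per-date difference of 'sold'; the first row of each date gets NaN
full_df['difference'] = full_df.groupby('date')['sold'].diff()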