Cannot read PDF Data into Sheets with Gspread-DataFrame - python
I want to load data that I extracted from a PDF with Tabula into Google Sheets, but when I transfer the data as it was read, I get an error. I know the extracted data is dirty, but I wanted to clean it up in Google Sheets.
Extracting the data from the PDF (first portion of the full code):
import tabula
import pandas as pd
file_path = 'TnPresidentbyCountyNov2016.pdf'
df = tabula.read_pdf(file_path, pages='all', multiple_tables='FALSE', stream='TRUE')
print (df)
[ Anderson 19,212 9,013 74 1,034 42 174 189 28 0 0.1
0 Bedford 11,486 3,395 25 306 8 47 75 5 0 0
1 Benton 4,716 1,474 12 83 13 11 14 2 0 0
2 Bledsoe 3,622 897 7 95 4 9 18 2 0 0
3 Blount 37,443 12,100 83 1,666 72 250 313 51 1 1
4 Bradley 29,768 7,070 66 1,098 44 143 210 29 1 1
5 Campbell 9,870 2,248 32 251 25 43 45 5 0 0
6 Cannon 4,007 1,127 8 106 7 18 29 3 0 0
7 Carroll 7,756 2,327 22 181 20 18 39 2 0 0
8 Carter 16,898 3,453 30 409 20 54 130 26 0 0
9 Cheatham 11,297 3,878 26 463 13 50 99 8 0 0
10 Chester 5,081 1,243 5 115 4 12 10 4 0 0
11 Claiborne 8,602 1,832 16 192 24 27 29 2 0 0
12 Clay 2,141 707 2 47 2 10 11 0 0 0
13 Cocke 9,791 1,981 21 211 19 27 59 2 0 2
14 Coffee 14,417 4,743 32 517 23 62 113 9 0 1
15 Crockett 3,982 1,303 7 76 3 8 13 1 0 0
16 Cumberland 20,413 5,202 37 471 26 53 99 17 0 1
17 Davidson 84,550 148,864 412 9,603 304 619 2,459 106 0 6
18 Decatur 3,588 894 5 70 4 8 16 2 0 0
19 DeKalb 5,171 1,569 10 117 6 29 49 0 0 0
20 Dickson 13,233 4,722 32 489 18 58 94 9 0 3
21 Dyer 10,180 2,816 19 193 13 27 48 3 0 0
22 Fayette 13,055 5,874 19 261 16 37 62 21 0 0
23 Fentress 6,038 1,100 10 107 14 11 37 1 0 0
24 Franklin 11,532 4,374 28 319 16 36 66 7 0 0
25 Gibson 13,786 5,258 26 305 18 36 66 8 0 0
26 Giles 7,970 2,917 16 162 11 11 41 1 0 0
27 Grainger 6,626 1,154 17 130 12 28 26 4 0 0
28 Greene 18,562 4,216 28 481 29 56 152 14 0 0
29 Grundy 3,636 999 11 80 3 13 19 0 0 0
30 Hamblen 15,857 4,075 30 443 27 73 93 8 0 0
31 Hamilton 78,733 55,316 147 5,443 138 349 1,098 121 0 0
32 Hancock 1,843 322 4 42 1 5 13 0 0 0
33 Hardeman 4,919 4,185 18 84 11 13 30 9 0 0
34 Hardin 8,012 1,622 15 134 22 48 96 0 0 0
35 Hawkins 16,648 3,507 31 397 12 52 91 7 0 3
36 Haywood 3,013 3,711 11 60 10 10 19 0 0 0
37 Henderson 8,138 1,800 13 172 9 27 39 1 0 0
38 Henry 9,508 3,063 18 223 15 27 60 4 0 0
39 Hickman 5,695 1,824 20 161 19 15 39 18 0 0
40 Houston 2,182 866 9 88 4 7 12 0 0 0
41 Humphreys 4,930 1,967 17 166 12 23 26 5 0 0
42 Jackson 3,236 1,129 2 62 1 7 17 1 0 0
43 Jefferson 14,776 3,494 34 497 22 76 115 8 0 1
44 Johnson 5,410 988 11 102 7 9 39 6 0 0
45 Knox 105,767 62,878 382 7,458 227 986 1,634 122 0 9
46 Lake 1,357 577 5 18 1 6 6 0 0 0, Lauderdale 4,884 3,056 14 87 13 10 14.1 \
0 Lawrence 12,420 2,821 21 271 13 36 77
1 Lewis 3,585 890 14 59 8 9 42
2 Lincoln 10,398 2,554 19 231 13 39 46
3 Loudon 17,610 4,919 41 573 22 77 87
Just a sample of the data I pulled. Again, not what I completely envisioned, but as a beginner coder, I wanted to clean it up in Sheets
HERE is an image of the PDF I was downloading data from.
Here is the link to download the PDF I am downloading data from
Now I want to import gspread and gspread_dataframe to upload the data into a Google Sheet tab, and this is where I am having problems.
EDIT: Previously neither section included all of my code; the top and bottom portions now contain everything I have written so far.
from oauth2client.service_account import ServiceAccountCredentials
import json
import gspread
SHEET_ID = '18xad0TbNGMPh8gUSIsEr6wNsFzcpKGbyUIQ-A4GQ1bo'
SHEET_NAME = '2016'
gc = gspread.service_account('waynetennesseedems.json')
spreadsheet = gc.open_by_key(SHEET_ID)
worksheet = spreadsheet.worksheet(SHEET_NAME)
from gspread_dataframe import set_with_dataframe
set_with_dataframe(worksheet, df, include_column_header='False')
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
/var/folders/zc/x2w76_4121g3gzfxybkz2q480000gn/T/ipykernel_44678/2784595029.py in <module>
----> 1 set_with_dataframe(worksheet, df, include_column_header='False')
/opt/anaconda3/lib/python3.9/site-packages/gspread_dataframe.py in set_with_dataframe(worksheet, dataframe, row, col, include_index, include_column_header, resize, allow_formulas, string_escaping)
260 # If header-related params are True, the values are adjusted
261 # to allow space for the headers.
--> 262 y, x = dataframe.shape
263 index_col_size = 0
264 column_header_size = 0
AttributeError: 'list' object has no attribute 'shape'
Does it have to do with how my Data was pulled from my PDF?
It seems that df is a list rather than a DataFrame. First, make sure the tabula-py module is installed; second, try passing the parameter output_format='dataframe' to the tabula.read_pdf() function, like so:
import pandas as pd
import json
import gspread
from tabula.io import read_pdf
from oauth2client.service_account import ServiceAccountCredentials
from gspread_dataframe import set_with_dataframe
file_path = 'TnPresidentbyCountyNov2016.pdf'
df = read_pdf(file_path, output_format='dataframe', pages='all', multiple_tables='FALSE', stream='TRUE')
# print (df)
SHEET_ID = '18xad0TbNGMPh8gUSIsEr6wNsFzcpKGbyUIQ-A4GQ1bo'
SHEET_NAME = '2016'
gc = gspread.service_account('waynetennesseedems.json')
spreadsheet = gc.open_by_key(SHEET_ID)
worksheet = spreadsheet.worksheet(SHEET_NAME)
# pass a real boolean: the string 'False' is non-empty and therefore truthy
set_with_dataframe(worksheet, df, include_column_header=False)
I also suggest taking a look at the PEP 8 style guide to get a better idea of how to write a well-formatted script.
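One caveat on the answer above: depending on the tabula-py version, read_pdf may still return a list of DataFrames (one per detected table) even with output_format='dataframe', and the string 'FALSE' counts as true in Python, so multiple_tables='FALSE' does not switch that option off. A minimal sketch of collapsing such a list into one DataFrame before uploading follows. The helper name as_single_frame is hypothetical, and the two small frames stand in for the tables Tabula would return, using a few values from the printed output in the question.

```python
import pandas as pd


def as_single_frame(result):
    """Return one DataFrame whether `result` is a DataFrame or a list of them."""
    if isinstance(result, pd.DataFrame):
        return result
    frames = []
    for frame in result:
        # Tables from different pages rarely share headers, so relabel the
        # columns by position before stacking (the header labels are dropped).
        frame = frame.copy()
        frame.columns = range(frame.shape[1])
        frames.append(frame)
    return pd.concat(frames, ignore_index=True)


# Stand-ins for the list that read_pdf returned in the question.
page_1 = pd.DataFrame([["Bedford", "11,486"], ["Benton", "4,716"]],
                      columns=["Anderson", "19,212"])
page_2 = pd.DataFrame([["Lawrence", "12,420"], ["Lewis", "3,585"]],
                      columns=["Lauderdale", "4,884"])

df = as_single_frame([page_1, page_2])
print(df.shape)

# A non-empty string is always truthy, whatever it spells.
print(bool("FALSE"))
```

The printed output in the question suggests that Tabula used the first data row of each table ("Anderson", "Lauderdale") as the header, so relabelling by position loses those rows; passing pandas_options={'header': None} to read_pdf is one way to keep them as data. The resulting df has a shape attribute, which is what set_with_dataframe needs.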
Related
How to group and sum rows by ID and subtract from group of rows with same ID? [python]
I have the following dataframe:
ID_A ID_B ID_C Value ID_B_Value_Sum
-----------------------------------------------
0 22 5 1 54 208
1 23 5 2 34 208
2 24 6 1 44 268
3 25 6 1 64 268
4 26 5 2 35 208
5 27 7 3 45 229
6 28 7 2 66 229
7 29 8 1 76 161
8 30 8 2 25 161
9 31 6 2 27 268
10 32 5 3 14 208
11 33 5 3 17 208
12 34 6 2 43 268
13 35 6 2 53 268
14 36 8 1 22 161
15 37 7 3 65 229
16 38 7 1 53 229
17 39 8 2 23 161
18 40 8 3 15 161
19 41 6 3 37 268
20 42 5 2 54 208
Each row contains a unique "ID_A", while different rows can have the same "ID_B" and "ID_C". Each row corresponds to its own "Value", where this "Value" can be the same between different rows. The "ID_B_Value_Sum" column contains the sums of all values from the "Value" column for all rows containing the same "ID_B". Calculating this sum is straightforward with python and pandas.
What I want to do is, for each row, take the "ID_B_Value_Sum" column, but subtract all values corresponding to rows containing the same "ID_C", exclusive of the target row.
For example, taking "ID_B" = 6, we see the sum of all the "Value" values from this "ID_B" = 6 group = 268, as shown in all corresponding rows in the "ID_B_Value_Sum" column. Now, two of the rows in this group contain "ID_C" = 1, three rows in this group contain "ID_C" = 2, and one row in this group contains "ID_C" = 3. Starting with row 2, with "ID_C" = 1, this means taking the corresponding "ID_B_Value_Sum" value and subtracting the "Value" values from all other rows containing both "ID_B" = 6 and "ID_C" = 1, exclusive of the target row. And so for row 2 I take 268 - 64 = 204. And for another example, for row 4, this means 208 - 34 - 54 = 120. And another example, for row 7, this means 161 - 22 = 139. These new values will go in a new "Value_Sum_New" column for each row.
And so I want to produce the following output dataframe:
ID_A ID_B ID_C Value ID_B_Value_Sum Value_Sum_New
---------------------------------------------------------------
0 22 5 1 54 208 XX
1 23 5 2 34 208 XX
2 24 6 1 44 268 204
3 25 6 1 64 268 XX
4 26 5 2 35 208 120
5 27 7 3 45 229 XX
6 28 7 2 66 229 XX
7 29 8 1 76 161 139
8 30 8 2 25 161 XX
9 31 6 2 27 268 XX
10 32 5 3 14 208 XX
11 33 5 3 17 208 XX
12 34 6 2 43 268 XX
13 35 6 2 53 268 XX
14 36 8 1 22 161 XX
15 37 7 3 65 229 XX
16 38 7 1 53 229 XX
17 39 8 2 23 161 XX
18 40 8 3 15 161 XX
19 41 6 3 37 268 XX
20 42 5 2 54 208 XX
What I am having trouble conceptualizing is how to, for each row, group together all rows with the same "ID_B", then sub-group those rows by "ID_C" and subtract the sub-group's sum from the "ID_B_Value_Sum" of the target row, while still including the "Value" from the target row, to create the final "Value_Sum_New". It seems like so many actions and sub-actions, and I am unsure how to organize and order the workflow in a simple and streamlined manner. How might I approach calculating this sum in python?
IIUC, you need:
df['Value_Sum_New'] = (df['ID_B_Value_Sum']
                       - df.groupby(['ID_B', 'ID_C'])['Value'].transform('sum')
                       + df['Value']
                       )
output:
ID_A ID_B ID_C Value ID_B_Value_Sum Value_Sum_New
0 22 5 1 54 208 208
1 23 5 2 34 208 119
2 24 6 1 44 268 204
3 25 6 1 64 268 224
4 26 5 2 35 208 120
5 27 7 3 45 229 164
6 28 7 2 66 229 229
7 29 8 1 76 161 139
8 30 8 2 25 161 138
9 31 6 2 27 268 172
10 32 5 3 14 208 191
11 33 5 3 17 208 194
12 34 6 2 43 268 188
13 35 6 2 53 268 198
14 36 8 1 22 161 85
15 37 7 3 65 229 184
16 38 7 1 53 229 229
17 39 8 2 23 161 136
18 40 8 3 15 161 161
19 41 6 3 37 268 268
20 42 5 2 54 208 139
explanation
As you said, computing a sum per group is easy in pandas. You can actually compute ID_B_Value_Sum with:
df['ID_B_Value_Sum'] = df.groupby('ID_B')['Value'].transform('sum')
Now we do the same for groups of ID_B + ID_C, we subtract it from ID_B_Value_Sum, and as we want to exclude only the other rows in the group, we add back the row Value itself.
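The approach in this answer can be checked end to end against the three values the question works out by hand (204 for row 2, 120 for row 4, 139 for row 7). The sketch below rebuilds the question's data and computes both sums from scratch; the variable names b_sum and bc_sum are only illustrative.

```python
import pandas as pd

# Data copied from the question's dataframe.
df = pd.DataFrame({
    "ID_A": range(22, 43),
    "ID_B": [5, 5, 6, 6, 5, 7, 7, 8, 8, 6, 5, 5, 6, 6, 8, 7, 7, 8, 8, 6, 5],
    "ID_C": [1, 2, 1, 1, 2, 3, 2, 1, 2, 2, 3, 3, 2, 2, 1, 3, 1, 2, 3, 3, 2],
    "Value": [54, 34, 44, 64, 35, 45, 66, 76, 25, 27, 14,
              17, 43, 53, 22, 65, 53, 23, 15, 37, 54],
})

# Sum of Value over every row sharing the same ID_B.
b_sum = df.groupby("ID_B")["Value"].transform("sum")
# Sum of Value over every row sharing the same (ID_B, ID_C) pair.
bc_sum = df.groupby(["ID_B", "ID_C"])["Value"].transform("sum")

df["ID_B_Value_Sum"] = b_sum
# Remove the whole (ID_B, ID_C) sub-group, then add the row's own Value back.
df["Value_Sum_New"] = b_sum - bc_sum + df["Value"]

print(df.loc[[2, 4, 7], ["ID_A", "ID_B_Value_Sum", "Value_Sum_New"]])
```

Because transform returns a result aligned to the original index, no merge or loop is needed to bring the group sums back onto each row.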
How to plot multiple chart on one figure and combine with another?
# Create an axes object
axes = plt.gca()
# pass the axes object to plot function
df.plot(kind='line', x='鄉鎮別', y='男', ax=axes,figsize=(10,8));
df.plot(kind='line', x='鄉鎮別', y='女', ax=axes,figsize=(10,8));
df.plot(kind='line', x='鄉鎮別', y='合計(男+女)', ax=axes,figsize=(10,8),title='hihii', xlabel='鄉鎮別',ylabel='人數')
This is my data:
鄉鎮別 鄰數 戶數 男 女 合計(男+女) 遷入 遷出 出生 死亡 結婚 離婚
0 苗栗市 715 32517 42956 43362 86318 212 458 33 65 28 13
1 苑裡鎮 362 15204 22979 21040 44019 118 154 17 24 9 7
2 通霄鎮 394 11557 17034 15178 32212 73 113 5 33 3 3
3 竹南鎮 518 32061 44069 43275 87344 410 392 31 59 35 11
4 頭份市 567 38231 52858 52089 104947 363 404 39 69 31 19
5 後龍鎮 367 12147 18244 16274 34518 93 144 12 41 2 7
6 卓蘭鎮 176 5861 8206 7504 15710 29 51 1 11 2 0
7 大湖鄉 180 5206 7142 6238 13380 31 59 5 21 3 2
8 公館鄉 281 10842 16486 15159 31645 89 169 12 32 5 3
9 銅鑼鄉 218 6106 8887 7890 16777 57 62 7 13 4 1
10 南庄鄉 184 3846 5066 4136 9202 22 48 1 10 0 2
11 頭屋鄉 120 3596 5289 4672 9961 59 53 2 11 4 4
12 三義鄉 161 5625 8097 7205 15302 47 63 3 12 3 5
13 西湖鄉 108 2617 3653 2866 6519 38 20 1 17 3 0
14 造橋鄉 115 4144 6276 5545 11821 44 64 3 11 3 2
15 三灣鄉 93 2331 3395 2832 6227 27 18 2 9 0 2
16 獅潭鄉 98 1723 2300 1851 4151 28 10 1 4 0 0
17 泰安鄉 64 1994 3085 2642 5727 36 26 2 8 4 1
18 總計 4721 195608 276022 259758 535780 1776 2308 177 450 139 82
This is my output from df.plot.
My first question is how to display the Chinese text.
My second question is whether I can plot a line chart without using df.plot.
My last question: I need four graphs (using subplot): line graphs of the male, female and total population (男、女、合計(男+女)) in each township; line graphs of in-migration and out-migration (遷入和遷出); a bar graph of the number of households (戶數); and line graphs of births and deaths (出生和死亡).
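One possible way to approach the three questions is sketched below with plain matplotlib instead of df.plot. It uses only the first three townships from the table above to stay short. The font names are assumptions: at least one font with Chinese glyphs has to be installed for the labels to render, and which one is available depends on the system. The Agg backend line is there only so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import pandas as pd

# Assumed font names; matplotlib falls back to the next entry if one is missing.
plt.rcParams["font.sans-serif"] = ["Microsoft JhengHei", "Noto Sans CJK TC", "DejaVu Sans"]
plt.rcParams["axes.unicode_minus"] = False

# First three rows of the question's data (the 總計 total row is left out).
df = pd.DataFrame({
    "鄉鎮別": ["苗栗市", "苑裡鎮", "通霄鎮"],
    "戶數": [32517, 15204, 11557],
    "男": [42956, 22979, 17034],
    "女": [43362, 21040, 15178],
    "合計(男+女)": [86318, 44019, 32212],
    "遷入": [212, 118, 73],
    "遷出": [458, 154, 113],
    "出生": [33, 17, 5],
    "死亡": [65, 24, 33],
})

fig, axes = plt.subplots(2, 2, figsize=(12, 8))
x = df["鄉鎮別"]

# Each entry: (target axes, columns to draw as lines on it).
line_panels = [
    (axes[0, 0], ["男", "女", "合計(男+女)"]),
    (axes[0, 1], ["遷入", "遷出"]),
    (axes[1, 1], ["出生", "死亡"]),
]
for ax, columns in line_panels:
    for column in columns:
        ax.plot(x, df[column], marker="o", label=column)
    ax.set_xlabel("鄉鎮別")
    ax.legend()

# Households as a bar chart.
axes[1, 0].bar(x, df["戶數"])
axes[1, 0].set_xlabel("鄉鎮別")
axes[1, 0].set_ylabel("戶數")

fig.tight_layout()
```

Calling fig.savefig with a file name, or plt.show() with an interactive backend, would then produce the figure. With the full table, filtering out the 總計 row first keeps the totals from dwarfing the per-township values.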
how to split an integer value from one column to two columns in text file using pandas or numpy (python)
I have a text file that contains integer values like this:
20180701 20180707 52 11 1 2 4 1 0 0 10 7 1 3 1 0 4 5 2
20180708 20180714 266 8 19 3 2 9 7 25 20 17 12 9 9 27 34 54 11
20180715 20180721 654 52 34 31 20 16 12 25 84 31 38 37 38 69 66 87 14
20180722 201807281017 110 72 46 52 29 29 22 204 41 46 51 57 67 82 92 17
20180729 201808041106 276 37 11 87 20 10 8 284 54 54 72 38 49 41 53 12
20180805 20180811 624 78 19 15 55 16 8 9 172 15 31 35 38 47 29 36 21
20180812 20180818 488 63 17 7 26 10 9 7 116 17 14 39 31 34 27 64 7
20180819 20180825 91 4 7 0 4 5 1 3 16 3 4 5 10 10 7 11 1
20180826 20180901 49 2 2 1 0 4 0 1 2 0 1 4 8 2 6 6 10
I have to build a file by merging several files like this, but you can see a problem with this data. In lines 4 and 5, the first values, 1017 and 1106, sit directly against the end of the period index, which causes a problem. When I read these two lines, the first value after the index columns is not recognized as a separate value, and I always get this result:
In [14]: fw.iloc[80,:]
Out[14]:
3 72.0
4 46.0
5 52.0
6 29.0
7 29.0
8 22.0
9 204.0
10 41.0
11 46.0
12 51.0
13 57.0
14 67.0
15 82.0
16 92.0
17 17.0
18 NaN
Name: (20180722, 201807281017), dtype: float64
I tried to correct it with indexing but failed. The desired result is:
In [14]: fw.iloc[80,:]
Out[14]:
2 1017.0
3 110.0
4 72.0
5 46.0
6 52.0
7 29.0
8 29.0
9 22.0
10 204.0
11 41.0
12 46.0
13 51.0
14 57.0
15 67.0
16 82.0
17 92.0
18 17.0
Name: (20180722, 201807281017), dtype: float64
How can I solve this problem?
+ I used this code to read the file:
fw = pd.read_csv('warm_patient.txt', index_col=[0,1], header=None, delim_whitespace=True)
A better fit for this would be pandas.read_fwf. For your example:
df = pd.read_fwf(filename, index_col=[0,1], header=None, widths=2*[10]+17*[4])
I don't know if the column widths can be inferred for all your data or need to be hardcoded.
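To illustrate why fixed-width parsing helps here, the sketch below builds three records in an assumed layout (a 10-character start date field, an 8-character end date field, then 4 characters per value) and reads them with read_fwf. The layout is inferred from the fused value 201807281017 in the question, not taken from the real file, so the widths list would need to match the actual column positions. The helper format_record is hypothetical, and only the first three value columns are reproduced.

```python
import io

import pandas as pd


def format_record(start, end, values):
    """Build one line in the assumed layout: 10 + 8 characters, then 4 per value."""
    return f"{start:<10}{end:>8}" + "".join(f"{v:>4}" for v in values)


sample = "\n".join([
    format_record(20180715, 20180721, [654, 52, 34]),
    format_record(20180722, 20180728, [1017, 110, 72]),
    format_record(20180729, 20180804, [1106, 276, 37]),
])

# Whitespace splitting sees one token too few on the fused lines.
print(len(sample.splitlines()[1].split()))

# Fixed-width parsing cuts at column positions instead of at whitespace.
fw = pd.read_fwf(io.StringIO(sample), widths=[10, 8] + 3 * [4], header=None)
fw = fw.set_index([0, 1])
print(fw)
```

Here the index is set after reading to keep the two steps separate; passing index_col=[0, 1] as in the answer is the more compact form.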
One possibility would be to construct the dataframe manually; this way we can parse the text by splitting the values every 4 characters.
from textwrap import wrap
import pandas as pd

def read_file(f_name):
    data = []
    with open(f_name) as f:
        for line in f.readlines():
            idx1 = line[0:8]
            idx2 = line[10:18]
            points = map(lambda x: int(x.replace(" ", "")), wrap(line.rstrip()[18:], 4))
            data.append([idx1, idx2, *points])
    return pd.DataFrame(data).set_index([0, 1])
It could be made somewhat more efficient (in particular if this is a particularly long text file), but here's one solution.
fw = pd.read_csv('test.txt', header=None, delim_whitespace=True)
for i in fw[pd.isna(fw.iloc[:,-1])].index:
    num_str = str(fw.iat[i,1])
    a,b = map(int,[num_str[:-4],num_str[-4:]])
    fw.iloc[i,3:] = fw.iloc[i,2:-1]
    fw.iloc[i,:3] = [fw.iat[i,0],a,b]
fw = fw.set_index([0,1])
The result of print(fw) from there is
2 3 4 5 6 7 8 9 10 11 12 13 14 15 \
0 1
20180701 20180707 52 11 1 2 4 1 0 0 10 7 1 3 1 0
20180708 20180714 266 8 19 3 2 9 7 25 20 17 12 9 9 27
20180715 20180721 654 52 34 31 20 16 12 25 84 31 38 37 38 69
20180722 20180728 1017 110 72 46 52 29 29 22 204 41 46 51 57 67
20180729 20180804 1106 276 37 11 87 20 10 8 284 54 54 72 38 49
20180805 20180811 624 78 19 15 55 16 8 9 172 15 31 35 38 47
20180812 20180818 488 63 17 7 26 10 9 7 116 17 14 39 31 34
20180819 20180825 91 4 7 0 4 5 1 3 16 3 4 5 10 10
20180826 20180901 49 2 2 1 0 4 0 1 2 0 1 4 8 2
16 17 18
0 1
20180701 20180707 4 5 2.0
20180708 20180714 34 54 11.0
20180715 20180721 66 87 14.0
20180722 20180728 82 92 17.0
20180729 20180804 41 53 12.0
20180805 20180811 29 36 21.0
20180812 20180818 27 64 7.0
20180819 20180825 7 11 1.0
20180826 20180901 6 6 10.0
Here's the result of the print after applying your initial solution of
fw = pd.read_csv('test.txt', index_col=[0,1], header=None, delim_whitespace=True)
for comparison.
2 3 4 5 6 7 8 9 10 11 12 13 14 \
0 1
20180701 20180707 52 11 1 2 4 1 0 0 10 7 1 3 1
20180708 20180714 266 8 19 3 2 9 7 25 20 17 12 9 9
20180715 20180721 654 52 34 31 20 16 12 25 84 31 38 37 38
20180722 201807281017 110 72 46 52 29 29 22 204 41 46 51 57 67
20180729 201808041106 276 37 11 87 20 10 8 284 54 54 72 38 49
20180805 20180811 624 78 19 15 55 16 8 9 172 15 31 35 38
20180812 20180818 488 63 17 7 26 10 9 7 116 17 14 39 31
20180819 20180825 91 4 7 0 4 5 1 3 16 3 4 5 10
20180826 20180901 49 2 2 1 0 4 0 1 2 0 1 4 8
15 16 17 18
0 1
20180701 20180707 0 4 5 2.0
20180708 20180714 27 34 54 11.0
20180715 20180721 69 66 87 14.0
20180722 201807281017 82 92 17 NaN
20180729 201808041106 41 53 12 NaN
20180805 20180811 47 29 36 21.0
20180812 20180818 34 27 64 7.0
20180819 20180825 10 7 11 1.0
20180826 20180901 2 6 6 10.0
Reading XML (NIST .n42 file) and data extraction
I have an XML file and need to extract the data in 'ChannelData' from the XML below.
from xml.dom import minidom
xmldoc = minidom.parse('Annex_B_n42.xml')
itemlist = xmldoc.getElementsByTagName('ChannelData')
print(len(itemlist))
print(itemlist[0].attributes['compressionCode'].value)
for s in itemlist:
    print(s.attributes['compressionCode'].value)
This doesn't return the data, just the value 'None'. I also tried another approach from another example:
import xml.etree.ElementTree as ET
root = ET.parse('Annex_B_n42.xml').getroot()
#value=[]
for type_tag in root.findall('Spectrum'):
    value = type_tag.get('id')
    print(value)
print("data from file " +str(value))
This did not work at all, and value is not being populated. I really don't understand how to parse XML. Here is the XML file:
<?xml version="1.0"?>
<?xml-model href="http://physics.nist.gov/N42/2011/N42/schematron/n42.sch" type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?>
<RadInstrumentData xmlns="http://physics.nist.gov/N42/2011/N42" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://physics.nist.gov/N42/2011/N42 file:///d:/Data%20Files/ANSI%20N42%2042/V2/Schema/n42.xsd" n42DocUUID="d72b7fa7-4a20-43d4-b1b2-7e3b8c6620c1">
  <RadInstrumentInformation id="RadInstrumentInformation-1">
    <RadInstrumentManufacturerName>RIIDs R Us</RadInstrumentManufacturerName>
    <RadInstrumentModelName>iRIID</RadInstrumentModelName>
    <RadInstrumentClassCode>Radionuclide Identifier</RadInstrumentClassCode>
    <RadInstrumentVersion>
      <RadInstrumentComponentName>Software</RadInstrumentComponentName>
      <RadInstrumentComponentVersion>1.1</RadInstrumentComponentVersion>
    </RadInstrumentVersion>
  </RadInstrumentInformation>
  <RadDetectorInformation id="RadDetectorInformation-1">
    <RadDetectorCategoryCode>Gamma</RadDetectorCategoryCode>
    <RadDetectorKindCode>NaI</RadDetectorKindCode>
  </RadDetectorInformation>
  <EnergyCalibration id="EnergyCalibration-1">
    <CoefficientValues>-21.8 12.1 6.55e-03</CoefficientValues>
  </EnergyCalibration>
  <RadMeasurement id="RadMeasurement-1">
    <MeasurementClassCode>Foreground</MeasurementClassCode>
    <StartDateTime>2003-11-22T23:45:19-07:00</StartDateTime>
    <RealTimeDuration>PT60S</RealTimeDuration>
    <Spectrum id="RadMeasurement-1Spectrum-1" radDetectorInformationReference="RadDetectorInformation-1" energyCalibrationReference="EnergyCalibration-1">
      <LiveTimeDuration>PT59.61S</LiveTimeDuration>
      <ChannelData compressionCode="None"> 0 0 0 22 421 847 1295 1982 2127 2222 2302 2276 2234 1921 1939 1715 1586 1469 1296 1178 1127 1047 928 760 679 641 542 529 443 423 397 393 322 272 294 227 216 224 208 191 189 163 167 173 150 137 136 129 150 142 160 159 140 103 90 82 83 85 67 76 73 84 63 74 70 69 76 61 49 61 63 65 58 62 48 75 56 61 46 56 43 37 55 47 50 40 38 54 43 41 45 51 32 35 29 33 40 44 33 35 20 26 27 17 19 20 16 19 18 19 18 20 17 45 55 70 62 59 32 30 21 23 10 9 5 13 11 11 6 7 7 9 11 4 8 8 14 14 11 9 13 5 5 6 10 9 3 4 3 7 5 5 4 5 3 6 5 0 5 6 3 1 4 4 3 10 11 4 1 4 2 11 9 6 3 5 5 1 4 2 6 6 2 3 0 2 2 2 2 0 1 3 1 1 2 3 2 4 5 2 6 4 1 0 3 1 2 1 1 0 1 0 0 2 0 1 0 0 0 1 0 0 0 0 0 0 0 2 0 0 0 1 0 1 0 0 2 1 0 0 0 0 1 3 0 0 0 1 0 1 0 0 0 0 0 0 </ChannelData>
    </Spectrum>
  </RadMeasurement>
</RadInstrumentData>
You can use BeautifulSoup to get the channeldata tag value as follows:
from bs4 import BeautifulSoup
with open('Annex_B_n42.xml') as f:
    xml = f.read()
# name the parser explicitly; html.parser lowercases tag names, hence "channeldata"
bs_obj = BeautifulSoup(xml, 'html.parser')
print(bs_obj.find_all("channeldata")[0].text)
That will print
' 0 0 0 22 421 847 1295 1982 2127 2222 2302 2276 2234 1921 1939 1715 1586 1469 1296 1178 1127 1047 928 760 679 641 542 529 443 423 397 393 322 272 294 227 216 224 208 191 189 163 167 173 150 137 136 129 150 142 160 159 140 103 90 82 83 85 67 76 73 84 63 74 70 69 76 61 49 61 63 65 58 62 48 75 56 61 46 56 43 37 55 47 50 40 38 54 43 41 45 51 32 35 29 33 40 44 33 35 20 26 27 17 19 20 16 19 18 19 18 20 17 45 55 70 62 59 32 30 21 23 10 9 5 13 11 11 6 7 7 9 11 4 8 8 14 14 11 9 13 5 5 6 10 9 3 4 3 7 5 5 4 5 3 6 5 0 5 6 3 1 4 4 3 10 11 4 1 4 2 11 9 6 3 5 5 1 4 2 6 6 2 3 0 2 2 2 2 0 1 3 1 1 2 3 2 4 5 2 6 4 1 0 3 1 2 1 1 0 1 0 0 2 0 1 0 0 0 1 0 0 0 0 0 0 0 2 0 0 0 1 0 1 0 0 2 1 0 0 0 0 1 3 0 0 0 1 0 1 0 0 0 0 0 0 '
Try this:
import xml.etree.ElementTree as ET
root = ET.parse('Annex_B_n42.xml').getroot()
# XPath attribute predicates use '@', not '#'
elems = root.findall(".//*[@compressionCode='None']")
print(elems[0].text)
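The ElementTree attempt in the question finds nothing because every element in the file lives in the default namespace http://physics.nist.gov/N42/2011/N42, so a bare 'Spectrum' does not match; the minidom attempt prints the compressionCode attribute, whose value happens to be the string "None", instead of the element text. A minimal sketch using a namespace map follows. The prefix n42 and the function name read_channel_counts are arbitrary choices, and the embedded XML is a trimmed stand-in for the file with only the first eight channel values.

```python
import xml.etree.ElementTree as ET

# Prefix chosen for this sketch; the URI is the file's default namespace.
NS = {"n42": "http://physics.nist.gov/N42/2011/N42"}

# Trimmed stand-in for Annex_B_n42.xml (first eight channel values only).
xml_text = """<?xml version="1.0"?>
<RadInstrumentData xmlns="http://physics.nist.gov/N42/2011/N42">
  <RadMeasurement id="RadMeasurement-1">
    <Spectrum id="RadMeasurement-1Spectrum-1">
      <ChannelData compressionCode="None"> 0 0 0 22 421 847 1295 1982 </ChannelData>
    </Spectrum>
  </RadMeasurement>
</RadInstrumentData>"""


def read_channel_counts(root):
    """Map each Spectrum id to its ChannelData counts as a list of ints."""
    counts = {}
    for spectrum in root.findall(".//n42:Spectrum", NS):
        channel = spectrum.find("n42:ChannelData", NS)
        counts[spectrum.get("id")] = [int(token) for token in channel.text.split()]
    return counts


root = ET.fromstring(xml_text)
counts = read_channel_counts(root)
print(counts)
```

For the real file, ET.parse('Annex_B_n42.xml').getroot() would replace ET.fromstring(xml_text).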
performing differences between rows in pandas based on columns values
I have this dataframe, and I'm trying to create a new column that stores the difference in products sold, based on code and date. For example, this is the starting dataframe:
date code sold
0 20150521 0 47
1 20150521 12 39
2 20150521 16 39
3 20150521 20 38
4 20150521 24 38
5 20150521 28 37
6 20150521 32 36
7 20150521 4 43
8 20150521 8 43
9 20150522 0 47
10 20150522 12 37
11 20150522 16 36
12 20150522 20 36
13 20150522 24 36
14 20150522 28 35
15 20150522 32 31
16 20150522 4 42
17 20150522 8 41
18 20150523 0 50
19 20150523 12 48
20 20150523 16 46
21 20150523 20 46
22 20150523 24 46
23 20150523 28 45
24 20150523 32 42
25 20150523 4 49
26 20150523 8 49
27 20150524 0 39
28 20150524 12 33
29 20150524 16 30
... ... ... ...
150 20150606 32 22
151 20150606 4 34
152 20150606 8 33
153 20150607 0 31
154 20150607 12 30
155 20150607 16 30
156 20150607 20 29
157 20150607 24 28
158 20150607 28 26
159 20150607 32 24
160 20150607 4 30
161 20150607 8 30
162 20150608 0 47
I think this could be a solution...
full_df1=full_df[full_df.date == '20150609'].reset_index(drop=True)
full_df1['code'] = full_df1['code'].astype(float)
# DataFrame.sort was removed from pandas; sort_values is its replacement
full_df1= full_df1.sort_values(['code'], ascending=[False])
code date sold
8 32 20150609 33
7 28 20150609 36
6 24 20150609 37
5 20 20150609 39
4 16 20150609 42
3 12 20150609 46
2 8 20150609 49
1 4 20150609 49
0 0 20150609 50
full_df1.set_index('code')['sold'].diff().reset_index()
That gives me this output for the single date 20150609:
code difference
0 32 NaN
1 28 3
2 24 1
3 20 2
4 16 3
5 12 4
6 8 3
7 4 0
8 0 1
Is there a better, more pythonic way to get the same result? I would like to create a new column [difference] and store the data there, ending up with 4 columns [date, code, sold, difference].
This is exactly the kind of thing that pandas' groupby functionality is built for, and I highly recommend reading and working through the pandas groupby documentation.
This code replicates what you are asking for, but for every date.
df = pd.DataFrame({'date':['Mon','Mon','Mon','Tue','Tue','Tue'],'code':[10,21,30,10,21,30], 'sold':[12,13,34,10,15,20]})
df['difference'] = df.groupby('date')['sold'].diff()
df
code date sold difference
0 10 Mon 12 NaN
1 21 Mon 13 1
2 30 Mon 34 21
3 10 Tue 10 NaN
4 21 Tue 15 5
5 30 Tue 20 5
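Applied to the question's own numbers, the same groupby idea needs one extra step: the rows have to be sorted by code in descending order within each date before diff is taken, since diff works on row order. The sketch below uses the 20150521 and 20150609 values from the question; the unsorted input order for 20150609 is an assumption that mirrors the other dates.

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["20150521"] * 9 + ["20150609"] * 9,
    "code": [0, 12, 16, 20, 24, 28, 32, 4, 8] * 2,
    "sold": [47, 39, 39, 38, 38, 37, 36, 43, 43,
             50, 46, 42, 39, 37, 36, 33, 49, 49],
})

# Order each date's rows from the highest code to the lowest.
df = df.sort_values(["date", "code"], ascending=[True, False]).reset_index(drop=True)

# diff() restarts at every new date, so the first row of each date is NaN.
df["difference"] = df.groupby("date")["sold"].diff()

print(df[df["date"] == "20150609"])
```

For 20150609 this reproduces the differences listed in the question (NaN, 3, 1, 2, 3, 4, 3, 0, 1) while keeping all four columns in a single frame.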