Plot the result of a groupby operation in pandas - python

I have this sample table:
ID Date Days Volume/Day
0 111 2016-01-01 20 50
1 111 2016-02-01 25 40
2 111 2016-03-01 31 35
3 111 2016-04-01 30 30
4 111 2016-05-01 31 25
5 111 2016-06-01 30 20
6 111 2016-07-01 31 20
7 111 2016-08-01 31 15
8 111 2016-09-01 29 15
9 111 2016-10-01 31 10
10 111 2016-11-01 29 5
11 111 2016-12-01 27 0
0 112 2016-01-01 31 55
1 112 2016-02-01 26 45
2 112 2016-03-01 31 40
3 112 2016-04-01 30 35
4 112 2016-04-01 31 30
5 112 2016-05-01 30 25
6 112 2016-06-01 31 25
7 112 2016-07-01 31 20
8 112 2016-08-01 30 20
9 112 2016-09-01 31 15
10 112 2016-11-01 29 10
11 112 2016-12-01 31 0
I'm trying to make my final table look like the one below after grouping by ID and Date.
ID Date CumDays Volume/Day
0 111 2016-01-01 20 50
1 111 2016-02-01 45 40
2 111 2016-03-01 76 35
3 111 2016-04-01 106 30
4 111 2016-05-01 137 25
5 111 2016-06-01 167 20
6 111 2016-07-01 198 20
7 111 2016-08-01 229 15
8 111 2016-09-01 258 15
9 111 2016-10-01 289 10
10 111 2016-11-01 318 5
11 111 2016-12-01 345 0
0 112 2016-01-01 31 55
1 112 2016-02-01 57 45
2 112 2016-03-01 88 40
3 112 2016-04-01 118 35
4 112 2016-05-01 149 30
5 112 2016-06-01 179 25
6 112 2016-07-01 210 25
7 112 2016-08-01 241 20
8 112 2016-09-01 271 20
9 112 2016-10-01 302 15
10 112 2016-11-01 331 10
11 112 2016-12-01 362 0
Next, I want to extract the first value of Volume/Day per ID, all the CumDays values, and all the Volume/Day values per ID and Date, so I can use them for further computation and for plotting Volume/Day vs CumDays. For example, the first value of Volume/Day is 50 for ID 111 and 55 for ID 112. The CumDays values are 20, 45, ... for ID 111 and 31, 57, ... for ID 112. The Volume/Day values are 50, 40, ... for ID 111 and 55, 45, ... for ID 112.
My solution:
def get_time_rate(grp_df):
    t = grp_df['Days'].cumsum()
    r = grp_df['Volume/Day']
    return t, r

vals = df.groupby(['ID','Date']).apply(get_time_rate)
vals
With this approach, the cumulative calculation has no effect: it returns the original Days values. As a result, I can't go on to extract the first value of Volume/Day, all the CumDays values, and all the Volume/Day values I need. Any advice on how to go about this would be appreciated. Thanks.

Get a groupby object.
g = df.groupby('ID')
Compute columns with transform:
df['CumDays'] = g.Days.transform('cumsum')
df['First Volume/Day'] = g['Volume/Day'].transform('first')
df
ID Date Days Volume/Day CumDays First Volume/Day
0 111 2016-01-01 20 50 20 50
1 111 2016-02-01 25 40 45 50
2 111 2016-03-01 31 35 76 50
3 111 2016-04-01 30 30 106 50
4 111 2016-05-01 31 25 137 50
5 111 2016-06-01 30 20 167 50
6 111 2016-07-01 31 20 198 50
7 111 2016-08-01 31 15 229 50
8 111 2016-09-01 29 15 258 50
9 111 2016-10-01 31 10 289 50
10 111 2016-11-01 29 5 318 50
11 111 2016-12-01 27 0 345 50
0 112 2016-01-01 31 55 31 55
1 112 2016-01-02 26 45 57 55
2 112 2016-01-03 31 40 88 55
3 112 2016-01-04 30 35 118 55
4 112 2016-01-05 31 30 149 55
5 112 2016-01-06 30 25 179 55
6 112 2016-01-07 31 25 210 55
7 112 2016-01-08 31 20 241 55
8 112 2016-01-09 30 20 271 55
9 112 2016-01-10 31 15 302 55
10 112 2016-01-11 29 10 331 55
11 112 2016-01-12 31 0 362 55
If you want grouped plots, iterate over the groups after grouping by ID and draw each one on the same axes, using CumDays as the x-axis.
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(8,6))
# df is the frame that already holds the CumDays column computed above
for i, g in df.groupby('ID'):
    g.plot(x='CumDays', y='Volume/Day', ax=ax, label=str(i))
plt.show()
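A likely reason the original attempt returned the plain Days values: grouping by both ID and Date leaves (almost) every group with a single row, so cumsum has nothing to accumulate. The sketch below groups by ID only and then pulls out the per-ID pieces the question asks for. It uses a shortened copy of the sample table, and the names first_volume and per_id are just illustrative.

```python
import pandas as pd

# Shortened copy of the sample table (first three rows of each ID)
df = pd.DataFrame({
    "ID": [111, 111, 111, 112, 112, 112],
    "Days": [20, 25, 31, 31, 26, 31],
    "Volume/Day": [50, 40, 35, 55, 45, 40],
})

# Group by ID only, so all rows of one ID are accumulated together
df["CumDays"] = df.groupby("ID")["Days"].cumsum()

# First Volume/Day per ID, as a Series indexed by ID
first_volume = df.groupby("ID")["Volume/Day"].first()

# All CumDays and Volume/Day values per ID, as NumPy arrays
per_id = {
    id_: (grp["CumDays"].to_numpy(), grp["Volume/Day"].to_numpy())
    for id_, grp in df.groupby("ID")
}
```

per_id[111] then gives a (CumDays, Volume/Day) pair of arrays that can be passed on to further computation or to a plotting call.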

Related

Cannot read PDF Data into Sheets with Gspread-DataFrame

I want to load data that I read from a PDF with Tabula into Google Sheets, but when I transfer the data as read, I get an error. I know the data is dirty, but I wanted to clean it up in Google Sheets.
Downloading data from the PDF (first portion of the full code):
import tabula
import pandas as pd
file_path = 'TnPresidentbyCountyNov2016.pdf'
df = tabula.read_pdf(file_path, pages='all', multiple_tables='FALSE', stream='TRUE')
print (df)
[ Anderson 19,212 9,013 74 1,034 42 174 189 28 0 0.1
0 Bedford 11,486 3,395 25 306 8 47 75 5 0 0
1 Benton 4,716 1,474 12 83 13 11 14 2 0 0
2 Bledsoe 3,622 897 7 95 4 9 18 2 0 0
3 Blount 37,443 12,100 83 1,666 72 250 313 51 1 1
4 Bradley 29,768 7,070 66 1,098 44 143 210 29 1 1
5 Campbell 9,870 2,248 32 251 25 43 45 5 0 0
6 Cannon 4,007 1,127 8 106 7 18 29 3 0 0
7 Carroll 7,756 2,327 22 181 20 18 39 2 0 0
8 Carter 16,898 3,453 30 409 20 54 130 26 0 0
9 Cheatham 11,297 3,878 26 463 13 50 99 8 0 0
10 Chester 5,081 1,243 5 115 4 12 10 4 0 0
11 Claiborne 8,602 1,832 16 192 24 27 29 2 0 0
12 Clay 2,141 707 2 47 2 10 11 0 0 0
13 Cocke 9,791 1,981 21 211 19 27 59 2 0 2
14 Coffee 14,417 4,743 32 517 23 62 113 9 0 1
15 Crockett 3,982 1,303 7 76 3 8 13 1 0 0
16 Cumberland 20,413 5,202 37 471 26 53 99 17 0 1
17 Davidson 84,550 148,864 412 9,603 304 619 2,459 106 0 6
18 Decatur 3,588 894 5 70 4 8 16 2 0 0
19 DeKalb 5,171 1,569 10 117 6 29 49 0 0 0
20 Dickson 13,233 4,722 32 489 18 58 94 9 0 3
21 Dyer 10,180 2,816 19 193 13 27 48 3 0 0
22 Fayette 13,055 5,874 19 261 16 37 62 21 0 0
23 Fentress 6,038 1,100 10 107 14 11 37 1 0 0
24 Franklin 11,532 4,374 28 319 16 36 66 7 0 0
25 Gibson 13,786 5,258 26 305 18 36 66 8 0 0
26 Giles 7,970 2,917 16 162 11 11 41 1 0 0
27 Grainger 6,626 1,154 17 130 12 28 26 4 0 0
28 Greene 18,562 4,216 28 481 29 56 152 14 0 0
29 Grundy 3,636 999 11 80 3 13 19 0 0 0
30 Hamblen 15,857 4,075 30 443 27 73 93 8 0 0
31 Hamilton 78,733 55,316 147 5,443 138 349 1,098 121 0 0
32 Hancock 1,843 322 4 42 1 5 13 0 0 0
33 Hardeman 4,919 4,185 18 84 11 13 30 9 0 0
34 Hardin 8,012 1,622 15 134 22 48 96 0 0 0
35 Hawkins 16,648 3,507 31 397 12 52 91 7 0 3
36 Haywood 3,013 3,711 11 60 10 10 19 0 0 0
37 Henderson 8,138 1,800 13 172 9 27 39 1 0 0
38 Henry 9,508 3,063 18 223 15 27 60 4 0 0
39 Hickman 5,695 1,824 20 161 19 15 39 18 0 0
40 Houston 2,182 866 9 88 4 7 12 0 0 0
41 Humphreys 4,930 1,967 17 166 12 23 26 5 0 0
42 Jackson 3,236 1,129 2 62 1 7 17 1 0 0
43 Jefferson 14,776 3,494 34 497 22 76 115 8 0 1
44 Johnson 5,410 988 11 102 7 9 39 6 0 0
45 Knox 105,767 62,878 382 7,458 227 986 1,634 122 0 9
46 Lake 1,357 577 5 18 1 6 6 0 0 0, Lauderdale 4,884 3,056 14 87 13 10 14.1 \
0 Lawrence 12,420 2,821 21 271 13 36 77
1 Lewis 3,585 890 14 59 8 9 42
2 Lincoln 10,398 2,554 19 231 13 39 46
3 Loudon 17,610 4,919 41 573 22 77 87
Just a sample of the data I pulled. Again, not what I completely envisioned, but as a beginner coder, I wanted to clean it up in Sheets
HERE is an image of the PDF I was downloading data from.
Here is the link to download the PDF I am downloading data from
Now I want to import gspread and gspread_dataframe to upload the data into a Google Sheet tab, and this is where I am having problems.
EDIT: Previously neither section included all of my code; the top and bottom portions now contain everything I have written so far.
from oauth2client.service_account import ServiceAccountCredentials
import json
import gspread
SHEET_ID = '18xad0TbNGMPh8gUSIsEr6wNsFzcpKGbyUIQ-A4GQ1bo'
SHEET_NAME = '2016'
gc = gspread.service_account('waynetennesseedems.json')
spreadsheet = gc.open_by_key(SHEET_ID)
worksheet = spreadsheet.worksheet(SHEET_NAME)
from gspread_dataframe import set_with_dataframe
set_with_dataframe(worksheet, df, include_column_header='False')
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
/var/folders/zc/x2w76_4121g3gzfxybkz2q480000gn/T/ipykernel_44678/2784595029.py in <module>
----> 1 set_with_dataframe(worksheet, df, include_column_header='False')
/opt/anaconda3/lib/python3.9/site-packages/gspread_dataframe.py in set_with_dataframe(worksheet, dataframe, row, col, include_index, include_column_header, resize, allow_formulas, string_escaping)
260 # If header-related params are True, the values are adjusted
261 # to allow space for the headers.
--> 262 y, x = dataframe.shape
263 index_col_size = 0
264 column_header_size = 0
AttributeError: 'list' object has no attribute 'shape'
Does it have to do with how my Data was pulled from my PDF?
The error says that df is a list, not a DataFrame. First, make sure you have installed the tabula-py module. Second, try passing the parameter output_format='dataframe' to the tabula.read_pdf() function, like so:
import pandas as pd
import json
import gspread
from tabula.io import read_pdf
from oauth2client.service_account import ServiceAccountCredentials
from gspread_dataframe import set_with_dataframe

file_path = 'TnPresidentbyCountyNov2016.pdf'
# Pass real booleans: the strings 'FALSE' and 'False' are non-empty, so Python treats them as true
df = read_pdf(file_path, output_format='dataframe', pages='all', multiple_tables=False, stream=True)
# print(df)

SHEET_ID = '18xad0TbNGMPh8gUSIsEr6wNsFzcpKGbyUIQ-A4GQ1bo'
SHEET_NAME = '2016'
gc = gspread.service_account('waynetennesseedems.json')
spreadsheet = gc.open_by_key(SHEET_ID)
worksheet = spreadsheet.worksheet(SHEET_NAME)
set_with_dataframe(worksheet, df, include_column_header=False)
I also suggest taking a look at the PEP 8 style guide to get a better idea of how to write a well-formatted script.
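If read_pdf still hands back a list (the printed output in the question starts with `[`, which suggests one DataFrame per extracted table), one option is to combine the pieces into a single DataFrame before calling set_with_dataframe. This is only a sketch: the two small frames stand in for the real tables, and the names tables and combined are made up for illustration.

```python
import pandas as pd

# Stand-ins for the per-table DataFrames returned by read_pdf
# (a few values copied from the printed output in the question)
tables = [
    pd.DataFrame([["Bedford", "11,486", "3,395"], ["Benton", "4,716", "1,474"]]),
    pd.DataFrame([["Lawrence", "12,420", "2,821"]]),
]

# Stack the tables into one DataFrame; leave the object alone if it already is one
if isinstance(tables, list):
    combined = pd.concat(tables, ignore_index=True)
else:
    combined = tables

print(combined.shape)  # a DataFrame has .shape, a list does not
```

With the real data, tables whose column counts differ would be padded with NaN by pd.concat, so it may be cleaner to upload each table separately or to pick just one with tables[0].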

fb prophet daily prediction does not give accurate result for missing values

My dataframe (df) contains two inputs, UnitShrtDescr and SchShrtDescr.
For a particular UnitShrtDescr and SchShrtDescr, the model must predict the next value. But my data contains lots of missing values (the output for the in-between dates is 0).
During prediction, Prophet predicts a value for every single day without treating the in-between dates as empty. How can I resolve this?
df  # main dataframe
UnitShrtDescr SchShrtDescr y ds id
8110 50 93 1 2011-12-01 243
3437 29 87 1 2011-12-21 133
6867 43 75 1 2011-12-23 204
1102 8 23 1 2011-12-28 36
5271 36 14 1 2011-12-28 166
... ... ... ... ... ...
13138 83 0 1 2018-05-18 390
14424 92 3 1 2018-05-18 432
11556 69 0 1 2018-05-18 334
11767 69 5 1 2018-05-18 338
4458 30 102 1 2018-05-18 141
15950 rows × 5 columns
code:
model = Prophet(daily_seasonality=True)
model.add_regressor("UnitShrtDescr")
model.add_regressor("SchShrtDescr")
model.fit(df)
The regressor values I want to predict for are UnitShrtDescr=40 and SchShrtDescr=93, so I built the future dataframe with make_future_dataframe:
future = model.make_future_dataframe(periods=100, include_history=False)
future["UnitShrtDescr"]=40
future["SchShrtDescr"]=93
The previous values for UnitShrtDescr=40 and SchShrtDescr=93 were:
dfx[(dfx['UnitShrtDescr']==40) & (dfx['SchShrtDescr']==93)].tail(10)
UnitShrtDescr SchShrtDescr y ds id
6293 40 93 1 2018-02-27 189
6294 40 93 3 2018-02-28 189
6295 40 93 1 2018-03-17 189
6296 40 93 1 2018-03-29 189
6297 40 93 1 2018-03-30 189
6298 40 93 4 2018-03-31 189
6299 40 93 1 2018-04-26 189
6300 40 93 1 2018-04-27 189
6301 40 93 4 2018-04-30 189
6302 40 93 1 2018-05-16 189
Please note that the gaps between dates are large, which means y is 0 for the dates in between.
So when I make a prediction, it should also predict 0 for the in-between dates.
Instead, it predicts y for every day without treating the in-between values as 0:
output = model.predict(future)
output[['ds','yhat']].head(10)
ds yhat
0 2018-05-19 2.959505
1 2018-05-20 2.631181
2 2018-05-21 2.418850
3 2018-05-22 2.411914
4 2018-05-23 2.386383
5 2018-05-24 2.444841
6 2018-05-25 2.409294
7 2018-05-26 2.937428
8 2018-05-27 2.588136
9 2018-05-28 2.358953
Please suggest changes or a better alternative for my case.
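As far as I understand, Prophet only sees the rows it is given, so dates that are absent from the training frame are treated as unobserved rather than as y = 0. One possible way to make the zeros explicit is to reindex each series to a full daily range before fitting. The sketch below uses pandas only and a few rows copied from the UnitShrtDescr=40 / SchShrtDescr=93 sample. The names obs, full_range and daily are illustrative, and whether zero-filling suits the real data is an assumption to check.

```python
import pandas as pd

# Last four observations of the UnitShrtDescr=40 / SchShrtDescr=93 series
obs = pd.DataFrame({
    "ds": pd.to_datetime(["2018-04-26", "2018-04-27", "2018-04-30", "2018-05-16"]),
    "y": [1, 1, 4, 1],
})

# Every calendar day between the first and last observation
full_range = pd.date_range(obs["ds"].min(), obs["ds"].max(), freq="D")

# Reindex to the full range and fill the missing days with 0
daily = (
    obs.set_index("ds")["y"]
       .reindex(full_range, fill_value=0)
       .rename_axis("ds")
       .reset_index()
)

print(daily.head())
```

The resulting daily frame has one row per day with columns ds and y, which is the shape model.fit expects. The regressor columns would need to be added back for each row in the same way.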

How to plot multiple charts on one figure and combine them with another?

import matplotlib.pyplot as plt

# Create an axes object
axes = plt.gca()
# Pass the axes object to the plot function
df.plot(kind='line', x='鄉鎮別', y='男', ax=axes, figsize=(10,8))
df.plot(kind='line', x='鄉鎮別', y='女', ax=axes, figsize=(10,8))
df.plot(kind='line', x='鄉鎮別', y='合計(男+女)', ax=axes, figsize=(10,8), title='hihii',
        xlabel='鄉鎮別', ylabel='人數')
This is my data.
鄉鎮別 鄰數 戶數 男 女 合計(男+女) 遷入 遷出 出生 死亡 結婚 離婚
0 苗栗市 715 32517 42956 43362 86318 212 458 33 65 28 13
1 苑裡鎮 362 15204 22979 21040 44019 118 154 17 24 9 7
2 通霄鎮 394 11557 17034 15178 32212 73 113 5 33 3 3
3 竹南鎮 518 32061 44069 43275 87344 410 392 31 59 35 11
4 頭份市 567 38231 52858 52089 104947 363 404 39 69 31 19
5 後龍鎮 367 12147 18244 16274 34518 93 144 12 41 2 7
6 卓蘭鎮 176 5861 8206 7504 15710 29 51 1 11 2 0
7 大湖鄉 180 5206 7142 6238 13380 31 59 5 21 3 2
8 公館鄉 281 10842 16486 15159 31645 89 169 12 32 5 3
9 銅鑼鄉 218 6106 8887 7890 16777 57 62 7 13 4 1
10 南庄鄉 184 3846 5066 4136 9202 22 48 1 10 0 2
11 頭屋鄉 120 3596 5289 4672 9961 59 53 2 11 4 4
12 三義鄉 161 5625 8097 7205 15302 47 63 3 12 3 5
13 西湖鄉 108 2617 3653 2866 6519 38 20 1 17 3 0
14 造橋鄉 115 4144 6276 5545 11821 44 64 3 11 3 2
15 三灣鄉 93 2331 3395 2832 6227 27 18 2 9 0 2
16 獅潭鄉 98 1723 2300 1851 4151 28 10 1 4 0 0
17 泰安鄉 64 1994 3085 2642 5727 36 26 2 8 4 1
18 總計 4721 195608 276022 259758 535780 1776 2308 177 450 139 82
This is the output of my df.plot call.
My first question is: how do I display the Chinese characters?
Second: can I plot the line chart without using df.plot?
Last question: I need four graphs (using subplot): line graphs of the male, female, and total population (男, 女, 合計(男+女)) in each township; line graphs of in-migration and out-migration (遷入 and 遷出); a bar graph of the number of households (戶數); and line graphs of births and deaths (出生 and 死亡).
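A rough sketch of the four-panel layout described above, using plain matplotlib calls instead of df.plot. It uses only the first three rows of the table. The font names in the list are assumptions: at least one CJK-capable font has to be installed for the Chinese labels to render, and the names will differ between systems.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Candidate fonts for Chinese text (assumption: adjust to a font installed on your system)
plt.rcParams["font.sans-serif"] = ["Microsoft JhengHei", "Noto Sans CJK TC", "DejaVu Sans"]
plt.rcParams["axes.unicode_minus"] = False

# First three rows of the table above
df = pd.DataFrame({
    "鄉鎮別": ["苗栗市", "苑裡鎮", "通霄鎮"],
    "戶數": [32517, 15204, 11557],
    "男": [42956, 22979, 17034],
    "女": [43362, 21040, 15178],
    "合計(男+女)": [86318, 44019, 32212],
    "遷入": [212, 118, 73],
    "遷出": [458, 154, 113],
    "出生": [33, 17, 5],
    "死亡": [65, 24, 33],
})

fig, axes = plt.subplots(2, 2, figsize=(12, 8))
x = df["鄉鎮別"]

# Top left: male, female and total population as lines
for col in ["男", "女", "合計(男+女)"]:
    axes[0, 0].plot(x, df[col], label=col)
# Top right: in-migration and out-migration as lines
for col in ["遷入", "遷出"]:
    axes[0, 1].plot(x, df[col], label=col)
# Bottom left: number of households as bars
axes[1, 0].bar(x, df["戶數"], label="戶數")
# Bottom right: births and deaths as lines
for col in ["出生", "死亡"]:
    axes[1, 1].plot(x, df[col], label=col)

for ax in axes.flat:
    ax.set_xlabel("鄉鎮別")
    ax.legend()

fig.tight_layout()
```

Call plt.show() or fig.savefig(...) afterwards to see the result. With the full table you would probably want to drop the 總計 (total) row first, since it dwarfs the individual townships.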

Filtering static/stationary areas

I am trying to filter my sensor data. My objective is to pick out the stretches where the data is more or less stationary over a period of time. Can anyone help me with this?
time sensor
1 121
2 115
3 122
4 123
5 116
6 117
7 113
8 116
9 113
10 114
11 115
12 112
13 116
14 129
15 123
16 125
17 130
18 120
19 121
20 122
This is sample data. I need to take the first reading and compare it with the next 20 seconds of data. If all 20 readings are within +/- 10 of it, I need to copy these 20 readings to another column, and then continue this filtering process.
Your question is not very clear, but as I understand it: within a 20-second window, if a sensor reading is within +10/-10 of the first reading, you want to append it to a new column, and readings outside that range should not be considered. I tried replicating your DataFrame; you could proceed like this:
import pandas as pd

# 23 seconds of data; the reading at second 23 is out of range because 144 - 121 > 10
data = {'time': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23],
        'sensor': [121, 115, 122, 123, 116, 117, 113, 116, 113, 114, 115, 112, 116, 129, 123, 125, 130, 120, 121, 122, 123, 124, 144]}
df_new = pd.DataFrame(data)
time sensor
0 1 121
1 2 115
2 3 122
3 4 123
4 5 116
5 6 117
6 7 113
7 8 116
8 9 113
9 10 114
10 11 115
11 12 112
12 13 116
13 14 129
14 15 123
15 16 125
16 17 130
17 18 120
18 19 121
19 20 122
20 21 123
21 22 124
22 23 144
result = []  # not named `list`, which would shadow the built-in
for i in range(len(df_new['sensor'])):
    # Use 20 here for your requirement; 23 is used only to include the out-of-range value 144
    if 0 <= df_new['time'][i] - df_new['time'][0] <= 23:
        if -10 < df_new['sensor'][0] - df_new['sensor'][i] < 10:
            result.append(df_new['sensor'][i])
        else:
            result.append('out of range')
    else:
        break
df_new['result'] = result
df_new
time sensor result
0 1 121 121
1 2 115 115
2 3 122 122
3 4 123 123
4 5 116 116
5 6 117 117
6 7 113 113
7 8 116 116
8 9 113 113
9 10 114 114
10 11 115 115
11 12 112 112
12 13 116 116
13 14 129 129
14 15 123 123
15 16 125 125
16 17 130 130
17 18 120 120
18 19 121 121
19 20 122 122
20 21 123 123
21 22 124 124
22 23 144 out of range
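The same comparison against the first reading can also be written without an explicit loop. The sketch below is an alternative, not the code above: it marks out-of-range readings as NaN instead of the string 'out of range', which keeps the column numeric, and it leaves out the time-window check. The names tolerance, first and in_range are illustrative.

```python
import pandas as pd

df_new = pd.DataFrame({
    "time": list(range(1, 24)),
    "sensor": [121, 115, 122, 123, 116, 117, 113, 116, 113, 114, 115, 112,
               116, 129, 123, 125, 130, 120, 121, 122, 123, 124, 144],
})

tolerance = 10
first = df_new["sensor"].iloc[0]

# True where the reading is strictly within +/- tolerance of the first reading
in_range = (df_new["sensor"] - first).abs() < tolerance

# Keep in-range readings, replace the rest with NaN
df_new["result"] = df_new["sensor"].where(in_range)

print(df_new.tail(3))
```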
No sample data was provided, so I generated some. The time filter could be any two datetimes; I've just picked certain hours. As an example of "stable", I selected values between the 45th and 55th percentiles.
import datetime as dt
import numpy as np
import pandas as pd

t = pd.date_range(dt.date(2021,1,10), dt.date(2021,1,11), freq="min")
df = pd.DataFrame({"time":t, "val":np.random.dirichlet(np.ones(len(t)),size=1)[0]})
# filter on hour and val. val between 45th and 55th percentile
df2 = df[df.time.dt.hour.between(3,4) & df.val.between(df.val.quantile(.45), df.val.quantile(.55))]
output
time val
2021-01-10 03:13:00 0.000499
2021-01-10 03:41:00 0.000512
2021-01-10 04:00:00 0.000541
2021-01-10 04:39:00 0.000413
rolling window
The question was updated to define stable as: the next window rows are all within +/- rng of the first one, with the output written to a new column.
Using this definition, rolling() with a custom function checks that all subsequent rows in the window are within the tolerance of the first observation in the window. Any observation that fails this test returns NaN. Note that the last rows also return NaN, because there are not enough remaining rows to run the test.
import pandas as pd
import io
import datetime as dt
import numpy as np

df = pd.read_csv(io.StringIO("""sensor
121
115
122
123
116
117
113
116
113
114
115
112
116
129
123
125
130
120
121
122"""))
df["time"] = pd.date_range(dt.date(2021,1,10), freq="s", periods=len(df))
# how many rows to compare
window = 5
# +/- range
rng = 10
# raw=True passes each window as a NumPy array, so one call works across pandas versions
# and the distutils version check (distutils was removed in Python 3.12) is not needed
df["stable"] = df["sensor"].rolling(window).apply(
    lambda x: x[0] if (np.abs(x - x[0]) <= rng).all() else np.nan, raw=True
).shift(-(window - 1))
output
sensor time stable
121 2021-01-10 00:00:00 121.0
115 2021-01-10 00:00:01 115.0
122 2021-01-10 00:00:02 122.0
123 2021-01-10 00:00:03 123.0
116 2021-01-10 00:00:04 116.0
117 2021-01-10 00:00:05 117.0
113 2021-01-10 00:00:06 113.0
116 2021-01-10 00:00:07 116.0
113 2021-01-10 00:00:08 113.0
114 2021-01-10 00:00:09 NaN
115 2021-01-10 00:00:10 NaN
112 2021-01-10 00:00:11 NaN
116 2021-01-10 00:00:12 NaN
129 2021-01-10 00:00:13 129.0
123 2021-01-10 00:00:14 123.0
125 2021-01-10 00:00:15 125.0
130 2021-01-10 00:00:16 NaN
120 2021-01-10 00:00:17 NaN
121 2021-01-10 00:00:18 NaN
122 2021-01-10 00:00:19 NaN

Issue combining columns in Dataframe?

I have the following dataframe:
Obj BIT BIT BIT GAS GAS GAS OIL OIL OIL
Date
2007-01-03 18 7 0 184 35 2 52 14 0
2007-01-09 43 3 0 249 35 2 68 11 1
2007-01-16 60 6 0 254 35 5 72 13 1
2007-01-23 69 11 1 255 43 2 81 6 0
2007-01-30 74 8 0 263 29 4 69 9 0
2007-02-06 78 6 1 259 34 2 79 6 0
2007-02-14 76 9 1 263 24 2 70 10 1
2007-02-20 85 7 0 241 20 6 72 4 0
2007-02-27 79 6 0 242 35 3 68 7 0
2007-03-06 68 14 0 225 26 2 57 10 1
How can I sum the 9 columns into 3 columns: "BIT", "GAS" and "OIL"?
This is the code for the dataframe which basically just gets me a cross section from a larger df I want:
ABrigsA = ndfAB.xs(['BIT','GAS','OIL'],axis=1)
Any suggestions?
Assuming that you want to sum similarly-named columns, you can use groupby on the column labels:
>>> df.groupby(level=0, axis='columns').sum()
Obj BIT GAS OIL
Date
2007-01-03 25 221 66
2007-01-09 46 286 80
2007-01-16 66 294 86
2007-01-23 81 300 87
2007-01-30 82 296 78
2007-02-06 85 295 85
2007-02-14 86 289 81
2007-02-20 92 267 76
2007-02-27 85 280 75
2007-03-06 82 253 68
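Newer pandas releases warn that the axis argument of groupby is deprecated. A possible equivalent, sketched here on the first two rows of the table, is to transpose, group on the (now row) labels, and transpose back. The name summed is illustrative.

```python
import pandas as pd

# First two rows of the table above, with the repeated column labels
df = pd.DataFrame(
    [[18, 7, 0, 184, 35, 2, 52, 14, 0],
     [43, 3, 0, 249, 35, 2, 68, 11, 1]],
    index=pd.to_datetime(["2007-01-03", "2007-01-09"]),
    columns=["BIT"] * 3 + ["GAS"] * 3 + ["OIL"] * 3,
)

# Transpose so the labels become the index, sum per label, transpose back
summed = df.T.groupby(level=0).sum().T

print(summed)
```

The result has one BIT, one GAS and one OIL column per date, matching the first two rows of the output above.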
