Dash DataTable loses first column when converting from a Pandas DataFrame - python

I'm looking to use Dash to make a DataFrame I created interactive and give it a clean-looking format. It only needs to be a table with the external stylesheet included; I'll tweak the styles once I can get the code to run correctly.
When I print the DataFrame, it comes out OK, as seen below, except that the first column has no header.
R HR RBI SB AVG ... QS SV+H K ERA WHIP
Democracy . 186.0 45.0 164.0 32.0 0.261 ... 18.0 15.0 244.0 2.17 1.05
Wassup Pham 181.0 55.0 198.0 20.0 0.263 ... 12.0 34.0 226.0 2.52 0.99
Myrtle Bea. 180.0 50.0 153.0 9.0 0.262 ... 17.0 21.0 236.0 3.33 1.13
The Rotter. 176.0 46.0 183.0 21.0 0.270 ... 25.0 13.0 275.0 2.41 0.85
Scranton S. 172.0 56.0 164.0 15.0 0.272 ... 24.0 18.0 265.0 2.45 1.01
New York N. 164.0 56.0 203.0 13.0 0.287 ... 28.0 0.0 297.0 2.84 1.05
Springfiel. 156.0 39.0 154.0 15.0 0.251 ... 11.0 21.0 236.0 3.65 1.18
Collective. 151.0 38.0 150.0 33.0 0.283 ... 10.0 25.0 214.0 2.41 1.05
Cron Job 146.0 33.0 145.0 20.0 0.244 ... 14.0 22.0 237.0 2.79 1.01
Patrick's . 142.0 37.0 162.0 19.0 0.252 ... 9.0 24.0 253.0 2.92 1.01
I'm thinking it's possible that the lack of a column header is causing the entire column to be lost when converting to a Dash DataTable, but I'm not sure what to do to fix it.
Here's my code, from the printing of the DataFrame, to the Dash app creation and layout, to running the code locally.
# imports used by this snippet
from dash import Dash, html, dash_table
import dash_bootstrap_components as dbc

print(statsdf_transposed)
######################
app = Dash(__name__, external_stylesheets=[dbc.themes.LUX])

app.layout = html.Div([
    html.H4('The Show - Season Stats'),
    dash_table.DataTable(
        id='stats_table',
        columns=[{"name": i, "id": i}
                 for i in statsdf_transposed.columns],
        data=statsdf_transposed.to_dict('records'),
    )
])

if __name__ == '__main__':
    app.run_server(debug=True)
Thank you in advance for any help this community could offer!
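For context on the suspected cause: to_dict('records') only exports a DataFrame's columns, so when the team names live in the unnamed index they are dropped entirely. A minimal sketch of one possible fix (the 'Team' column name is a placeholder, not from the original code) is to promote the index to a real column before building the table:
# promote the unnamed index (the team names) to a regular column;
# 'Team' is a placeholder name used here for illustration
table_df = statsdf_transposed.reset_index().rename(columns={'index': 'Team'})
dash_table.DataTable(
    id='stats_table',
    columns=[{"name": i, "id": i} for i in table_df.columns],
    data=table_df.to_dict('records'),
)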

Related

In a Multi-index DataFrame, how to set new values to a subset of columns?

The following code builds a sample DataFrame. How can I bulk assign/modify all the Temp values (say, convert them from °C to °F)?
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['Day', 'Night'], ['HR', 'Temp']])
# mock some data
data = np.round(np.random.randn(4, 12), 1)
data[:, ::2] *= 10
data += 37
# create the DataFrame
hd = pd.DataFrame(data, index=index, columns=columns)
print(hd, '\n')
Bob Guido Sue
Day Night Day Night Day Night
HR Temp HR Temp HR Temp HR Temp HR Temp HR Temp
year visit
2013 1 38.0 37.5 21.0 36.6 33.0 37.4 38.0 35.8 15.0 38.5 27.0 37.0
2 47.0 36.5 37.0 36.3 31.0 38.8 37.0 38.4 62.0 34.9 45.0 35.6
2014 1 51.0 35.5 41.0 35.9 26.0 36.7 33.0 36.3 18.0 34.6 39.0 38.0
2 46.0 37.6 29.0 37.3 42.0 37.0 31.0 37.0 47.0 37.3 30.0 36.0
filter the columns for Temp, modify, and reassign the values back:
hd
Bob Guido Sue
Day Night Day Night Day Night
HR Temp HR Temp HR Temp HR Temp HR Temp HR Temp
year visit
2013 1 35.0 38.2 45.0 38.1 38.0 36.7 43.0 40.0 38.0 37.5 35.0 37.8
2 31.0 36.7 30.0 37.7 43.0 36.6 38.0 37.5 33.0 38.2 42.0 35.8
2014 1 32.0 37.7 39.0 35.0 37.0 37.7 51.0 37.5 28.0 39.6 43.0 37.8
2 59.0 37.5 28.0 34.6 60.0 38.0 38.0 36.7 63.0 37.9 25.0 37.2
modified = hd.loc(axis=1)[:,:, 'Temp'].mul(1.8).add(32)
hd.update(modified)
hd
Bob Guido Sue
Day Night Day Night Day Night
HR Temp HR Temp HR Temp HR Temp HR Temp HR Temp
year visit
2013 1 35.0 100.76 45.0 100.58 38.0 98.06 43.0 104.00 38.0 99.50 35.0 100.04
2 31.0 98.06 30.0 99.86 43.0 97.88 38.0 99.50 33.0 100.76 42.0 96.44
2014 1 32.0 99.86 39.0 95.00 37.0 99.86 51.0 99.50 28.0 103.28 43.0 100.04
2 59.0 99.50 28.0 94.28 60.0 100.40 38.0 98.06 63.0 100.22 25.0 98.96
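As an aside, a minimal alternative sketch (assuming hd still holds the original Celsius values built above) that writes the converted temperatures straight back without the intermediate update() step, using pd.IndexSlice:
import pandas as pd

idx = pd.IndexSlice
# select every column whose third level is 'Temp' and assign the converted values back in place
hd.loc[:, idx[:, :, 'Temp']] = hd.loc[:, idx[:, :, 'Temp']] * 1.8 + 32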

How to fill missing Rows in DataFrame with values in specific range?

I am missing some rows in a DataFrame; date_number should span the range from 0 to 50. I would like to create the missing rows, in this case those from 0 to 13 and from 29 to 50, and fill every missing row with the following default values:
date_number tag_id price weekday brand_id sales stock profit
XX.X 665237982.0 12. Y.Y 2123.0 0.0 0.0 0.00
date_number tag_id price weekday brand_id sales stock profit
14.0 665237982.0 12.95 0.0 2123.0 0.0 128.0 0.00
15.0 665237982.0 12.95 1.0 2123.0 9.0 106.0 116.55
16.0 665237982.0 12.95 2.0 2123.0 29.0 137.0 375.55
17.0 665237982.0 12.95 3.0 2123.0 24.0 88.0 310.80
18.0 665237982.0 12.95 4.0 2123.0 27.0 35.0 349.65
19.0 665237982.0 12.95 5.0 2123.0 2.0 2.0 25.90
21.0 665237982.0 12.95 0.0 2123.0 14.0 312.0 181.30
22.0 665237982.0 12.95 1.0 2123.0 12.0 455.0 155.40
23.0 665237982.0 12.95 2.0 2123.0 12.0 450.0 155.40
24.0 665237982.0 12.95 3.0 2123.0 8.0 450.0 103.60
25.0 665237982.0 12.95 4.0 2123.0 11.0 427.0 142.45
26.0 665237982.0 12.95 5.0 2123.0 9.0 401.0 116.55
27.0 665237982.0 12.95 6.0 2123.0 19.0 377.0 246.05
28.0 665237982.0 12.95 0.0 2123.0 12.0 343.0 155.40
29.0 665237982.0 12.95 1.0 2123.0 9.0 314.0 116.55
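A minimal sketch of one common approach (assuming the frame shown above is called df and date_number should run over the full 0-50 range) is to reindex on date_number and then fill the newly created rows:
import pandas as pd

# assumption: df holds the rows shown above
df['date_number'] = df['date_number'].astype(int)
filled = df.set_index('date_number').reindex(range(51))   # full 0..50 range

# constant columns are propagated from the existing rows;
# activity columns default to zero on the filled-in days
filled[['tag_id', 'price', 'brand_id']] = filled[['tag_id', 'price', 'brand_id']].ffill().bfill()
filled[['sales', 'stock', 'profit']] = filled[['sales', 'stock', 'profit']].fillna(0.0)
filled = filled.reset_index()
# the weekday column is left untouched here and would need to be derived separately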

Python Pandas: Denormalize data from one data frame into another

I have a Pandas data frame which you might describe as “normalized”. For display purposes, I want to “de-normalize” the data: that is, I want to take data that is spread across multiple key values and put it on the same row in the output records. Some records need to be summed as they are combined. (Aside: if anyone has a better term for this than “denormalization”, please make an edit to this question, or say so in the comments.)
I am working with a pandas data frame with many columns, so I will show you a simplified version below.
The following code sets up a (nearly) normalized source data frame. (Note that I am looking for advice on the second code block, and this code block is just to provide some context.) Similar to my actual data, there are some duplications in the identifying data, and some numbers to be summed:
import numpy as np
import pandas as pd

dates = pd.date_range('20170701', periods=21)
datesA1 = pd.date_range('20170701', periods=11)
datesB1 = pd.date_range('20170705', periods=9)
datesA2 = pd.date_range('20170708', periods=10)
datesB2 = pd.date_range('20170710', periods=11)
datesC1 = pd.date_range('20170701', periods=5)
datesC2 = pd.date_range('20170709', periods=9)
cols = ['Date', 'Type', 'Count']
df_A1 = pd.DataFrame({'Date': datesA1,
                      'Type': 'Apples',
                      'Count': np.random.randint(30, size=11)})
df_A2 = pd.DataFrame({'Date': datesA2,
                      'Type': 'Apples',
                      'Count': np.random.randint(30, size=10)})
df_B1 = pd.DataFrame({'Date': datesB1,
                      'Type': 'Berries',
                      'Count': np.random.randint(30, size=9)})
df_B2 = pd.DataFrame({'Date': datesB2,
                      'Type': 'Berries',
                      'Count': np.random.randint(30, size=11)})
df_C1 = pd.DataFrame({'Date': datesC1,
                      'Type': 'Canteloupes',
                      'Count': np.random.randint(30, size=5)})
df_C2 = pd.DataFrame({'Date': datesC2,
                      'Type': 'Canteloupes',
                      'Count': np.random.randint(30, size=9)})
frames = [df_A1, df_A2, df_B1, df_B2, df_C1, df_C2]
dat_fra_source = pd.concat(frames)
Further, the following code achieves my intention. The source data frame has multiple rows per date and type of fruit (A, B, and C). The destination data has a single row per day, with a sum of A, B, and C.
dat_fra_dest = pd.DataFrame(0, index=dates, columns=['Apples', 'Berries', 'Canteloupes'])
for index, row in dat_fra_source.iterrows():
    dat_fra_dest.at[row['Date'], row['Type']] += row['Count']
My question is whether there is a cleaner way to do this: a way that doesn't require the zero-initialization and/or a way that operates on the entire data frame instead of line-by-line. I am also skeptical that I have an efficient implementation. I'll also note that while I am only dealing with “Count” in the simplified example, I have additional columns in my real-world example. Assume that for A, B, and C there is not only a count, but also a weight and a volume.
Option 1
dat_fra_source.groupby(['Date','Type']).sum().unstack().fillna(0)
Out[63]:
Count
Type Apples Berries Canteloupes
Date
2017-07-01 13.0 0.0 24.0
2017-07-02 18.0 0.0 16.0
2017-07-03 11.0 0.0 29.0
2017-07-04 13.0 0.0 7.0
2017-07-05 24.0 11.0 23.0
2017-07-06 6.0 4.0 0.0
2017-07-07 29.0 26.0 0.0
2017-07-08 31.0 19.0 0.0
2017-07-09 38.0 17.0 26.0
2017-07-10 57.0 54.0 1.0
2017-07-11 4.0 41.0 10.0
2017-07-12 16.0 28.0 23.0
2017-07-13 25.0 20.0 20.0
2017-07-14 19.0 6.0 15.0
2017-07-15 6.0 22.0 7.0
2017-07-16 16.0 0.0 5.0
2017-07-17 29.0 7.0 4.0
2017-07-18 0.0 21.0 0.0
2017-07-19 0.0 19.0 0.0
2017-07-20 0.0 8.0 0.0
Option 2
pd.pivot_table(dat_fra_source,index=['Date'],columns=['Type'],values='Count',aggfunc=sum).fillna(0)
Out[75]:
Type Apples Berries Canteloupes
Date
2017-07-01 13.0 0.0 24.0
2017-07-02 18.0 0.0 16.0
2017-07-03 11.0 0.0 29.0
2017-07-04 13.0 0.0 7.0
2017-07-05 24.0 11.0 23.0
2017-07-06 6.0 4.0 0.0
2017-07-07 29.0 26.0 0.0
2017-07-08 31.0 19.0 0.0
2017-07-09 38.0 17.0 26.0
2017-07-10 57.0 54.0 1.0
2017-07-11 4.0 41.0 10.0
2017-07-12 16.0 28.0 23.0
2017-07-13 25.0 20.0 20.0
2017-07-14 19.0 6.0 15.0
2017-07-15 6.0 22.0 7.0
2017-07-16 16.0 0.0 5.0
2017-07-17 29.0 7.0 4.0
2017-07-18 0.0 21.0 0.0
2017-07-19 0.0 19.0 0.0
2017-07-20 0.0 8.0 0.0
And assuming you have columns vol and weight
dat_fra_source['vol']=2
dat_fra_source['weight']=2
dat_fra_source.groupby(['Date','Type']).apply(lambda x: sum(x['vol']*x['weight']*x['Count'])).unstack().fillna(0)
Out[88]:
Type Apples Berries Canteloupes
Date
2017-07-01 52.0 0.0 96.0
2017-07-02 72.0 0.0 64.0
2017-07-03 44.0 0.0 116.0
2017-07-04 52.0 0.0 28.0
2017-07-05 96.0 44.0 92.0
2017-07-06 24.0 16.0 0.0
2017-07-07 116.0 104.0 0.0
2017-07-08 124.0 76.0 0.0
2017-07-09 152.0 68.0 104.0
2017-07-10 228.0 216.0 4.0
2017-07-11 16.0 164.0 40.0
2017-07-12 64.0 112.0 92.0
2017-07-13 100.0 80.0 80.0
2017-07-14 76.0 24.0 60.0
2017-07-15 24.0 88.0 28.0
2017-07-16 64.0 0.0 20.0
2017-07-17 116.0 28.0 16.0
2017-07-18 0.0 84.0 0.0
2017-07-19 0.0 76.0 0.0
2017-07-20 0.0 32.0 0.0
Use pd.crosstab:
pd.crosstab(dat_fra_source['Date'],
            dat_fra_source['Type'],
            dat_fra_source['Count'],
            aggfunc='sum',
            dropna=False).fillna(0)
Output:
Type Apples Berries Canteloupes
Date
2017-07-01 19.0 0.0 4.0
2017-07-02 25.0 0.0 4.0
2017-07-03 11.0 0.0 26.0
2017-07-04 27.0 0.0 8.0
2017-07-05 8.0 18.0 12.0
2017-07-06 10.0 11.0 0.0
2017-07-07 6.0 17.0 0.0
2017-07-08 10.0 5.0 0.0
2017-07-09 51.0 25.0 16.0
2017-07-10 31.0 23.0 21.0
2017-07-11 35.0 40.0 10.0
2017-07-12 16.0 30.0 9.0
2017-07-13 13.0 23.0 20.0
2017-07-14 21.0 26.0 27.0
2017-07-15 20.0 17.0 19.0
2017-07-16 12.0 4.0 2.0
2017-07-17 27.0 0.0 5.0
2017-07-18 0.0 5.0 0.0
2017-07-19 0.0 26.0 0.0
2017-07-20 0.0 6.0 0.0
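Since the question mentions extra per-fruit columns, here is a small sketch (assuming they are named vol and weight, as in the snippet above) of how pivot_table can aggregate several value columns at once instead of using a per-group lambda. Note that this sums each column independently; a product-style metric like the one computed with the lambda would still need its own derived column first:
pd.pivot_table(dat_fra_source,
               index='Date',
               columns='Type',
               values=['Count', 'vol', 'weight'],
               aggfunc='sum',
               fill_value=0)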

Merging two Pandas series with duplicate datetime indices

I have two Pandas series (d1 and d2) indexed by datetime, each holding float data with some NaN values. Both indices are at one-day intervals, although the time entries are inconsistent, with many periods of missing days. d1 ranges from 1974-12-16 to 2002-01-30. d2 ranges from 1997-12-19 to 2017-07-06. The period from 1997-12-19 to 2002-01-30 contains many duplicate indices between the two series. The data for duplicated indices is sometimes the same value, different values, or one value and NaN.
I would like to combine these two series into one, prioritizing the data from d2 anytime there are duplicate indices (that is, replace all d1 data with d2 data anytime there is a duplicated index). What is the most efficient way to do this among the many Pandas tools available (merge, join, concatenate etc.)?
Here is an example of my data:
In [7]: print d1
fldDate
1974-12-16 19.0
1974-12-17 28.0
1974-12-18 24.0
1974-12-19 18.0
1974-12-20 17.0
1974-12-21 28.0
1974-12-22 28.0
1974-12-23 10.0
1974-12-24 6.0
1974-12-25 5.0
1974-12-26 12.0
1974-12-27 19.0
1974-12-28 22.0
1974-12-29 20.0
1974-12-30 16.0
1974-12-31 12.0
1975-01-01 12.0
1975-01-02 15.0
1975-01-03 14.0
1975-01-04 15.0
1975-01-05 18.0
1975-01-06 21.0
1975-01-07 22.0
1975-01-08 18.0
1975-01-09 20.0
1975-01-10 12.0
1975-01-11 8.0
1975-01-12 -2.0
1975-01-13 13.0
1975-01-14 24.0
...
2002-01-01 18.0
2002-01-02 16.0
2002-01-03 NaN
2002-01-04 24.0
2002-01-05 23.0
2002-01-06 15.0
2002-01-07 22.0
2002-01-08 34.0
2002-01-09 35.0
2002-01-10 29.0
2002-01-11 21.0
2002-01-12 24.0
2002-01-13 NaN
2002-01-14 18.0
2002-01-15 14.0
2002-01-16 10.0
2002-01-17 5.0
2002-01-18 7.0
2002-01-19 7.0
2002-01-20 7.0
2002-01-21 11.0
2002-01-22 NaN
2002-01-23 9.0
2002-01-24 8.0
2002-01-25 15.0
2002-01-26 NaN
2002-01-27 NaN
2002-01-28 18.0
2002-01-29 13.0
2002-01-30 13.0
Name: MaxTempMid, dtype: float64
In [8]: print d2
fldDate
1997-12-19 22.0
1997-12-20 14.0
1997-12-21 18.0
1997-12-22 16.0
1997-12-23 16.0
1997-12-24 10.0
1997-12-25 12.0
1997-12-26 12.0
1997-12-27 9.0
1997-12-28 12.0
1997-12-29 18.0
1997-12-30 23.0
1997-12-31 28.0
1998-01-01 26.0
1998-01-02 29.0
1998-01-03 27.0
1998-01-04 22.0
1998-01-05 19.0
1998-01-06 17.0
1998-01-07 14.0
1998-01-08 14.0
1998-01-09 14.0
1998-01-10 16.0
1998-01-11 20.0
1998-01-12 21.0
1998-01-13 19.0
1998-01-14 20.0
1998-01-15 16.0
1998-01-16 17.0
1998-01-17 20.0
...
2017-06-07 68.0
2017-06-08 71.0
2017-06-09 71.0
2017-06-10 59.0
2017-06-11 41.0
2017-06-12 57.0
2017-06-13 58.0
2017-06-14 36.0
2017-06-15 50.0
2017-06-16 58.0
2017-06-17 54.0
2017-06-18 53.0
2017-06-19 58.0
2017-06-20 68.0
2017-06-21 71.0
2017-06-22 71.0
2017-06-23 59.0
2017-06-24 61.0
2017-06-25 65.0
2017-06-26 68.0
2017-06-27 71.0
2017-06-28 60.0
2017-06-29 54.0
2017-06-30 48.0
2017-07-01 60.0
2017-07-02 68.0
2017-07-03 65.0
2017-07-04 73.0
2017-07-05 74.0
2017-07-06 77.0
Name: MaxTempMid, dtype: float64
Let's use combine_first:
d2.combine_first(d1)
Output:
fldDate
1974-12-16 19.0
1974-12-17 28.0
1974-12-18 24.0
1974-12-19 18.0
1974-12-20 17.0
1974-12-21 28.0
1974-12-22 28.0
1974-12-23 10.0
1974-12-24 6.0
1974-12-25 5.0
1974-12-26 12.0
1974-12-27 19.0
1974-12-28 22.0
1974-12-29 20.0
1974-12-30 16.0
1974-12-31 12.0
1975-01-01 12.0
1975-01-02 15.0
1975-01-03 14.0
1975-01-04 15.0
1975-01-05 18.0
1975-01-06 21.0
1975-01-07 22.0
1975-01-08 18.0
1975-01-09 20.0
1975-01-10 12.0
1975-01-11 8.0
1975-01-12 -2.0
1975-01-13 13.0
1975-01-14 24.0
...
2017-06-07 68.0
2017-06-08 71.0
2017-06-09 71.0
2017-06-10 59.0
2017-06-11 41.0
2017-06-12 57.0
2017-06-13 58.0
2017-06-14 36.0
2017-06-15 50.0
2017-06-16 58.0
2017-06-17 54.0
2017-06-18 53.0
2017-06-19 58.0
2017-06-20 68.0
2017-06-21 71.0
2017-06-22 71.0
2017-06-23 59.0
2017-06-24 61.0
2017-06-25 65.0
2017-06-26 68.0
2017-06-27 71.0
2017-06-28 60.0
2017-06-29 54.0
2017-06-30 48.0
2017-07-01 60.0
2017-07-02 68.0
2017-07-03 65.0
2017-07-04 73.0
2017-07-05 74.0
2017-07-06 77.0
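For intuition, a tiny self-contained sketch (toy data, not the series from the question) of the priority rule combine_first applies:
import pandas as pd

d1 = pd.Series([1.0, 2.0, 3.0], index=pd.to_datetime(['2002-01-01', '2002-01-02', '2002-01-03']))
d2 = pd.Series([9.0, float('nan')], index=pd.to_datetime(['2002-01-02', '2002-01-03']))

print(d2.combine_first(d1))
# 2002-01-01    1.0   (only in d1)
# 2002-01-02    9.0   (overlap: d2 wins)
# 2002-01-03    3.0   (d2 is NaN here, so d1 fills in)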

Transposing dataframe column, creating different rows per day

I have a dataframe that has one column and a timestamp index spanning anywhere from 2 to 7 days:
kWh
Timestamp
2017-07-08 06:00:00 0.00
2017-07-08 07:00:00 752.75
2017-07-08 08:00:00 1390.20
2017-07-08 09:00:00 2027.65
2017-07-08 10:00:00 2447.27
.... ....
2017-07-12 20:00:00 167.64
2017-07-12 21:00:00 0.00
2017-07-12 22:00:00 0.00
2017-07-12 23:00:00 0.00
I would like to transpose the kWh column so that one day's worth of values (hourly granularity, so 24 values/day) fill up a row. And the next row is the next day of values and so on (so five days of forecasted data has five rows with 24 elements each).
Because my query of the data comes back in the vertical format, and my regression and subsequent analysis already operate on the vertical format, I don't want to change the process too much and am hoping there is a simpler way. I have tried building a MultiIndex with df.index.hour and then using unstack(), but I get a huge dataframe with NaN values everywhere.
Is there an elegant way to do this?
If we start from a frame like
In [25]: df = pd.DataFrame({"kWh": [1]}, index=pd.date_range("2017-07-08",
    ...:     "2017-07-12", freq="1H").rename("Timestamp")).cumsum()
In [26]: df.head()
Out[26]:
kWh
Timestamp
2017-07-08 00:00:00 1
2017-07-08 01:00:00 2
2017-07-08 02:00:00 3
2017-07-08 03:00:00 4
2017-07-08 04:00:00 5
we can make date and hour columns and then pivot:
In [27]: df["date"] = df.index.date
In [28]: df["hour"] = df.index.hour
In [29]: df.pivot(index="date", columns="hour", values="kWh")
Out[29]:
hour 0 1 2 3 4 5 6 7 8 9 ... \
date ...
2017-07-08 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0 ...
2017-07-09 25.0 26.0 27.0 28.0 29.0 30.0 31.0 32.0 33.0 34.0 ...
2017-07-10 49.0 50.0 51.0 52.0 53.0 54.0 55.0 56.0 57.0 58.0 ...
2017-07-11 73.0 74.0 75.0 76.0 77.0 78.0 79.0 80.0 81.0 82.0 ...
2017-07-12 97.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
hour 14 15 16 17 18 19 20 21 22 23
date
2017-07-08 15.0 16.0 17.0 18.0 19.0 20.0 21.0 22.0 23.0 24.0
2017-07-09 39.0 40.0 41.0 42.0 43.0 44.0 45.0 46.0 47.0 48.0
2017-07-10 63.0 64.0 65.0 66.0 67.0 68.0 69.0 70.0 71.0 72.0
2017-07-11 87.0 88.0 89.0 90.0 91.0 92.0 93.0 94.0 95.0 96.0
2017-07-12 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
[5 rows x 24 columns]
Not sure why your MultiIndex code doesn't work.
I'm assuming your MultiIndex code is something along these lines, which gives the same output as the pivot:
In []
df = pd.DataFrame({"kWh": [1]}, index=pd.date_range("2017-07-08",
                  "2017-07-12", freq="1H").rename("Timestamp")).cumsum()
df.index = pd.MultiIndex.from_arrays([df.index.date, df.index.hour], names=['Date','Hour'])
df.unstack()
Out[]:
kWh ... \
Hour 0 1 2 3 4 5 6 7 8 9 ...
Date ...
2017-07-08 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0 ...
2017-07-09 25.0 26.0 27.0 28.0 29.0 30.0 31.0 32.0 33.0 34.0 ...
2017-07-10 49.0 50.0 51.0 52.0 53.0 54.0 55.0 56.0 57.0 58.0 ...
2017-07-11 73.0 74.0 75.0 76.0 77.0 78.0 79.0 80.0 81.0 82.0 ...
2017-07-12 97.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
Hour 14 15 16 17 18 19 20 21 22 23
Date
2017-07-08 15.0 16.0 17.0 18.0 19.0 20.0 21.0 22.0 23.0 24.0
2017-07-09 39.0 40.0 41.0 42.0 43.0 44.0 45.0 46.0 47.0 48.0
2017-07-10 63.0 64.0 65.0 66.0 67.0 68.0 69.0 70.0 71.0 72.0
2017-07-11 87.0 88.0 89.0 90.0 91.0 92.0 93.0 94.0 95.0 96.0
2017-07-12 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
[5 rows x 24 columns]
