Pandas converting timestamp and monthly summary - python

I have several .csv files which I am importing via Pandas, and I then work out a summary of the data (min, max, mean), ideally as weekly and monthly reports. I have the following code, but I just cannot seem to get the monthly summary to work; I am sure the problem is with the timestamp conversion.
What am I doing wrong?
import pandas as pd
import numpy as np
# Format of the data that is being imported
#2017-05-11 18:29:14+00:00,264.0,987.99,26.5,23.70,512.0,11.763,52.31
df = pd.read_csv('data.csv')
df['timestamp'] = pd.to_datetime(df['time'], format='%Y-%m-%d %H:%M:%S')
print 'month info'
print [g for n, g in df.groupby(pd.Grouper(key='timestamp',freq='M'))]
print(df.groupby('timestamp')['light'].mean())

IIUC, you almost have it, and your datetime conversion is fine. Here is an example:
Starting from a dataframe like this (which is your example row, duplicated with slight modifications):
>>> df
time x y z a b c d
0 2017-05-11 18:29:14+00:00 264.0 947.99 24.5 53.7 511.0 11.463 12.31
1 2017-05-15 18:29:14+00:00 265.0 957.99 25.5 43.7 512.0 11.563 22.31
2 2017-05-21 18:29:14+00:00 266.0 967.99 26.5 33.7 513.0 11.663 32.31
3 2017-06-11 18:29:14+00:00 267.0 977.99 26.5 23.7 514.0 11.763 42.31
4 2017-06-22 18:29:14+00:00 268.0 997.99 27.5 13.7 515.0 11.800 52.31
You can do what you did before with your datetime:
df['timestamp'] = pd.to_datetime(df['time'], format='%Y-%m-%d %H:%M:%S')
And then get your summaries either separately:
monthly_mean = df.groupby(pd.Grouper(key='timestamp',freq='M')).mean()
monthly_max = df.groupby(pd.Grouper(key='timestamp',freq='M')).max()
monthly_min = df.groupby(pd.Grouper(key='timestamp',freq='M')).min()
weekly_mean = df.groupby(pd.Grouper(key='timestamp',freq='W')).mean()
weekly_min = df.groupby(pd.Grouper(key='timestamp',freq='W')).min()
weekly_max = df.groupby(pd.Grouper(key='timestamp',freq='W')).max()
# Examples:
>>> monthly_mean
x y z a b c d
timestamp
2017-05-31 265.0 957.99 25.5 43.7 512.0 11.5630 22.31
2017-06-30 267.5 987.99 27.0 18.7 514.5 11.7815 47.31
>>> weekly_mean
x y z a b c d
timestamp
2017-05-14 264.0 947.99 24.5 53.7 511.0 11.463 12.31
2017-05-21 265.5 962.99 26.0 38.7 512.5 11.613 27.31
2017-05-28 NaN NaN NaN NaN NaN NaN NaN
2017-06-04 NaN NaN NaN NaN NaN NaN NaN
2017-06-11 267.0 977.99 26.5 23.7 514.0 11.763 42.31
2017-06-18 NaN NaN NaN NaN NaN NaN NaN
2017-06-25 268.0 997.99 27.5 13.7 515.0 11.800 52.31
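(The all-NaN rows are simply weeks with no samples; if you don't want them in a report, you can drop them with weekly_mean.dropna(how='all').)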
Or aggregate them all together to get a multi-indexed dataframe with your summaries:
monthly_summary = df.groupby(pd.Grouper(key='timestamp',freq='M')).agg(['mean', 'min', 'max'])
weekly_summary = df.groupby(pd.Grouper(key='timestamp',freq='W')).agg(['mean', 'min', 'max'])
# Example of summary of column 'x':
>>> monthly_summary['x']
mean min max
timestamp
2017-05-31 265.0 264.0 266.0
2017-06-30 267.5 267.0 268.0
>>> weekly_summary['x']
mean min max
timestamp
2017-05-14 264.0 264.0 264.0
2017-05-21 265.5 265.0 266.0
2017-05-28 NaN NaN NaN
2017-06-04 NaN NaN NaN
2017-06-11 267.0 267.0 267.0
2017-06-18 NaN NaN NaN
2017-06-25 268.0 268.0 268.0
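A side note on your last line: df.groupby('timestamp') groups on the raw timestamps, so you get one group per unique second rather than per month. An equivalent spelling of the summaries above (a sketch, assuming the same df) is to make the timestamp the index and use resample:
monthly_summary = df.set_index('timestamp').resample('M').agg(['mean', 'min', 'max'])
weekly_summary = df.set_index('timestamp').resample('W').agg(['mean', 'min', 'max'])
# e.g. for the monthly mean of a single column such as 'light' in your file:
# df.set_index('timestamp')['light'].resample('M').mean()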

ValueError: Invalid fill method. Expecting pad (ffill) or backfill (bfill). Got nearest

I have this df:
Week U.S. 30 yr FRM U.S. 15 yr FRM
0 2014-12-31 3.87 3.15
1 2015-01-01 NaN NaN
2 2015-01-02 NaN NaN
3 2015-01-03 NaN NaN
4 2015-01-04 NaN NaN
... ... ... ...
2769 2022-07-31 NaN NaN
2770 2022-08-01 NaN NaN
2771 2022-08-02 NaN NaN
2772 2022-08-03 NaN NaN
2773 2022-08-04 4.99 4.26
And when I try to run this interpolation:
pmms_df.interpolate(method = 'nearest', inplace = True)
I get ValueError: Invalid fill method. Expecting pad (ffill) or backfill (bfill). Got nearest
I read in this post that pandas interpolate doesn't do well with time columns, so I tried this:
pmms_df[['U.S. 30 yr FRM', 'U.S. 15 yr FRM']].interpolate(method = 'nearest', inplace = True)
but the output is exactly the same as before the interpolation.
It may not work great with date columns, but it works well with a datetime index, which is probably what you should be using here:
df = df.set_index('Week')
df = df.interpolate(method='nearest')
print(df)
# Output:
U.S. 30 yr FRM U.S. 15 yr FRM
Week
2014-12-31 3.87 3.15
2015-01-01 3.87 3.15
2015-01-02 3.87 3.15
2015-01-03 3.87 3.15
2015-01-04 3.87 3.15
... ... ...
2022-07-31 4.99 4.26
2022-08-01 4.99 4.26
2022-08-02 4.99 4.26
2022-08-03 4.99 4.26
2022-08-04 4.99 4.26
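As for your second attempt: pmms_df[['U.S. 30 yr FRM', 'U.S. 15 yr FRM']] returns a copy, so interpolating it with inplace=True never touches the original dataframe. If you want to keep Week as a regular column, a minimal sketch (note that method='nearest' also needs scipy installed):
cols = ['U.S. 30 yr FRM', 'U.S. 15 yr FRM']
pmms_df = pmms_df.set_index('Week')
pmms_df[cols] = pmms_df[cols].interpolate(method='nearest')
pmms_df = pmms_df.reset_index()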

How to align yticklabels when combining a barplot with heatmap

I have a similar problem to this question: I am trying to combine three plots in Seaborn, but the labels on my y-axis are not aligned with the bars.
My code (now a working copy-paste example):
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.colors import LogNorm
### Generate example data
np.random.seed(123)
year = [2018, 2019, 2020, 2021]
task = [x + 2 for x in range(18)]
student = [x for x in range(200)]
amount = [x + 10 for x in range(90)]
violation = [letter for letter in "thisisjustsampletextforlabels"] # one letter labels
df_example = pd.DataFrame({
    # some ways to create random data
    'year': np.random.choice(year, 500),
    'task': np.random.choice(task, 500),
    'violation': np.random.choice(violation, 500),
    'amount': np.random.choice(amount, 500),
    'student': np.random.choice(student, 500)
})
### My code
temp = df_example.groupby(["violation"])["amount"].sum().sort_values(ascending = False).reset_index()
total_violations = temp["amount"].sum()
sns.set(font_scale = 1.2)
f, axs = plt.subplots(1, 3,
                      figsize=(5, 5),
                      sharey="row",
                      gridspec_kw=dict(width_ratios=[3, 1.5, 5]))
# Plot frequency
df1 = df_example.groupby(["year","violation"])["amount"].sum().sort_values(ascending = False).reset_index()
frequency = sns.barplot(data = df1, y = "violation", x = "amount", log = True, ax=axs[0])
# Plot percent
df2 = df_example.groupby(["violation"])["amount"].sum().sort_values(ascending = False).reset_index()
total_violations = df2["amount"].sum()
percent = sns.barplot(x='amount', y='violation', estimator=lambda x: sum(x) / total_violations * 100, data=df2, ax=axs[1])
# Pivot table and plot heatmap
df_heatmap = df_example.groupby(["violation", "task"])["amount"].sum().sort_values(ascending = False).reset_index()
df_heatmap_pivot = df_heatmap.pivot("violation", "task", "amount")
df_heatmap_pivot = df_heatmap_pivot.reindex(index=df_heatmap["violation"].unique())
heatmap = sns.heatmap(df_heatmap_pivot, fmt = "d", cmap="Greys", norm=LogNorm(), ax=axs[2])
plt.subplots_adjust(top=1)
axs[2].set_facecolor('xkcd:white')
axs[2].set(ylabel="",xlabel="Task")
axs[0].set_xlabel('Total amount of violations per year')
axs[1].set_xlabel('Percent (%)')
axs[1].set_ylabel('')
axs[0].set_ylabel('Violation')
The result: the y-labels are aligned according to my last plot, the heatmap. However, the bars in the bar plots are clipping at the top, and are not aligned to the labels. I just have to nudge the bars in the barplot -- but how? I've been looking through the documentation, but I feel quite clueless at this point.
None of the y-axis ticklabels are aligned, because multiple dataframes are used for plotting. It will be better to create a single dataframe, violations, with the aggregated data to be plotted. Start with the sum of amounts by violation, and then add a new percent column. This will ensure the two bar plots share the same y-axis order.
Instead of using .groupby and then .pivot, to create df_heatmap_pivot, use .pivot_table, and then reindex using violations.violation.
Tested in python 3.10, pandas 1.4.3, matplotlib 3.5.1, seaborn 0.11.2
DataFrames and Imports
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.colors import LogNorm
# Generate example data
year = [2018, 2019, 2020, 2021]
task = [x + 2 for x in range(18)]
student = [x for x in range(200)]
amount = [x + 10 for x in range(90)]
violation = list("thisisjustsampletextforlabels") # one letter labels
np.random.seed(123)
df_example = pd.DataFrame({name: np.random.choice(group, 500) for name, group in
                           zip(['year', 'task', 'violation', 'amount', 'student'],
                               [year, task, violation, amount, student])})
# organize all of the data
# violations frequency
violations = df_example.groupby(["violation"])["amount"].sum().sort_values(ascending=False).reset_index()
total_violations = violations["amount"].sum()
# add percent
violations['percent'] = violations.amount.div(total_violations).mul(100).round(2)
# Use .pivot_table to create the pivot table
df_heatmap_pivot = df_example.pivot_table(index='violation', columns='task', values='amount', aggfunc='sum')
# Set the index to match the plot order of the 'violation' column
df_heatmap_pivot = df_heatmap_pivot.reindex(index=violations.violation)
Plotting
Using sharey='row' is causing the alignment problem. Use sharey=False, and remove the yticklabels from axs[1] and axs[2], with axs[1 or 2].set_yticks([]).
See How to add value labels on a bar chart for additional details and examples using .bar_label.
# set seaborn plot format
sns.set(font_scale=1.2)
# create the figure and set sharey=False
f, axs = plt.subplots(1, 3, figsize=(12, 12), sharey=False, gridspec_kw=dict(width_ratios=[3,1.5,5]))
# Plot frequency
sns.barplot(data=violations, x="amount", y="violation", log=True, ax=axs[0])
# Plot percent
sns.barplot(data=violations, x='percent', y='violation', ax=axs[1])
# add the bar labels
axs[1].bar_label(axs[1].containers[0], fmt='%.2f%%', label_type='edge', padding=3)
# add extra space for the annotation
axs[1].margins(x=1.3)
# plot the heatmap
heatmap = sns.heatmap(df_heatmap_pivot, fmt = "d", cmap="Greys", norm=LogNorm(), ax=axs[2])
# additional formatting
axs[2].set_facecolor('xkcd:white')
axs[2].set(ylabel="", xlabel="Task")
axs[0].set_xlabel('Total amount of violations per year')
axs[1].set_xlabel('Percent (%)')
axs[1].set_ylabel('')
axs[0].set_ylabel('Violation')
# remove yticks / labels
axs[1].set_yticks([])
_ = axs[2].set_yticks([])
Comment out the last two lines to verify the yticklabels are aligned for each axs.
DataFrame Views
df_example.head()
year task violation amount student
0 2020 2 i 84 59
1 2019 2 u 12 182
2 2020 5 s 20 9
3 2020 11 u 56 163
4 2018 17 t 59 125
violations
violation amount percent
0 s 4869 17.86
1 l 3103 11.38
2 t 3044 11.17
3 e 2634 9.66
4 a 2177 7.99
5 i 2099 7.70
6 h 1275 4.68
7 f 1232 4.52
8 b 1191 4.37
9 m 1155 4.24
10 o 1075 3.94
11 p 763 2.80
12 r 762 2.80
13 j 707 2.59
14 u 595 2.18
15 x 578 2.12
df_heatmap_pivot
task 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
violation
s 62.0 36.0 263.0 273.0 191.0 250.0 556.0 239.0 230.0 188.0 185.0 516.0 249.0 331.0 212.0 219.0 458.0 411.0
l 83.0 245.0 264.0 451.0 155.0 314.0 98.0 125.0 310.0 117.0 21.0 99.0 98.0 50.0 40.0 268.0 192.0 173.0
t 212.0 255.0 45.0 141.0 74.0 135.0 52.0 202.0 107.0 128.0 158.0 NaN 261.0 137.0 339.0 207.0 362.0 229.0
e 215.0 315.0 NaN 116.0 213.0 165.0 130.0 194.0 56.0 355.0 75.0 NaN 118.0 189.0 160.0 177.0 79.0 77.0
a 135.0 NaN 165.0 156.0 204.0 115.0 77.0 65.0 80.0 143.0 83.0 146.0 21.0 29.0 285.0 72.0 116.0 285.0
i 209.0 NaN 20.0 187.0 83.0 136.0 24.0 132.0 257.0 56.0 201.0 52.0 136.0 226.0 104.0 145.0 91.0 40.0
h 27.0 NaN 255.0 NaN 99.0 NaN 71.0 53.0 100.0 89.0 NaN 106.0 NaN 170.0 86.0 79.0 140.0 NaN
f 75.0 23.0 99.0 NaN 26.0 103.0 NaN 185.0 99.0 145.0 NaN 63.0 64.0 29.0 114.0 141.0 38.0 28.0
b 44.0 70.0 56.0 12.0 55.0 14.0 158.0 130.0 NaN 11.0 21.0 NaN 52.0 137.0 162.0 NaN 231.0 38.0
m 86.0 NaN NaN 147.0 74.0 131.0 49.0 180.0 94.0 16.0 NaN 88.0 NaN NaN NaN 51.0 161.0 78.0
o 109.0 NaN 51.0 NaN NaN NaN 20.0 139.0 149.0 NaN 101.0 60.0 NaN 143.0 39.0 73.0 10.0 181.0
p 16.0 NaN 197.0 50.0 87.0 NaN 88.0 NaN 11.0 162.0 NaN 14.0 NaN 78.0 45.0 NaN NaN 15.0
r NaN 85.0 73.0 40.0 NaN NaN 68.0 77.0 NaN 26.0 122.0 105.0 NaN 98.0 NaN NaN NaN 68.0
j NaN 70.0 NaN NaN 73.0 76.0 NaN 150.0 NaN NaN NaN 81.0 NaN 97.0 97.0 63.0 NaN NaN
u 174.0 45.0 NaN NaN 32.0 NaN NaN 86.0 30.0 56.0 13.0 NaN 24.0 NaN NaN 69.0 54.0 12.0
x 69.0 29.0 NaN 106.0 NaN 43.0 NaN NaN NaN 97.0 56.0 29.0 149.0 NaN NaN NaN NaN NaN

Adding to Pandas DataFrame using timestamps for index creates new columns

I have a script that reads data from a CSV, and I want to append new data to the DF as it becomes available. Unfortunately, when I do that, I always end up with new columns. The DF from the CSV looks like this when I print it:
df = pd.read_csv(filename, index_col=0, parse_dates=True)
Temp RH
Time
2021-05-17 11:08:34 51.08 77.9
2021-05-17 11:10:30 51.08 77.0
2021-05-17 11:10:35 50.72 71.9
2021-05-17 11:10:41 50.72 71.8
2021-05-17 11:12:19 50.72 71.6
... ... ...
2021-05-24 17:13:57 55.22 70.2
2021-05-24 17:14:02 55.22 69.6
2021-05-24 17:14:08 55.22 68.1
2021-05-24 17:14:18 54.86 66.9
2021-05-24 17:14:29 54.68 69.3
I use the following to create a fake new df for testing
from datetime import datetime
timeStamp = datetime.now()
timeStamp = timeStamp.strftime("%m-%d-%Y %H:%M:%S")
t = 51.06
h = 69.3
data = {'Temp': t, 'RH': h}
newDF = pd.DataFrame(data, index = pd.to_datetime([timeStamp]) )
print(newDF)
which gives me
Temp RH
2021-05-24 17:28:32 51.06 69.3
Here is the output when I call append()
print(df.append([df, pd.DataFrame(newDF)], ignore_index = False))
Temp RH Temp RH
2021-05-17 11:08:34 51.08 77.9 NaN NaN
2021-05-17 11:10:30 51.08 77.0 NaN NaN
2021-05-17 11:10:35 50.72 71.9 NaN NaN
2021-05-17 11:10:41 50.72 71.8 NaN NaN
2021-05-17 11:12:19 50.72 71.6 NaN NaN
... ... ... ... ...
2021-05-24 17:14:02 55.22 69.6 NaN NaN
2021-05-24 17:14:08 55.22 68.1 NaN NaN
2021-05-24 17:14:18 54.86 66.9 NaN NaN
2021-05-24 17:14:29 54.68 69.3 NaN NaN
2021-05-24 17:28:32 NaN NaN 51.06 69.3
[223293 rows x 4 columns]
and concat()
df1 = pd.concat([df, newDF], ignore_index=False)
print(df1)
Temp RH Temp RH
2021-05-17 11:08:34 51.08 77.9 NaN NaN
2021-05-17 11:10:30 51.08 77.0 NaN NaN
2021-05-17 11:10:35 50.72 71.9 NaN NaN
2021-05-17 11:10:41 50.72 71.8 NaN NaN
2021-05-17 11:12:19 50.72 71.6 NaN NaN
... ... ... ... ...
2021-05-24 17:14:02 55.22 69.6 NaN NaN
2021-05-24 17:14:08 55.22 68.1 NaN NaN
2021-05-24 17:14:18 54.86 66.9 NaN NaN
2021-05-24 17:14:29 54.68 69.3 NaN NaN
2021-05-24 17:28:32 NaN NaN 51.06 69.3
[111647 rows x 4 columns]
Instead of
print(df.append([df, pd.DataFrame(newDF)], ignore_index = False))
which I believe keeps the columns of each unique dataframe, just call append on the original dataframe itself.
Try
df = df.append(newDF, ignore_index = False)
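Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on current versions the equivalent is pd.concat. A minimal sketch reusing df and newDF from above:
# stack the two frames row-wise; columns are matched by name
df = pd.concat([df, newDF])
Both append and concat only line up columns whose names match exactly, so if you still end up with duplicated Temp/RH columns, compare df.columns.tolist() with newDF.columns.tolist() -- stray whitespace in the CSV header is a common culprit.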

Calculate rolling average for all columns pandas

I have the below dataframe:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': [2.85, 3.11, 3.3, 3.275, np.NaN, 4.21],
                   'b': [3.65, 3.825, 3.475, np.NaN, 4.10, 2.73],
                   'c': [4.3, 3.08, np.NaN, 2.40, 3.33, 2.48]},
                  index=pd.date_range('2019-01-01', periods=6, freq='M'))
This gives the dataframe as below:
a b c
2019-01-31 2.850 3.650 4.30
2019-02-28 3.110 3.825 3.08
2019-03-31 3.300 3.475 NaN
2019-04-30 3.275 NaN 2.40
2019-05-31 NaN 4.100 3.33
2019-06-30 4.210 2.730 2.48
Expected:
a b c
2019-01-31 2.850 3.650 4.30
2019-02-28 3.110 3.825 3.08
2019-03-31 3.300 3.475 **3.69**
2019-04-30 3.275 **3.650** 2.40
2019-05-31 **3.220** 4.100 3.33
2019-06-30 4.210 2.730 2.48
I want to replace the NaN values with the 3-month rolling average. How should I go about this?
If you are okay with counting the NaNs as 0 in your means, you can do:
df.fillna(0,inplace=True)
df.rolling(3).mean()
This will give you:
a b c
2019-01-31 NaN NaN NaN
2019-02-28 NaN NaN NaN
2019-03-31 3.086667 3.650000 2.460000
2019-04-30 3.228333 2.433333 1.826667
2019-05-31 2.191667 2.525000 1.910000
2019-06-30 2.495000 2.276667 2.736667
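Note this counts the missing values as real zeros, which drags the averages down, and it recomputes every cell rather than only the gaps. A sketch that fills just the NaNs with the mean of whatever values are present in the trailing 3-month window (which reproduces most of the expected table):
# rolling(3, min_periods=1) skips NaNs inside each window,
# and fillna only touches the cells that were missing
df = df.fillna(df.rolling(3, min_periods=1).mean())
For example, the missing c on 2019-03-31 becomes (4.30 + 3.08) / 2 = 3.69, as in the expected output.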

How to group by level 0 and describe in a multi index and level dataframe (pandas)?

Here is a multi-index, multi-level dataframe (file). Loading the dataframe from a csv:
import pandas as pd
df = pd.read_csv('./enviar/only-bh-extreme-events-satellite.csv',
                 index_col=[0, 1, 2, 3, 4],
                 header=[0, 1, 2, 3],
                 skipinitialspace=True,
                 tupleize_cols=True)
df.columns = pd.MultiIndex.from_tuples(df.columns)
print(df)
                                                          ci                \
                                                           1
                                                           1
                                                         00h  06h  12h  18h
wsid lat        lon        start               prcp_24
329 -43.969397 -19.883945 2007-03-18 10:00:00 72.0 NaN NaN NaN NaN
2007-03-20 10:00:00 104.4 NaN NaN NaN NaN
2007-10-18 23:00:00 92.8 NaN NaN NaN NaN
2007-12-21 00:00:00 60.4 NaN NaN NaN NaN
2008-01-19 18:00:00 53.0 NaN NaN NaN NaN
2008-04-05 01:00:00 80.8 0.0 0.0 0.0 0.0
2008-10-31 17:00:00 101.8 NaN NaN NaN NaN
2008-11-01 04:00:00 82.0 NaN NaN NaN NaN
2008-12-29 00:00:00 57.8 NaN NaN NaN NaN
2009-03-28 10:00:00 72.4 NaN NaN NaN NaN
2009-10-07 02:00:00 57.8 NaN NaN NaN NaN
2009-10-08 00:00:00 83.8 NaN NaN NaN NaN
2009-11-28 16:00:00 84.4 NaN NaN NaN NaN
2009-12-18 04:00:00 51.8 NaN NaN NaN NaN
2009-12-28 00:00:00 96.4 NaN NaN NaN NaN
2010-01-06 05:00:00 74.2 NaN NaN NaN NaN
2011-12-18 00:00:00 113.6 NaN NaN NaN NaN
2011-12-19 00:00:00 90.6 NaN NaN NaN NaN
2012-11-15 07:00:00 85.8 NaN NaN NaN NaN
2013-10-17 00:00:00 52.4 NaN NaN NaN NaN
2014-04-01 22:00:00 72.0 0.0 0.0 0.0 0.0
2014-10-20 06:00:00 56.6 NaN NaN NaN NaN
2014-12-13 09:00:00 104.4 NaN NaN NaN NaN
2015-02-09 00:00:00 62.0 NaN NaN NaN NaN
2015-02-16 19:00:00 56.8 NaN NaN NaN NaN
2015-05-06 17:00:00 50.8 0.0 0.0 0.0 0.0
2016-02-26 00:00:00 52.2 NaN NaN NaN NaN
343 -44.416883 -19.885398 2008-08-30 21:00:00 50.4 0.0 0.0 0.0 0.0
2009-02-01 01:00:00 53.8 NaN NaN NaN NaN
2010-03-22 00:00:00 51.4 NaN NaN NaN NaN
2011-11-12 21:00:00 57.8 NaN NaN NaN NaN
2011-11-25 22:00:00 107.6 NaN NaN NaN NaN
2012-12-28 20:00:00 94.0 NaN NaN NaN NaN
2013-10-16 22:00:00 50.8 NaN NaN NaN NaN
2014-11-06 21:00:00 55.2 NaN NaN NaN NaN
2015-01-24 00:00:00 80.0 NaN NaN NaN NaN
2015-01-27 00:00:00 52.8 NaN NaN NaN NaN
370 -43.958651 -19.980034 2015-01-28 23:00:00 50.4 NaN NaN NaN NaN
2015-01-29 00:00:00 50.6 NaN NaN NaN NaN
I'm trying to describe the data grouped by level 0, i.e. the variables ci, d, r, z... I'd like to get the count, max, min, std, etc.
When I tried df.describe(), it did not group by level 0. This is what I expected:
ci cc z r -> Level 0
count 39.000000 39.000000 39.000000 39.000000
mean 422577.032051 422025.595353 421672.402244 422449.004808
std 144740.869473 144550.040108 144425.167173 144692.422425
min 0.000000 0.000000 0.000000 0.000000
25% 467962.437500 467512.156250 467915.437500 468552.750000
50% 470644.687500 469924.468750 469772.312500 470947.468750
75% 472557.875000 471953.828125 471156.250000 472279.937500
max 473988.062500 473269.187500 472358.125000 473675.812500
I created this helper function:
def format_percentiles(percentiles):
    percentiles = np.asarray(percentiles)
    percentiles = 100 * percentiles
    int_idx = (percentiles.astype(int) == percentiles)
    if np.all(int_idx):
        out = percentiles.astype(int).astype(str)
        return [i + '%' for i in out]
And this is my own describe function:
import numpy as np
from functools import reduce
def describe_customized(df):
    _df = pd.DataFrame()
    data = []
    variables = list(set(df.columns.get_level_values(0)))
    variables.sort()
    for var in variables:
        idx = pd.IndexSlice
        values = df.loc[:, idx[[var]]].values.tolist()  # get all values of a specific variable
        z = reduce(lambda x, y: x + y, values)  # flatten a list of lists
        data.append(pd.Series(z, name=var))
    #return data
    for series in data:
        percentiles = np.array([0.25, 0.5, 0.75])
        formatted_percentiles = format_percentiles(percentiles)
        stat_index = (['count', 'mean', 'std', 'min'] + formatted_percentiles + ['max'])
        d = ([series.count(), series.mean(), series.std(), series.min()] +
             [series.quantile(x) for x in percentiles] + [series.max()])
        s = pd.Series(d, index=stat_index, name=series.name)
        _df = pd.concat([_df, s], axis=1)
    return _df
dd = describe_customized(df)
Result:
al asn cc chnk ci ciwc \
25% 0.130846 0.849998 0.000000 0.018000 0.0 0.000000e+00
50% 0.131369 0.849999 0.000000 0.018000 0.0 0.000000e+00
75% 0.134000 0.849999 0.000000 0.018000 0.0 0.000000e+00
count 624.000000 624.000000 23088.000000 624.000000 64.0 2.308800e+04
max 0.137495 0.849999 1.000000 0.018006 0.0 5.576574e-04
mean 0.119082 0.762819 0.022013 0.016154 0.0 8.247306e-07
min 0.000000 0.000000 0.000000 0.000000 0.0 0.000000e+00
std 0.040338 0.258087 0.098553 0.005465 0.0 8.969210e-06
I created a function that returns a new dataframe with the statistics of the variables for a level of your choice:
def describe_levels(df, level):
    df_des = pd.DataFrame(
        index=df.columns.levels[0],
        columns=['count', 'mean', 'std', 'min', '25', '50', '75', 'max']
    )
    for index in df_des.index:
        df_des.loc[index, 'count'] = len(df[index]['1'][level])
        df_des.loc[index, 'mean'] = df[index]['1'][level].mean().mean()
        df_des.loc[index, 'std'] = df[index]['1'][level].std().mean()
        df_des.loc[index, 'min'] = df[index]['1'][level].min().mean()
        df_des.loc[index, 'max'] = df[index]['1'][level].max().mean()
        df_des.loc[index, '25'] = df[index]['1'][level].quantile(q=0.25).mean()
        df_des.loc[index, '50'] = df[index]['1'][level].quantile(q=0.5).mean()
        df_des.loc[index, '75'] = df[index]['1'][level].quantile(q=0.75).mean()
    return df_des
For example, I called:
describe_levels(df,'1').T
The result for pressure level 1 is a table with one column of these statistics per level-0 variable.
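For comparison, a minimal pandas-native sketch that pools every value under each level-0 variable and uses the built-in describe (the structure of df is taken from the question):
# one flat Series per level-0 variable; describe() then returns
# count/mean/std/min/25%/50%/75%/max, skipping NaNs in the count
variables = df.columns.get_level_values(0).unique()
summary = pd.DataFrame({var: pd.Series(df[var].values.ravel()).describe()
                        for var in variables})
print(summary)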
