How to convert wide to long format in a Pandas dataframe? - python

I have a dataframe df in the format below:
Date TLRA_CAPE TLRA_Pct B_CAPE B_Pct RC_CAPE RC_Pct
1/1/2000 10 0.20 30 0.40 50 0.60
2/1/2000 15 0.25 35 0.45 55 0.65
3/1/2000 17 0.27 37 0.47 57 0.6
I need to convert it into the format below:
Date Variable CAPE Pct
1/1/2000 TLRA 10 0.20
2/1/2000 TLRA 15 0.25
3/1/2000 TLRA 17 0.27
1/1/2000 B 30 0.40
2/1/2000 B 35 0.45
3/1/2000 B 37 0.47
1/1/2000 RC 50 0.60
2/1/2000 RC 55 0.65
3/1/2000 RC 57 0.6
I am struggling to convert to the required format. I tried using pd.melt and pd.pivot, but those are not working.

After changing your columns you can do this with wide_to_long. Also, you have both PCT and Pct; I assumed that is a typo. If not, do df.columns = df.columns.str.upper() first.
df = df.set_index('Date')
df.columns = df.columns.str.split('_').map(lambda x: '_'.join(x[::-1]))
pd.wide_to_long(df.reset_index(), ['CAPE', 'Pct'], i='Date', j='Variable', sep='_', suffix='\w+')
Out[63]:
CAPE Pct
Date Variable
1/1/2000 TLRA 10 0.20
2/1/2000 TLRA 15 0.25
3/1/2000 TLRA 17 0.27
1/1/2000 B 30 0.40
2/1/2000 B 35 0.45
3/1/2000 B 37 0.47
1/1/2000 RC 50 0.60
2/1/2000 RC 55 0.65
3/1/2000 RC 57 0.60
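For completeness, here is the same approach as a self-contained sketch (the DataFrame construction is my reconstruction of the sample data shown above, not from the original post):

```python
import pandas as pd

# reconstructed sample data from the question
df = pd.DataFrame({
    "Date": ["1/1/2000", "2/1/2000", "3/1/2000"],
    "TLRA_CAPE": [10, 15, 17], "TLRA_Pct": [0.20, 0.25, 0.27],
    "B_CAPE": [30, 35, 37], "B_Pct": [0.40, 0.45, 0.47],
    "RC_CAPE": [50, 55, 57], "RC_Pct": [0.60, 0.65, 0.60],
})

# wide_to_long expects names like "<stub>_<suffix>", so flip each
# "<variable>_<stub>" column into "<stub>_<variable>" first
df = df.set_index("Date")
df.columns = df.columns.str.split("_").map(lambda x: "_".join(x[::-1]))
out = pd.wide_to_long(df.reset_index(), ["CAPE", "Pct"],
                      i="Date", j="Variable", sep="_", suffix=r"\w+")
```

The default suffix pattern only matches digits, which is why `suffix=r"\w+"` is needed here for the word suffixes TLRA, B, and RC.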

Evaluate monthly fraction of yearly data - Python [duplicate]

This question already has answers here: Pandas percentage of total with groupby (16 answers). Closed 10 months ago.
I have a pandas dataframe as:
ID Date Value
A 1/1/2000 5
A 2/1/2000 10
A 3/1/2000 20
A 4/1/2000 10
B 1/1/2000 100
B 2/1/2000 200
B 3/1/2000 300
B 4/1/2000 400
How do I evaluate the monthly fraction of the total yearly value for each ID as the fourth column?
ID Date Value Fraction
A 1/1/2000 5 0.11
A 2/1/2000 10 0.22
A 3/1/2000 20 0.44
A 4/1/2000 10 0.11
B 1/1/2000 100 0.11
B 2/1/2000 200 0.22
B 3/1/2000 300 0.33
B 4/1/2000 400 0.44
I guess I could use groupby?
I think your sample data is missing a second year to be representative, assuming your real DataFrame spans more than a single year.
I just added one line for 2001:
import pandas as pd
df['Date'] = pd.to_datetime(df['Date'])
print(df)
ID Date Value
0 A 2000-01-01 5
1 A 2000-02-01 10
2 A 2000-03-01 20
3 A 2000-04-01 10
4 B 2000-01-01 100
5 B 2000-02-01 200
6 B 2000-03-01 300
7 B 2000-04-01 400
8 B 2001-04-01 20
If I understood correctly you can do it like this:
df['Fraction'] = (df['Value'] / df.groupby(['ID', df['Date'].dt.year])['Value'].transform('sum')).round(2)
print(df)
ID Date Value Fraction
0 A 2000-01-01 5 0.11
1 A 2000-02-01 10 0.22
2 A 2000-03-01 20 0.44
3 A 2000-04-01 10 0.22
4 B 2000-01-01 100 0.10
5 B 2000-02-01 200 0.20
6 B 2000-03-01 300 0.30
7 B 2000-04-01 400 0.40
8 B 2001-04-01 20 1.00
You can divide the Value column by the result of a groupby.transform sum, followed by round(2) to match your expected output:
df['Fraction'] = df['Value'] / df.groupby('ID')['Value'].transform('sum')
df['Fraction'] = df['Fraction'].round(2)
print(df)
ID Date Value Fraction
0 A 1/1/2000 5 0.11
1 A 2/1/2000 10 0.22
2 A 3/1/2000 20 0.44
3 A 4/1/2000 10 0.22
4 B 1/1/2000 100 0.10
5 B 2/1/2000 200 0.20
6 B 3/1/2000 300 0.30
7 B 4/1/2000 400 0.40
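As a self-contained sketch of this approach (the DataFrame is reconstructed from the question's sample data):

```python
import pandas as pd

# reconstructed sample data from the question
df = pd.DataFrame({
    "ID": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "Date": ["1/1/2000", "2/1/2000", "3/1/2000", "4/1/2000"] * 2,
    "Value": [5, 10, 20, 10, 100, 200, 300, 400],
})

# divide each Value by the total Value of its ID group
df["Fraction"] = (df["Value"]
                  / df.groupby("ID")["Value"].transform("sum")).round(2)
```

Note that this reproduces the output above, in which row 3 of group A is 0.22 (10/45) rather than the 0.11 in the asker's expected output.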

python: How to plot a Sankey graph in a dash app

I have been trying to create a Sankey graph in a dash application using the plotly tutorial, but I have had no success. I have been stuck with the error dash.exceptions.InvalidCallbackReturnValue: The callback ..my_bee_map.figure.. is a multi-output. I can't figure out how to fix that.
My end objective is to see the flow of labels and years with values >= 0.75. That means when I choose a label, e.g. cs.AI, I want to see the flow of labels and years with values >= 0.75 correlated to the label cs.AI.
Here is my dash app and the values for the plot.
import pandas as pd
import plotly.express as px  # (version 4.7.0 or higher)
import plotly.graph_objects as go
import dash_core_components as dcc
import dash_html_components as html
from dash.dependencies import Input, Output
import numpy as np
from dash import Dash  # pip install dash (version 2.0.0 or higher)

app = Dash(__name__)
df = pd.read_csv("data.csv")

app.layout = html.Div([
    html.H1("Web Application Dashboards with Dash", style={'text-align': 'center'}),
    dcc.Dropdown(id="slct_label",
                 options=[{'label': x, 'value': x} for x in
                          sorted(df["Label1"].unique())],
                 multi=False,
                 value="cs.AI",
                 style={'width': "30%"}
                 ),
    html.Br(),
    dcc.Dropdown(id="slct_value",
                 options=[{'label': x, 'value': x} for x in
                          sorted(df[df["value"] >= 0.75]["value"].unique())],
                 multi=False,
                 value=0.75,
                 style={'width': "40%"},
                 placeholder="Select threshold"
                 ),
    # html.Div(id='output_container', children=[]),
    html.Br(),
    dcc.Graph(id='my_bee_map', figure={})
])

# Callback - app interactivity section------------------------------------
@app.callback(
    [Output(component_id='output_container', component_property='children'),
     Output(component_id='my_bee_map', component_property='figure')],
    [Input(component_id='slct_label', component_property='value'),
     Input(component_id='slct_value', component_property='value')]
)
def update_graph(slct_label, slct_value):
    # print(slct_label, slct_value)
    # print(type(slct_label), type(slct_value))
    container = "The year chosen by user was: {}".format(slct_label)
    dff = df.copy()
    # if slct_label:
    dff = df[df['Label1'] != slct_label]
    label = list(dff['Label1'].unique())
    source = np.repeat(np.arange(0, 5), 18).tolist()
    Value = list(dff['value'])
    target = list(dff['Label1'])
    color = np.repeat(np.array(['#a6cee3', '#fdbf6f', '#fb9a99', '#e3a6ce', '#a6e3da'], dtype=object), 18).tolist()
    link = dict(source=source, target=target, value=Value, color=color)
    node = dict(label=label, pad=35, thickness=15)
    data = go.Sankey(link=link, node=node)
    # graph the Sankey
    fig = go.Figure(data)
    fig.update_layout(
        hovermode='x',
        title='Migration from 1990 to 1992',
        font=dict(size=10, color='white'),
        paper_bgcolor='#51504f')
    return fig, container


if __name__ == '__main__':
    app.run_server()
My data:
Label1 value year
0 cs.AI 1.00 1990
1 cs.AI 0.20 1990
2 cs.AI 0.85 1990
3 cs.AI 0.99 1990
4 cs.AI 0.19 1990
5 cs.AI 0.87 1990
6 cs.CC 0.19 1990
7 cs.CC 1.00 1990
8 cs.CC 0.34 1990
9 cs.CC 0.50 1990
10 cs.CC 0.09 1990
11 cs.CC 0.67 1990
12 cs.CE 0.94 1990
13 cs.CE 0.63 1990
14 cs.CE 1.00 1990
15 cs.CE 0.61 1990
16 cs.CE 0.82 1990
17 cs.CE 0.17 1990
18 cs.CG 0.74 1990
19 cs.CG 0.95 1990
20 cs.CG 0.53 1990
21 cs.CG 1.00 1990
22 cs.CG 0.43 1990
23 cs.CG 0.10 1990
24 cs.CL 0.31 1990
25 cs.CL 0.27 1990
26 cs.CL 0.91 1990
27 cs.CL 0.21 1990
28 cs.CL 1.00 1990
29 cs.CL 0.12 1990
30 cs.CR 0.31 1990
31 cs.CR 0.18 1990
32 cs.CR 0.76 1990
33 cs.CR 0.35 1990
34 cs.CR 0.67 1990
35 cs.CR 1.00 1990
36 cs.AI 1.00 1991
37 cs.AI 0.55 1991
38 cs.AI 0.82 1991
39 cs.AI 0.05 1991
40 cs.AI 0.17 1991
41 cs.AI 0.83 1991
42 cs.CC 0.52 1991
43 cs.CC 1.00 1991
44 cs.CC 0.64 1991
45 cs.CC 1.00 1991
46 cs.CC 0.80 1991
47 cs.CC 0.21 1991
48 cs.CE 0.10 1991
49 cs.CE 0.58 1991
50 cs.CE 1.00 1991
51 cs.CE 0.01 1991
52 cs.CE 0.77 1991
53 cs.CE 0.19 1991
54 cs.CG 0.08 1991
55 cs.CG 0.21 1991
56 cs.CG 0.63 1991
57 cs.CG 1.00 1991
58 cs.CG 0.34 1991
59 cs.CG 0.60 1991
60 cs.CL 0.91 1991
61 cs.CL 0.33 1991
62 cs.CL 0.60 1991
63 cs.CL 0.57 1991
64 cs.CL 1.00 1991
65 cs.CL 0.37 1991
66 cs.CR 1.00 1991
67 cs.CR 0.28 1991
68 cs.CR 0.92 1991
69 cs.CR 0.47 1991
70 cs.CR 0.53 1991
71 cs.CR 1.00 1991
72 cs.AI 1.00 1992
73 cs.AI 0.79 1992
74 cs.AI 0.86 1992
75 cs.AI 0.30 1992
76 cs.AI 0.27 1992
77 cs.AI 0.91 1992
78 cs.CC 0.06 1992
79 cs.CC 1.00 1992
80 cs.CC 0.72 1992
81 cs.CC 0.44 1992
82 cs.CC 0.31 1992
83 cs.CC 0.75 1992
84 cs.CE 0.40 1992
85 cs.CE 0.07 1992
86 cs.CE 1.00 1992
87 cs.CE 0.88 1992
88 cs.CE 0.79 1992
89 cs.CE 0.03 1992
90 cs.CG 0.74 1992
91 cs.CG 0.91 1992
92 cs.CG 1.00 1992
93 cs.CG 1.00 1992
94 cs.CG 0.68 1992
95 cs.CG 0.22 1992
96 cs.CL 0.42 1992
97 cs.CL 0.03 1992
98 cs.CL 0.95 1992
99 cs.CL 0.17 1992
100 cs.CL 1.00 1992
101 cs.CL 0.28 1992
102 cs.CR 0.04 1992
103 cs.CR 0.30 1992
104 cs.CR 0.26 1992
105 cs.CR 0.80 1992
106 cs.CR 0.90 1992
107 cs.CR 1.00 1992
What you put after the return statement has to be in the same order as the Outputs declared in the callback. It means that in your code, the return should be return container, fig instead of return fig, container.
If that does not work, try a separate callback with only the Sankey figure as output (and do not use square brackets around the Output statement when you have only one output). You can make another callback for the container output.

Expand time series data in pandas dataframe

I am attempting to interpolate between time points for all data in a pandas dataframe. My current data is in time increments of 0.04 seconds. I want it to be in increments of 0.01 seconds to match another data set. I realize I can use the DataFrame.interpolate() function to do this. However, I am stuck on how to insert 3 rows of NaN in-between every row of my dataframe in an efficient manner.
import pandas as pd
import numpy as np
df = pd.DataFrame(data={"Time": [0.0, 0.04, 0.08, 0.12],
"Pulse": [76, 74, 77, 80],
"O2":[99, 100, 99, 98]})
df_ins = pd.DataFrame(data={"Time": [np.nan, np.nan, np.nan],
"Pulse": [np.nan, np.nan, np.nan],
"O2":[np.nan, np.nan, np.nan]})
I want df to transform from this:
Time Pulse O2
0 0.00 76 99
1 0.04 74 100
2 0.08 77 99
3 0.12 80 98
To something like this:
Time Pulse O2
0 0.00 76 99
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 0.04 74 100
5 NaN NaN NaN
6 NaN NaN NaN
7 NaN NaN NaN
8 0.08 77 99
9 NaN NaN NaN
10 NaN NaN NaN
11 NaN NaN NaN
12 0.12 80 98
Which I can then call on
df = df.interpolate()
Which would yield something like this (I'm making up the numbers here):
Time Pulse O2
0 0.00 76 99
1 0.01 76 99
2 0.02 75 99
3 0.03 74 100
4 0.04 74 100
5 0.05 75 100
6 0.06 76 99
7 0.07 77 99
8 0.08 77 99
9 0.09 77 99
10 0.10 78 98
11 0.11 79 98
12 0.12 80 98
I attempted to use an iterrows technique by inserting the df_ins frame after every row. But my index was thrown off during the iteration. I also tried slicing df and concatenating the df slices and df_ins, but once again the indexes were thrown off by the loop.
Does anyone have any recommendations on how to do this efficiently?
Use resample here (replace ffill with your desired behavior, maybe mess around with interpolate)
df["Time"] = pd.to_timedelta(df["Time"], unit="S")
df.set_index("Time").resample("0.01S").ffill()
Pulse O2
Time
00:00:00 76 99
00:00:00.010000 76 99
00:00:00.020000 76 99
00:00:00.030000 76 99
00:00:00.040000 74 100
00:00:00.050000 74 100
00:00:00.060000 74 100
00:00:00.070000 74 100
00:00:00.080000 77 99
00:00:00.090000 77 99
00:00:00.100000 77 99
00:00:00.110000 77 99
00:00:00.120000 80 98
If you do want to interpolate:
df.set_index("Time").resample("0.01S").interpolate()
Pulse O2
Time
00:00:00 76.00 99.00
00:00:00.010000 75.50 99.25
00:00:00.020000 75.00 99.50
00:00:00.030000 74.50 99.75
00:00:00.040000 74.00 100.00
00:00:00.050000 74.75 99.75
00:00:00.060000 75.50 99.50
00:00:00.070000 76.25 99.25
00:00:00.080000 77.00 99.00
00:00:00.090000 77.75 98.75
00:00:00.100000 78.50 98.50
00:00:00.110000 79.25 98.25
00:00:00.120000 80.00 98.00
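For reference, here is the resample-and-interpolate approach as a self-contained sketch; note that recent pandas versions prefer the lowercase aliases ("s", "10ms") over "S" and "0.01S":

```python
import pandas as pd

# sample data from the question
df = pd.DataFrame({"Time": [0.0, 0.04, 0.08, 0.12],
                   "Pulse": [76, 74, 77, 80],
                   "O2": [99, 100, 99, 98]})

# convert the float seconds to timedeltas so resample can upsample to 10 ms
df["Time"] = pd.to_timedelta(df["Time"], unit="s")
out = df.set_index("Time").resample("10ms").interpolate()
```

The upsampled grid has 13 rows (0 ms to 120 ms in 10 ms steps), with the original values preserved at their time points and linear interpolation in between.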
I believe using np.linspace and processing column-wise should be faster than interpolate (if your Time column is not exactly in time format):
import numpy as np
import pandas as pd

new_dict = {}
for c in df.columns:
    arr = df[c]
    ret = []
    for i in range(1, len(arr)):
        ret.append(np.linspace(arr[i-1], arr[i], 4, endpoint=False)[1:])
    new_dict[c] = np.concatenate(ret)
pd.concat([df, pd.DataFrame(new_dict)]).sort_values('Time').reset_index(drop=True)
Time Pulse O2
0 0.00 76.00 99.00
1 0.01 75.50 99.25
2 0.02 75.00 99.50
3 0.03 74.50 99.75
4 0.04 74.00 100.00
5 0.05 74.75 99.75
6 0.06 75.50 99.50
7 0.07 76.25 99.25
8 0.08 77.00 99.00
9 0.09 77.75 98.75
10 0.10 78.50 98.50
11 0.11 79.25 98.25
12 0.12 80.00 98.00

Pandas rolling cumulative sum of across two dataframes

I'm looking to create a rolling grouped cumulative sum across two dataframes. I can get the result via iteration, but wanted to see if there was a more intelligent way.
I need the 5 row block of A to roll through the rows of B and accumulate. Think of it as rolling balance with a block of contributions and rolling returns.
So, here's the calculation for C
A B
1 100.00 1 0.01 101.00
2 110.00 2 0.02 215.22 102.00
3 120.00 3 0.03 345.28 218.36 103.00
4 130.00 4 0.04 494.29 351.89 221.52 104.00
5 140.00 5 0.05 666.00 505.99 358.60 224.70 105.00
6 0.06 684.75 517.91 365.38 227.90 106.00
7 0.07 703.97 530.06 372.25 231.12
8 0.08 723.66 542.43 379.21
9 0.09 743.85 555.04
10 0.10 764.54
C Row 5
Beginning Balance Contribution Return Ending Balance
0.00 100.00 0.01 101.00
101.00 110.00 0.02 215.22
215.22 120.00 0.03 345.28
345.28 130.00 0.04 494.29
494.29 140.00 0.05 666.00
C Row 6
Beginning Balance Contribution Return Ending Balance
0.00 100.00 0.02 102.00
102.00 110.00 0.03 218.36
218.36 120.00 0.04 351.89
351.89 130.00 0.05 505.99
505.99 140.00 0.06 684.75
Here's what the source data looks like:
A B
1 100.00 1 0.01
2 110.00 2 0.02
3 120.00 3 0.03
4 130.00 4 0.04
5 140.00 5 0.05
6 0.06
7 0.07
8 0.08
9 0.09
10 0.10
Here is the desired result:
C
1 NaN
2 NaN
3 NaN
4 NaN
5 666.00
6 684.75
7 703.97
8 723.66
9 743.85
10 764.54
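No answer was posted here, but the calculation described above can be sketched with a plain loop (my own illustration, not from the original post): each block of 5 contributions from A is compounded through a rolling window of 5 returns from B.

```python
import numpy as np
import pandas as pd

# source data from the question
A = pd.Series([100.0, 110.0, 120.0, 130.0, 140.0], index=range(1, 6))
B = pd.Series([0.01, 0.02, 0.03, 0.04, 0.05,
               0.06, 0.07, 0.08, 0.09, 0.10], index=range(1, 11))

n = len(A)
C = pd.Series(np.nan, index=B.index)
for k in range(n, len(B) + 1):
    bal = 0.0
    # add each contribution, then apply that period's return
    for contrib, rate in zip(A, B.loc[k - n + 1:k]):
        bal = (bal + contrib) * (1 + rate)
    C.loc[k] = round(bal, 2)
```

This reproduces the desired result column (NaN for rows 1-4, then 666.00 at row 5 through 764.54 at row 10). A vectorized version would need to unroll the compounding recurrence, so the explicit loop is arguably the clearest starting point.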

Pandas round is not working for DataFrame

Round works on a single element but not on the DataFrame; I tried DataFrame.round() but it didn't work... any idea? Thanks.
Here is the code:
print "Panda Version: ", pd.__version__
print "['5am'][0]: ", x3['5am'][0]
print "Round element: ", np.round(x3['5am'][0]*4) /4
print "Round Dataframe: \r\n", np.round(x3 * 4, decimals=2) / 4
df = np.round(x3 * 4, decimals=2) / 4
print "Round Dataframe Again: \r\n", df.round(2)
Got result:
Panda Version: 0.18.0
['5am'][0]: 0.279914529915
Round element: 0.25
Round Dataframe:
5am 6am 7am 8am 9am 10am 11am
Date
2016-07-11 0.279915 0.279915 2.85256 4.52778 6.23291 9.01496 8.53632
2016-07-12 0.339744 0.369658 2.67308 4.52778 5.00641 7.30983 6.98077
2016-07-13 0.399573 0.459402 2.61325 3.83974 5.48504 6.77137 5.24573
2016-07-14 0.339744 0.549145 2.64316 3.36111 5.66453 5.96368 7.87821
2016-07-15 0.309829 0.459402 2.55342 4.64744 4.46795 6.80128 6.17308
2016-07-16 0.25 0.369658 2.46368 2.67308 4.58761 6.35256 5.63462
2016-07-17 0.279915 0.369658 2.58333 2.91239 4.19872 5.51496 6.65171
Round Dataframe Again:
5am 6am 7am 8am 9am 10am 11am
Date
2016-07-11 0.279915 0.279915 2.85256 4.52778 6.23291 9.01496 8.53632
2016-07-12 0.339744 0.369658 2.67308 4.52778 5.00641 7.30983 6.98077
2016-07-13 0.399573 0.459402 2.61325 3.83974 5.48504 6.77137 5.24573
2016-07-14 0.339744 0.549145 2.64316 3.36111 5.66453 5.96368 7.87821
2016-07-15 0.309829 0.459402 2.55342 4.64744 4.46795 6.80128 6.17308
2016-07-16 0.25 0.369658 2.46368 2.67308 4.58761 6.35256 5.63462
2016-07-17 0.279915 0.369658 2.58333 2.91239 4.19872 5.51496 6.65171
Try to cast to float type:
x3.astype(float).round(2)
As simple as this:
df['col_name'] = df['col_name'].astype(float).round(2)
Explanation of your code:
In [166]: np.round(df * 4, decimals=2)
Out[166]:
a b c d
0 0.11 0.45 1.65 3.38
1 3.97 2.90 1.89 3.42
2 1.46 0.79 3.00 1.44
3 3.48 2.33 0.81 1.02
4 1.03 0.65 1.94 2.92
5 1.88 2.21 0.59 0.39
6 0.08 2.09 4.00 1.02
7 2.86 0.71 3.56 0.57
8 1.23 1.38 3.47 0.03
9 3.09 1.10 1.12 3.31
In [167]: np.round(df * 4, decimals=2) / 4
Out[167]:
a b c d
0 0.0275 0.1125 0.4125 0.8450
1 0.9925 0.7250 0.4725 0.8550
2 0.3650 0.1975 0.7500 0.3600
3 0.8700 0.5825 0.2025 0.2550
4 0.2575 0.1625 0.4850 0.7300
5 0.4700 0.5525 0.1475 0.0975
6 0.0200 0.5225 1.0000 0.2550
7 0.7150 0.1775 0.8900 0.1425
8 0.3075 0.3450 0.8675 0.0075
9 0.7725 0.2750 0.2800 0.8275
In [168]: np.round(np.round(df * 4, decimals=2) / 4, 2)
Out[168]:
a b c d
0 0.03 0.11 0.41 0.84
1 0.99 0.72 0.47 0.86
2 0.36 0.20 0.75 0.36
3 0.87 0.58 0.20 0.26
4 0.26 0.16 0.48 0.73
5 0.47 0.55 0.15 0.10
6 0.02 0.52 1.00 0.26
7 0.72 0.18 0.89 0.14
8 0.31 0.34 0.87 0.01
9 0.77 0.28 0.28 0.83
This is working properly for me (pandas 0.18.1)
In [162]: df = pd.DataFrame(np.random.rand(10,4), columns=list('abcd'))
In [163]: df
Out[163]:
a b c d
0 0.028700 0.112959 0.412192 0.845663
1 0.991907 0.725550 0.472020 0.856240
2 0.365117 0.197468 0.750554 0.360272
3 0.870041 0.582081 0.203692 0.255915
4 0.257433 0.161543 0.483978 0.730548
5 0.470767 0.553341 0.146612 0.096358
6 0.020052 0.522482 0.999089 0.254312
7 0.714934 0.178061 0.889703 0.143701
8 0.308284 0.344552 0.868151 0.007825
9 0.771984 0.274245 0.280431 0.827999
In [164]: df.round(2)
Out[164]:
a b c d
0 0.03 0.11 0.41 0.85
1 0.99 0.73 0.47 0.86
2 0.37 0.20 0.75 0.36
3 0.87 0.58 0.20 0.26
4 0.26 0.16 0.48 0.73
5 0.47 0.55 0.15 0.10
6 0.02 0.52 1.00 0.25
7 0.71 0.18 0.89 0.14
8 0.31 0.34 0.87 0.01
9 0.77 0.27 0.28 0.83
Similar issue. df.round(1) didn't round as expected (e.g. .400000000123) but df.astype('float64').round(1) worked. Significantly, the dtype of df is float32. Apparently round() doesn't work properly on float32. How is this behavior not a bug?
As I just found here, "round does not modify in-place. Rather, it returns the dataframe rounded."
It might be helpful to think of this as follows:
df.round(2) is doing the correct rounding operation, but you are not asking it to see the result or saving it anywhere.
Thus, df_final = df.round(2) will likely complete your expected functionality, instead of just df.round(2). That's because the results of the rounding operation are now being saved to the df_final dataframe.
Additionally, it might be best to do one additional thing and use df_final = df.round(2).copy() instead of simply df_final = df.round(2). I find that some things return unexpected results if I don't assign a copy of the old dataframe to the new dataframe.
I've tried to reproduce your situation, and it seems to work nicely.
import pandas as pd
import numpy as np
from io import StringIO
s = """Date 5am 6am 7am 8am 9am 10am 11am
2016-07-11 0.279915 0.279915 2.85256 4.52778 6.23291 9.01496 8.53632
2016-07-12 0.339744 0.369658 2.67308 4.52778 5.00641 7.30983 6.98077
2016-07-13 0.399573 0.459402 2.61325 3.83974 5.48504 6.77137 5.24573
2016-07-14 0.339744 0.549145 2.64316 3.36111 5.66453 5.96368 7.87821
2016-07-15 0.309829 0.459402 2.55342 4.64744 4.46795 6.80128 6.17308
2016-07-16 0.25 0.369658 2.46368 2.67308 4.58761 6.35256 5.63462
2016-07-17 0.279915 0.369658 2.58333 2.91239 4.19872 5.51496 6.65171
"""
df = pd.read_table(StringIO(s), delim_whitespace=True)
df.set_index('Date').round(2)
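Summing up the answers in a minimal sketch (my own illustration): round() returns a new rounded DataFrame instead of modifying in place, and float32 frames are best cast to float64 before rounding:

```python
import pandas as pd

df = pd.DataFrame({"a": [0.279915, 0.339744]})

# round() returns a new DataFrame; df itself is left unchanged
rounded = df.round(2)

# for float32 data, cast up to float64 first to avoid precision artifacts
df32 = df.astype("float32")
rounded32 = df32.astype("float64").round(2)
```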
