Copying table using pd.read_html - python

Using pd.read_html in python, I am trying to copy a table from the following website:
https://finance.naver.com/sise/investorDealTrendDay.nhn?bizdate=215600&sosok=&page=2
import pandas as pd
# define the page URL shown above
pg_url = 'https://finance.naver.com/sise/investorDealTrendDay.nhn?bizdate=215600&sosok=&page=2'
df = pd.DataFrame()
df = df.append(pd.read_html(pg_url, header=0)[0], ignore_index=False)
Yet, I can't copy the numbers for some reason.
I'd appreciate your help in figuring out what went wrong.

It works well for me if you remove header=0 and then drop the rows that are all NaN:
url ='https://finance.naver.com/sise/investorDealTrendDay.nhn?bizdate=215600&sosok=&page=2'
df = pd.read_html(url)[0].dropna(how='all')
print (df)
날짜 개인 외국인 기관계 기관 \
날짜 개인 외국인 기관계 금융투자 보험 투신(사모) 은행 기타금융기관
0 20.08.06 -850.0 1638.0 -801.0 2247.0 -517.0 -993.0 46.0 -138.0
1 20.08.05 4315.0 -516.0 -3666.0 -1277.0 -441.0 -871.0 -18.0 -30.0
2 20.08.04 1844.0 -583.0 -1488.0 392.0 -493.0 -205.0 14.0 -54.0
3 20.08.03 6237.0 -2687.0 -3795.0 -2841.0 -108.0 -411.0 0.0 -5.0
4 20.07.31 4716.0 -556.0 -3861.0 -2659.0 -129.0 -709.0 -7.0 -4.0
8 20.07.30 64.0 2247.0 -2342.0 423.0 -171.0 -428.0 -3.0 -13.0
9 20.07.29 476.0 2936.0 -3368.0 -1346.0 -296.0 -698.0 -8.0 -92.0
10 20.07.28 -10495.0 13060.0 -2220.0 -1440.0 -526.0 318.0 12.0 -76.0
11 20.07.27 -2996.0 1584.0 1395.0 1968.0 -20.0 161.0 -179.0 -58.0
12 20.07.24 2881.0 876.0 -3678.0 -1173.0 -545.0 -843.0 -43.0 -8.0
기타법인
연기금등 기타법인
0 -1446.0 13.0
1 -1029.0 -133.0
2 -1142.0 227.0
3 -429.0 246.0
4 -352.0 -299.0
8 -2151.0 30.0
9 -929.0 -44.0
10 -507.0 -345.0
11 -476.0 16.0
12 -1066.0 -79.0
If you need the first column as the index, converted to a DatetimeIndex:
url ='https://finance.naver.com/sise/investorDealTrendDay.nhn?bizdate=215600&sosok=&page=2'
df = pd.read_html(url, index_col=0)[0].dropna(how='all')
df.index = pd.to_datetime(df.index, format='%y.%m.%d')
print (df)
날짜 개인 외국인 기관계 기관 \
날짜 개인 외국인 기관계 금융투자 보험 투신(사모) 은행 기타금융기관
2020-08-06 -850.0 1638.0 -801.0 2247.0 -517.0 -993.0 46.0 -138.0
2020-08-05 4315.0 -516.0 -3666.0 -1277.0 -441.0 -871.0 -18.0 -30.0
2020-08-04 1844.0 -583.0 -1488.0 392.0 -493.0 -205.0 14.0 -54.0
2020-08-03 6237.0 -2687.0 -3795.0 -2841.0 -108.0 -411.0 0.0 -5.0
2020-07-31 4716.0 -556.0 -3861.0 -2659.0 -129.0 -709.0 -7.0 -4.0
2020-07-30 64.0 2247.0 -2342.0 423.0 -171.0 -428.0 -3.0 -13.0
2020-07-29 476.0 2936.0 -3368.0 -1346.0 -296.0 -698.0 -8.0 -92.0
2020-07-28 -10495.0 13060.0 -2220.0 -1440.0 -526.0 318.0 12.0 -76.0
2020-07-27 -2996.0 1584.0 1395.0 1968.0 -20.0 161.0 -179.0 -58.0
2020-07-24 2881.0 876.0 -3678.0 -1173.0 -545.0 -843.0 -43.0 -8.0
날짜 기타법인
날짜 연기금등 기타법인
2020-08-06 -1446.0 13.0
2020-08-05 -1029.0 -133.0
2020-08-04 -1142.0 227.0
2020-08-03 -429.0 246.0
2020-07-31 -352.0 -299.0
2020-07-30 -2151.0 30.0
2020-07-29 -929.0 -44.0
2020-07-28 -507.0 -345.0
2020-07-27 -476.0 16.0
2020-07-24 -1066.0 -79.0
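If you want to collect several pages of this table, note that DataFrame.append (used in the question) has since been deprecated and removed from pandas; building a list of frames and calling pd.concat is the usual replacement. A minimal sketch, assuming three pages and that every page parses the same way:
import pandas as pd

base = ('https://finance.naver.com/sise/investorDealTrendDay.nhn'
        '?bizdate=215600&sosok=&page={}')

frames = []
for page in range(1, 4):  # the number of pages (3) is an assumption
    # first table on each page, with the all-NaN spacer rows dropped
    frames.append(pd.read_html(base.format(page))[0].dropna(how='all'))

df = pd.concat(frames, ignore_index=True)
print(df)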

Related

Create a line/area chart as a gantt chart with plotly

I'm trying to create a line/area chart that looks like a Gantt chart with Plotly in Python. That's because I do not have start and end columns (required for px.timeline). Instead, I have several vectors, each of which begins at a certain point in time and decreases over several months. To illustrate, this is my dataframe:
periods 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
start
2018-12 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 1.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2019-01 252.0 240.0 228.0 208.0 199.0 182.0 168.0 152.0 141.0 132.0 120.0 108.0 91.0 77.0 66.0 52.0 37.0 19.0 7.0
2019-02 140.0 135.0 129.0 123.0 114.0 101.0 99.0 91.0 84.0 74.0 62.0 49.0 45.0 39.0 33.0 26.0 20.0 10.0 3.0
2019-03 97.0 93.0 85.0 79.0 73.0 68.0 62.0 60.0 54.0 50.0 45.0 41.0 37.0 31.0 23.0 18.0 11.0 4.0 NaN
2019-04 92.0 90.0 86.0 82.0 78.0 73.0 67.0 58.0 51.0 46.0 41.0 38.0 36.0 34.0 32.0 19.0 14.0 3.0 1.0
2019-05 110.0 106.0 98.0 94.0 88.0 84.0 81.0 74.0 66.0 64.0 61.0 53.0 42.0 37.0 32.0 20.0 15.0 11.0 1.0
2019-06 105.0 101.0 96.0 87.0 84.0 80.0 75.0 69.0 65.0 60.0 56.0 46.0 40.0 32.0 30.0 18.0 10.0 6.0 2.0
2019-07 123.0 121.0 113.0 105.0 97.0 90.0 82.0 77.0 74.0 69.0 68.0 66.0 55.0 47.0 36.0 32.0 24.0 11.0 2.0
2019-08 127.0 122.0 117.0 112.0 108.0 100.0 94.0 82.0 78.0 69.0 65.0 58.0 53.0 43.0 35.0 24.0 17.0 8.0 2.0
2019-09 122.0 114.0 106.0 100.0 90.0 83.0 76.0 69.0 58.0 50.0 45.0 39.0 32.0 28.0 24.0 17.0 8.0 5.0 1.0
2019-10 164.0 161.0 151.0 138.0 129.0 121.0 114.0 102.0 95.0 88.0 81.0 72.0 62.0 56.0 48.0 40.0 22.0 16.0 5.0
2019-11 216.0 214.0 202.0 193.0 181.0 165.0 150.0 139.0 126.0 116.0 107.0 95.0 82.0 65.0 54.0 44.0 31.0 14.0 7.0
2019-12 341.0 327.0 311.0 294.0 274.0 261.0 245.0 225.0 210.0 191.0 171.0 136.0 117.0 96.0 79.0 55.0 45.0 26.0 6.0
2020-01 1167.0 1139.0 1089.0 1009.0 948.0 881.0 826.0 745.0 682.0 608.0 539.0 473.0 401.0 346.0 292.0 244.0 171.0 90.0 31.0
2020-02 280.0 274.0 262.0 247.0 239.0 226.0 204.0 184.0 169.0 158.0 141.0 125.0 105.0 89.0 68.0 55.0 29.0 18.0 3.0
2020-03 723.0 713.0 668.0 629.0 581.0 537.0 499.0 462.0 419.0 384.0 340.0 293.0 268.0 215.0 172.0 136.0 103.0 67.0 19.0
2020-04 1544.0 1502.0 1420.0 1337.0 1256.0 1149.0 1065.0 973.0 892.0 795.0 715.0 637.0 538.0 463.0 371.0 283.0 199.0 111.0 29.0
2020-05 1355.0 1313.0 1241.0 1175.0 1102.0 1046.0 970.0 890.0 805.0 726.0 652.0 569.0 488.0 415.0 331.0 255.0 180.0 99.0 19.0
2020-06 1042.0 1009.0 949.0 886.0 834.0 784.0 740.0 670.0 611.0 558.0 493.0 438.0 380.0 312.0 257.0 195.0 125.0 78.0 NaN
2020-07 719.0 698.0 663.0 624.0 595.0 547.0 512.0 460.0 424.0 387.0 341.0 301.0 256.0 215.0 172.0 124.0 90.0 NaN NaN
2020-08 655.0 633.0 605.0 566.0 537.0 492.0 453.0 417.0 377.0 333.0 294.0 259.0 222.0 189.0 162.0 118.0 NaN NaN NaN
2020-09 715.0 687.0 647.0 617.0 562.0 521.0 479.0 445.0 408.0 371.0 331.0 297.0 257.0 208.0 165.0 NaN NaN NaN NaN
2020-10 345.0 333.0 313.0 297.0 284.0 267.0 252.0 225.0 201.0 183.0 159.0 141.0 123.0 108.0 NaN NaN NaN NaN NaN
2020-11 1254.0 1221.0 1162.0 1094.0 1027.0 965.0 892.0 816.0 743.0 682.0 607.0 549.0 464.0 NaN NaN NaN NaN NaN NaN
2020-12 387.0 379.0 352.0 338.0 319.0 292.0 275.0 257.0 230.0 207.0 185.0 157.0 NaN NaN NaN NaN NaN NaN NaN
2021-01 805.0 782.0 742.0 692.0 649.0 599.0 551.0 500.0 463.0 417.0 371.0 NaN NaN NaN NaN NaN NaN NaN NaN
2021-02 469.0 458.0 434.0 407.0 380.0 357.0 336.0 317.0 296.0 263.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
2021-03 1540.0 1491.0 1390.0 1302.0 1221.0 1128.0 1049.0 967.0 864.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2021-04 1265.0 1221.0 1145.0 1086.0 1006.0 937.0 862.0 793.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2021-05 558.0 548.0 520.0 481.0 446.0 417.0 389.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2021-06 607.0 589.0 560.0 517.0 484.0 455.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2021-07 597.0 572.0 543.0 511.0 477.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2021-08 923.0 902.0 850.0 792.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2021-09 975.0 952.0 899.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2021-10 647.0 628.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2021-11 131.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
As you can see, each row starts at period 0 and has values until its last available period. Right now, my code is this:
vectors = []
for i in pivot_period.index:
    vectors.append(list(pivot_period.loc[i]))
fig = px.area(y=[i for i in vectors])
If you plot the graph, you will see that the x-axis is the number of periods. However, when I try to use the dates (which are the index), it fails with a length mismatch, since I have 18 periods vs. 36 dates. My idea is to plot something like this (sorry for the terrible pic):
In a way that would visualize the decay of each vector on its own timeline. Any ideas?
Generating an area figure from this data is simple: px.area(df, x=df.index, y=df.columns)
I do not see where the jobs/tasks come from in this dataset to match the attached image.
import io
import pandas as pd
import plotly.express as px

df = pd.read_csv(io.StringIO("""periods 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
start
2018-12 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 1.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2019-01 252.0 240.0 228.0 208.0 199.0 182.0 168.0 152.0 141.0 132.0 120.0 108.0 91.0 77.0 66.0 52.0 37.0 19.0 7.0
2019-02 140.0 135.0 129.0 123.0 114.0 101.0 99.0 91.0 84.0 74.0 62.0 49.0 45.0 39.0 33.0 26.0 20.0 10.0 3.0
2019-03 97.0 93.0 85.0 79.0 73.0 68.0 62.0 60.0 54.0 50.0 45.0 41.0 37.0 31.0 23.0 18.0 11.0 4.0 NaN
2019-04 92.0 90.0 86.0 82.0 78.0 73.0 67.0 58.0 51.0 46.0 41.0 38.0 36.0 34.0 32.0 19.0 14.0 3.0 1.0
2019-05 110.0 106.0 98.0 94.0 88.0 84.0 81.0 74.0 66.0 64.0 61.0 53.0 42.0 37.0 32.0 20.0 15.0 11.0 1.0
2019-06 105.0 101.0 96.0 87.0 84.0 80.0 75.0 69.0 65.0 60.0 56.0 46.0 40.0 32.0 30.0 18.0 10.0 6.0 2.0
2019-07 123.0 121.0 113.0 105.0 97.0 90.0 82.0 77.0 74.0 69.0 68.0 66.0 55.0 47.0 36.0 32.0 24.0 11.0 2.0
2019-08 127.0 122.0 117.0 112.0 108.0 100.0 94.0 82.0 78.0 69.0 65.0 58.0 53.0 43.0 35.0 24.0 17.0 8.0 2.0
2019-09 122.0 114.0 106.0 100.0 90.0 83.0 76.0 69.0 58.0 50.0 45.0 39.0 32.0 28.0 24.0 17.0 8.0 5.0 1.0
2019-10 164.0 161.0 151.0 138.0 129.0 121.0 114.0 102.0 95.0 88.0 81.0 72.0 62.0 56.0 48.0 40.0 22.0 16.0 5.0
2019-11 216.0 214.0 202.0 193.0 181.0 165.0 150.0 139.0 126.0 116.0 107.0 95.0 82.0 65.0 54.0 44.0 31.0 14.0 7.0
2019-12 341.0 327.0 311.0 294.0 274.0 261.0 245.0 225.0 210.0 191.0 171.0 136.0 117.0 96.0 79.0 55.0 45.0 26.0 6.0
2020-01 1167.0 1139.0 1089.0 1009.0 948.0 881.0 826.0 745.0 682.0 608.0 539.0 473.0 401.0 346.0 292.0 244.0 171.0 90.0 31.0
2020-02 280.0 274.0 262.0 247.0 239.0 226.0 204.0 184.0 169.0 158.0 141.0 125.0 105.0 89.0 68.0 55.0 29.0 18.0 3.0
2020-03 723.0 713.0 668.0 629.0 581.0 537.0 499.0 462.0 419.0 384.0 340.0 293.0 268.0 215.0 172.0 136.0 103.0 67.0 19.0
2020-04 1544.0 1502.0 1420.0 1337.0 1256.0 1149.0 1065.0 973.0 892.0 795.0 715.0 637.0 538.0 463.0 371.0 283.0 199.0 111.0 29.0
2020-05 1355.0 1313.0 1241.0 1175.0 1102.0 1046.0 970.0 890.0 805.0 726.0 652.0 569.0 488.0 415.0 331.0 255.0 180.0 99.0 19.0
2020-06 1042.0 1009.0 949.0 886.0 834.0 784.0 740.0 670.0 611.0 558.0 493.0 438.0 380.0 312.0 257.0 195.0 125.0 78.0 NaN
2020-07 719.0 698.0 663.0 624.0 595.0 547.0 512.0 460.0 424.0 387.0 341.0 301.0 256.0 215.0 172.0 124.0 90.0 NaN NaN
2020-08 655.0 633.0 605.0 566.0 537.0 492.0 453.0 417.0 377.0 333.0 294.0 259.0 222.0 189.0 162.0 118.0 NaN NaN NaN
2020-09 715.0 687.0 647.0 617.0 562.0 521.0 479.0 445.0 408.0 371.0 331.0 297.0 257.0 208.0 165.0 NaN NaN NaN NaN
2020-10 345.0 333.0 313.0 297.0 284.0 267.0 252.0 225.0 201.0 183.0 159.0 141.0 123.0 108.0 NaN NaN NaN NaN NaN
2020-11 1254.0 1221.0 1162.0 1094.0 1027.0 965.0 892.0 816.0 743.0 682.0 607.0 549.0 464.0 NaN NaN NaN NaN NaN NaN
2020-12 387.0 379.0 352.0 338.0 319.0 292.0 275.0 257.0 230.0 207.0 185.0 157.0 NaN NaN NaN NaN NaN NaN NaN
2021-01 805.0 782.0 742.0 692.0 649.0 599.0 551.0 500.0 463.0 417.0 371.0 NaN NaN NaN NaN NaN NaN NaN NaN
2021-02 469.0 458.0 434.0 407.0 380.0 357.0 336.0 317.0 296.0 263.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
2021-03 1540.0 1491.0 1390.0 1302.0 1221.0 1128.0 1049.0 967.0 864.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2021-04 1265.0 1221.0 1145.0 1086.0 1006.0 937.0 862.0 793.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2021-05 558.0 548.0 520.0 481.0 446.0 417.0 389.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2021-06 607.0 589.0 560.0 517.0 484.0 455.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2021-07 597.0 572.0 543.0 511.0 477.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2021-08 923.0 902.0 850.0 792.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2021-09 975.0 952.0 899.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2021-10 647.0 628.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2021-11 131.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN""")
, sep=r"\s+").drop(0).set_index("periods")
px.area(df, x=df.index, y=df.columns)
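To put each vector on its own calendar timeline (the decay effect asked about), one option, not part of the answer above, is to reshape to long form and offset every period column by its row's start month. A sketch using the df built above; the column names "offset" and "value" are my own:
import pandas as pd
import plotly.express as px

# long form: one row per (start month, period offset, value)
long_df = (df.reset_index()
             .melt(id_vars="periods", var_name="offset", value_name="value")
             .dropna(subset=["value"]))

# shift each observation to its actual calendar month
long_df["date"] = [(pd.Period(start, freq="M") + int(k)).to_timestamp()
                   for start, k in zip(long_df["periods"], long_df["offset"])]

# one trace per start month, each covering only its own stretch of the x-axis
fig = px.area(long_df, x="date", y="value", color="periods", line_group="periods")
fig.show()
Note that px.area stacks the traces by default; swap in px.line if each decay curve should be drawn independently.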

How to sum the result of a Pandas Groupby based on the index value of the groupby

I can't figure out how to sum up part of the results of a Pandas value_counts(). In this case, I need the sum of the values down to index 8 in the result of the value_counts() (which is a long Series).
Hopefully someone can help me with this. Thank you all in advance.
I perform a value_counts on one of my df columns with:
df_v2.Q_score_diff.value_counts().sort_index(ascending=False)
The resulting Series from the value_counts:
48.0 2
47.0 6
46.0 21
45.0 47
44.0 144
43.0 251
42.0 384
41.0 597
40.0 783
39.0 947
38.0 1225
37.0 1501
36.0 1822
35.0 2062
34.0 2312
33.0 2662
32.0 2907
31.0 3123
30.0 3349
29.0 3558
28.0 3862
27.0 3734
26.0 3878
25.0 3969
24.0 3997
23.0 3914
22.0 3907
21.0 3866
20.0 3624
19.0 3519
18.0 3396
17.0 3147
16.0 2894
15.0 2701
14.0 2475
13.0 2278
12.0 2077
11.0 1881
10.0 1611
9.0 1408
8.0 1304
7.0 1182
6.0 1042
5.0 845
4.0 735
3.0 722
2.0 615
1.0 534
0.0 505
-1.0 383
-2.0 330
-3.0 284
-4.0 227
-5.0 202
-6.0 148
-7.0 139
-8.0 112
-9.0 96
-10.0 65
-11.0 53
-12.0 46
-13.0 47
-14.0 31
-15.0 22
-16.0 19
-17.0 18
-18.0 12
-19.0 18
-20.0 8
-21.0 1
-22.0 10
-23.0 5
-24.0 7
-25.0 5
-26.0 2
-27.0 2
-28.0 2
-29.0 3
-32.0 4
-34.0 1
-35.0 2
-40.0 1
Is this what you are looking for?
>>> df.loc[:8.0]
47.0 6
46.0 21
45.0 47
44.0 144
43.0 251
42.0 384
41.0 597
40.0 783
39.0 947
38.0 1225
37.0 1501
36.0 1822
35.0 2062
34.0 2312
33.0 2662
32.0 2907
31.0 3123
30.0 3349
29.0 3558
28.0 3862
27.0 3734
26.0 3878
25.0 3969
24.0 3997
23.0 3914
22.0 3907
21.0 3866
20.0 3624
19.0 3519
18.0 3396
17.0 3147
16.0 2894
15.0 2701
14.0 2475
13.0 2278
12.0 2077
11.0 1881
10.0 1611
9.0 1408
8.0 1304
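To turn that slice into the requested sum, chain .sum() onto it. A small addition to the answer above, assuming the value_counts result is stored in a Series s:
s = df_v2.Q_score_diff.value_counts().sort_index(ascending=False)
total = s.loc[:8.0].sum()  # adds up the counts from the top of the index down to 8.0
print(total)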

Why is pandas df.diff(2) different than df.diff().diff()?

According to Enders' Applied Econometric Time Series, the second difference of a variable y is defined as Δ²y_t = Δ(Δy_t) = (y_t − y_{t−1}) − (y_{t−1} − y_{t−2}) = y_t − 2y_{t−1} + y_{t−2}.
Pandas provides the diff function, which takes "periods" as an argument. Nevertheless, df.diff(2) gives a different result than df.diff().diff().
Code excerpt showing the above:
In [8]: df
Out[8]:
C.1 C.2 C.3 C.4 C.5 C.6
C.0
1990 16.0 6.0 256.0 216.0 65536 4352
1991 17.0 7.0 289.0 343.0 131072 5202
1992 6.0 -4.0 36.0 -64.0 64 252
1993 7.0 -3.0 49.0 -27.0 128 392
1994 8.0 -2.0 64.0 -8.0 256 576
1995 13.0 3.0 169.0 27.0 8192 2366
1996 10.0 0.5 100.0 0.5 1024 1100
1997 11.0 1.0 121.0 1.0 2048 1452
1998 4.0 -6.0 16.0 -216.0 16 80
1999 5.0 -5.0 25.0 -125.0 32 150
2000 18.0 8.0 324.0 512.0 262144 6156
2001 3.0 -7.0 9.0 -343.0 8 36
2002 0.5 -10.0 0.5 -1000.0 48 20
2003 1.0 -9.0 1.0 -729.0 2 2
2004 14.0 4.0 196.0 64.0 16384 2940
2005 15.0 5.0 225.0 125.0 32768 3600
2006 12.0 2.0 144.0 8.0 4096 1872
2007 9.0 -1.0 81.0 -1.0 512 810
2008 2.0 -8.0 4.0 -512.0 4 12
2009 19.0 9.0 361.0 729.0 524288 7220
In [9]: df.diff(2)
Out[9]:
C.1 C.2 C.3 C.4 C.5 C.6
C.0
1990 NaN NaN NaN NaN NaN NaN
1991 NaN NaN NaN NaN NaN NaN
1992 -10.0 -10.0 -220.0 -280.0 -65472.0 -4100.0
1993 -10.0 -10.0 -240.0 -370.0 -130944.0 -4810.0
1994 2.0 2.0 28.0 56.0 192.0 324.0
1995 6.0 6.0 120.0 54.0 8064.0 1974.0
1996 2.0 2.5 36.0 8.5 768.0 524.0
1997 -2.0 -2.0 -48.0 -26.0 -6144.0 -914.0
1998 -6.0 -6.5 -84.0 -216.5 -1008.0 -1020.0
1999 -6.0 -6.0 -96.0 -126.0 -2016.0 -1302.0
2000 14.0 14.0 308.0 728.0 262128.0 6076.0
2001 -2.0 -2.0 -16.0 -218.0 -24.0 -114.0
2002 -17.5 -18.0 -323.5 -1512.0 -262096.0 -6136.0
2003 -2.0 -2.0 -8.0 -386.0 -6.0 -34.0
2004 13.5 14.0 195.5 1064.0 16336.0 2920.0
2005 14.0 14.0 224.0 854.0 32766.0 3598.0
2006 -2.0 -2.0 -52.0 -56.0 -12288.0 -1068.0
2007 -6.0 -6.0 -144.0 -126.0 -32256.0 -2790.0
2008 -10.0 -10.0 -140.0 -520.0 -4092.0 -1860.0
2009 10.0 10.0 280.0 730.0 523776.0 6410.0
In [10]: df.diff().diff()
Out[10]:
C.1 C.2 C.3 C.4 C.5 C.6
C.0
1990 NaN NaN NaN NaN NaN NaN
1991 NaN NaN NaN NaN NaN NaN
1992 -12.0 -12.0 -286.0 -534.0 -196544.0 -5800.0
1993 12.0 12.0 266.0 444.0 131072.0 5090.0
1994 0.0 0.0 2.0 -18.0 64.0 44.0
1995 4.0 4.0 90.0 16.0 7808.0 1606.0
1996 -8.0 -7.5 -174.0 -61.5 -15104.0 -3056.0
1997 4.0 3.0 90.0 27.0 8192.0 1618.0
1998 -8.0 -7.5 -126.0 -217.5 -3056.0 -1724.0
1999 8.0 8.0 114.0 308.0 2048.0 1442.0
2000 12.0 12.0 290.0 546.0 262096.0 5936.0
2001 -28.0 -28.0 -614.0 -1492.0 -524248.0 -12126.0
2002 12.5 12.0 306.5 198.0 262176.0 6104.0
2003 3.0 4.0 9.0 928.0 -86.0 -2.0
2004 12.5 12.0 194.5 522.0 16428.0 2956.0
2005 -12.0 -12.0 -166.0 -732.0 2.0 -2278.0
2006 -4.0 -4.0 -110.0 -178.0 -45056.0 -2388.0
2007 0.0 0.0 18.0 108.0 25088.0 666.0
2008 -4.0 -4.0 -14.0 -502.0 3076.0 264.0
2009 24.0 24.0 434.0 1752.0 524792.0 8006.0
In [11]: df.diff(2) - df.diff().diff()
Out[11]:
C.1 C.2 C.3 C.4 C.5 C.6
C.0
1990 NaN NaN NaN NaN NaN NaN
1991 NaN NaN NaN NaN NaN NaN
1992 2.0 2.0 66.0 254.0 131072.0 1700.0
1993 -22.0 -22.0 -506.0 -814.0 -262016.0 -9900.0
1994 2.0 2.0 26.0 74.0 128.0 280.0
1995 2.0 2.0 30.0 38.0 256.0 368.0
1996 10.0 10.0 210.0 70.0 15872.0 3580.0
1997 -6.0 -5.0 -138.0 -53.0 -14336.0 -2532.0
1998 2.0 1.0 42.0 1.0 2048.0 704.0
1999 -14.0 -14.0 -210.0 -434.0 -4064.0 -2744.0
2000 2.0 2.0 18.0 182.0 32.0 140.0
2001 26.0 26.0 598.0 1274.0 524224.0 12012.0
2002 -30.0 -30.0 -630.0 -1710.0 -524272.0 -12240.0
2003 -5.0 -6.0 -17.0 -1314.0 80.0 -32.0
2004 1.0 2.0 1.0 542.0 -92.0 -36.0
2005 26.0 26.0 390.0 1586.0 32764.0 5876.0
2006 2.0 2.0 58.0 122.0 32768.0 1320.0
2007 -6.0 -6.0 -162.0 -234.0 -57344.0 -3456.0
2008 -6.0 -6.0 -126.0 -18.0 -7168.0 -2124.0
2009 -14.0 -14.0 -154.0 -1022.0 -1016.0 -1596.0
Why are they different? Which one corresponds to the one defined in Ender's book?
This is precisely because
Δ²y_t = y_t − 2y_{t−1} + y_{t−2} ≠ y_t − y_{t−2}.
The left-hand side is df.diff().diff(), whereas the right-hand side is df.diff(2). For the second difference, you want the left-hand side.
Consider:
df
a
b
c
d
df.diff() is
NaN
b - a
c - b
d - c
df.diff(2) is
NaN
NaN
c - a
d - b
df.diff().diff() is
NaN
NaN
(c - b) - (b - a) = c - 2b + a
(d - c) - (c - b) = d - 2c + b
They're not the same, mathematically.
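A quick numeric check of that relationship, using shift to build the textbook second difference (the short series is just the first few values of C.1 from the example above):
import pandas as pd

s = pd.Series([16.0, 17.0, 6.0, 7.0, 8.0])     # first values of C.1

second_diff = s - 2 * s.shift(1) + s.shift(2)  # y_t - 2*y_{t-1} + y_{t-2}
print(second_diff.equals(s.diff().diff()))     # True:  diff().diff() is the second difference
print(second_diff.equals(s.diff(2)))           # False: diff(2) is y_t - y_{t-2}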

Merging two Pandas series with duplicate datetime indices

I have two Pandas series (d1 and d2) indexed by datetime, each containing one column of data with both floats and NaNs. Both indices are at one-day intervals, although there are many stretches of missing days. d1 ranges from 1974-12-16 to 2002-01-30. d2 ranges from 1997-12-19 to 2017-07-06. The period from 1997-12-19 to 2002-01-30 contains many indices that are duplicated between the two series. For a duplicated index, the two series sometimes hold the same value, sometimes different values, and sometimes one value and one NaN.
I would like to combine these two series into one, prioritizing the data from d2 whenever there are duplicate indices (that is, replace the d1 data with the d2 data whenever an index is duplicated). What is the most efficient way to do this among the many Pandas tools available (merge, join, concatenate, etc.)?
Here is an example of my data:
In [7]: print d1
fldDate
1974-12-16 19.0
1974-12-17 28.0
1974-12-18 24.0
1974-12-19 18.0
1974-12-20 17.0
1974-12-21 28.0
1974-12-22 28.0
1974-12-23 10.0
1974-12-24 6.0
1974-12-25 5.0
1974-12-26 12.0
1974-12-27 19.0
1974-12-28 22.0
1974-12-29 20.0
1974-12-30 16.0
1974-12-31 12.0
1975-01-01 12.0
1975-01-02 15.0
1975-01-03 14.0
1975-01-04 15.0
1975-01-05 18.0
1975-01-06 21.0
1975-01-07 22.0
1975-01-08 18.0
1975-01-09 20.0
1975-01-10 12.0
1975-01-11 8.0
1975-01-12 -2.0
1975-01-13 13.0
1975-01-14 24.0
...
2002-01-01 18.0
2002-01-02 16.0
2002-01-03 NaN
2002-01-04 24.0
2002-01-05 23.0
2002-01-06 15.0
2002-01-07 22.0
2002-01-08 34.0
2002-01-09 35.0
2002-01-10 29.0
2002-01-11 21.0
2002-01-12 24.0
2002-01-13 NaN
2002-01-14 18.0
2002-01-15 14.0
2002-01-16 10.0
2002-01-17 5.0
2002-01-18 7.0
2002-01-19 7.0
2002-01-20 7.0
2002-01-21 11.0
2002-01-22 NaN
2002-01-23 9.0
2002-01-24 8.0
2002-01-25 15.0
2002-01-26 NaN
2002-01-27 NaN
2002-01-28 18.0
2002-01-29 13.0
2002-01-30 13.0
Name: MaxTempMid, dtype: float64
In [8]: print d2
fldDate
1997-12-19 22.0
1997-12-20 14.0
1997-12-21 18.0
1997-12-22 16.0
1997-12-23 16.0
1997-12-24 10.0
1997-12-25 12.0
1997-12-26 12.0
1997-12-27 9.0
1997-12-28 12.0
1997-12-29 18.0
1997-12-30 23.0
1997-12-31 28.0
1998-01-01 26.0
1998-01-02 29.0
1998-01-03 27.0
1998-01-04 22.0
1998-01-05 19.0
1998-01-06 17.0
1998-01-07 14.0
1998-01-08 14.0
1998-01-09 14.0
1998-01-10 16.0
1998-01-11 20.0
1998-01-12 21.0
1998-01-13 19.0
1998-01-14 20.0
1998-01-15 16.0
1998-01-16 17.0
1998-01-17 20.0
...
2017-06-07 68.0
2017-06-08 71.0
2017-06-09 71.0
2017-06-10 59.0
2017-06-11 41.0
2017-06-12 57.0
2017-06-13 58.0
2017-06-14 36.0
2017-06-15 50.0
2017-06-16 58.0
2017-06-17 54.0
2017-06-18 53.0
2017-06-19 58.0
2017-06-20 68.0
2017-06-21 71.0
2017-06-22 71.0
2017-06-23 59.0
2017-06-24 61.0
2017-06-25 65.0
2017-06-26 68.0
2017-06-27 71.0
2017-06-28 60.0
2017-06-29 54.0
2017-06-30 48.0
2017-07-01 60.0
2017-07-02 68.0
2017-07-03 65.0
2017-07-04 73.0
2017-07-05 74.0
2017-07-06 77.0
Name: MaxTempMid, dtype: float64
Let's use combine_first:
d2.combine_first(d1)
Output:
fldDate
1974-12-16 19.0
1974-12-17 28.0
1974-12-18 24.0
1974-12-19 18.0
1974-12-20 17.0
1974-12-21 28.0
1974-12-22 28.0
1974-12-23 10.0
1974-12-24 6.0
1974-12-25 5.0
1974-12-26 12.0
1974-12-27 19.0
1974-12-28 22.0
1974-12-29 20.0
1974-12-30 16.0
1974-12-31 12.0
1975-01-01 12.0
1975-01-02 15.0
1975-01-03 14.0
1975-01-04 15.0
1975-01-05 18.0
1975-01-06 21.0
1975-01-07 22.0
1975-01-08 18.0
1975-01-09 20.0
1975-01-10 12.0
1975-01-11 8.0
1975-01-12 -2.0
1975-01-13 13.0
1975-01-14 24.0
...
2017-06-07 68.0
2017-06-08 71.0
2017-06-09 71.0
2017-06-10 59.0
2017-06-11 41.0
2017-06-12 57.0
2017-06-13 58.0
2017-06-14 36.0
2017-06-15 50.0
2017-06-16 58.0
2017-06-17 54.0
2017-06-18 53.0
2017-06-19 58.0
2017-06-20 68.0
2017-06-21 71.0
2017-06-22 71.0
2017-06-23 59.0
2017-06-24 61.0
2017-06-25 65.0
2017-06-26 68.0
2017-06-27 71.0
2017-06-28 60.0
2017-06-29 54.0
2017-06-30 48.0
2017-07-01 60.0
2017-07-02 68.0
2017-07-03 65.0
2017-07-04 73.0
2017-07-05 74.0
2017-07-06 77.0
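As a self-contained illustration of what combine_first does here (the values below are made up, not taken from the data above): wherever both series have a value for the same date, d2 wins; wherever d2 is missing or NaN, d1 fills in.
import pandas as pd

d1 = pd.Series([19.0, 28.0, 13.0],
               index=pd.to_datetime(["2002-01-28", "2002-01-29", "2002-01-30"]))
d2 = pd.Series([99.0, 13.0],
               index=pd.to_datetime(["2002-01-30", "2002-01-31"]))

print(d2.combine_first(d1))
# 2002-01-28    19.0   <- only in d1
# 2002-01-29    28.0   <- only in d1
# 2002-01-30    99.0   <- duplicate date: d2's value is kept
# 2002-01-31    13.0   <- only in d2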

Transposing dataframe column, creating different rows per day

I have a dataframe that has one column and a timestamp index including anywhere from 2 to 7 days:
kWh
Timestamp
2017-07-08 06:00:00 0.00
2017-07-08 07:00:00 752.75
2017-07-08 08:00:00 1390.20
2017-07-08 09:00:00 2027.65
2017-07-08 10:00:00 2447.27
.... ....
2017-07-12 20:00:00 167.64
2017-07-12 21:00:00 0.00
2017-07-12 22:00:00 0.00
2017-07-12 23:00:00 0.00
I would like to transpose the kWh column so that one day's worth of values (hourly granularity, so 24 values per day) fills up one row, and the next row is the next day's values, and so on (so five days of forecasted data gives five rows with 24 elements each).
Because my query of the data comes in the vertical format, and my regression and subsequent analysis already occurs in the vertical format, I don't want to change the process too much and am hoping there is a simpler way. I have tried giving a multi-index with df.index.hour and then using unstack(), but I get a huge dataframe with NaN values everywhere.
Is there an elegant way to do this?
If we start from a frame like
In [25]: df = pd.DataFrame({"kWh": 1}, index=pd.date_range("2017-07-08",
"2017-07-12", freq="1H").rename("Timestamp")).cumsum()
In [26]: df.head()
Out[26]:
kWh
Timestamp
2017-07-08 00:00:00 1
2017-07-08 01:00:00 2
2017-07-08 02:00:00 3
2017-07-08 03:00:00 4
2017-07-08 04:00:00 5
we can make date and hour columns and then pivot:
In [27]: df["date"] = df.index.date
In [28]: df["hour"] = df.index.hour
In [29]: df.pivot(index="date", columns="hour", values="kWh")
Out[29]:
hour 0 1 2 3 4 5 6 7 8 9 ... \
date ...
2017-07-08 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0 ...
2017-07-09 25.0 26.0 27.0 28.0 29.0 30.0 31.0 32.0 33.0 34.0 ...
2017-07-10 49.0 50.0 51.0 52.0 53.0 54.0 55.0 56.0 57.0 58.0 ...
2017-07-11 73.0 74.0 75.0 76.0 77.0 78.0 79.0 80.0 81.0 82.0 ...
2017-07-12 97.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
hour 14 15 16 17 18 19 20 21 22 23
date
2017-07-08 15.0 16.0 17.0 18.0 19.0 20.0 21.0 22.0 23.0 24.0
2017-07-09 39.0 40.0 41.0 42.0 43.0 44.0 45.0 46.0 47.0 48.0
2017-07-10 63.0 64.0 65.0 66.0 67.0 68.0 69.0 70.0 71.0 72.0
2017-07-11 87.0 88.0 89.0 90.0 91.0 92.0 93.0 94.0 95.0 96.0
2017-07-12 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
[5 rows x 24 columns]
Not sure why your MultiIndex code doesn't work.
I'm assuming your MultiIndex code is something along these lines, which gives the same output as the pivot:
In []
df = pd.DataFrame({"kWh": 1}, index=pd.date_range("2017-07-08",
"2017-07-12", freq="1H").rename("Timestamp")).cumsum()
df.index = pd.MultiIndex.from_arrays([df.index.date, df.index.hour], names=['Date','Hour'])
df.unstack()
Out[]:
kWh ... \
Hour 0 1 2 3 4 5 6 7 8 9 ...
Date ...
2017-07-08 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0 ...
2017-07-09 25.0 26.0 27.0 28.0 29.0 30.0 31.0 32.0 33.0 34.0 ...
2017-07-10 49.0 50.0 51.0 52.0 53.0 54.0 55.0 56.0 57.0 58.0 ...
2017-07-11 73.0 74.0 75.0 76.0 77.0 78.0 79.0 80.0 81.0 82.0 ...
2017-07-12 97.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
Hour 14 15 16 17 18 19 20 21 22 23
Date
2017-07-08 15.0 16.0 17.0 18.0 19.0 20.0 21.0 22.0 23.0 24.0
2017-07-09 39.0 40.0 41.0 42.0 43.0 44.0 45.0 46.0 47.0 48.0
2017-07-10 63.0 64.0 65.0 66.0 67.0 68.0 69.0 70.0 71.0 72.0
2017-07-11 87.0 88.0 89.0 90.0 91.0 92.0 93.0 94.0 95.0 96.0
2017-07-12 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
[5 rows x 24 columns]
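One small follow-up, not part of the original answer: the unstack() result keeps 'kWh' as an extra column level, so if you want plain hour columns like the pivot gives, drop that level:
out = df.unstack()              # columns are a MultiIndex: ('kWh', hour)
out = out.droplevel(0, axis=1)  # keep only the hour level, matching the pivot output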
