I have two DataFrames, df1 and df2.
df1 has the following contents:
Adj Close Close High Low \
GBTC QQQ GBTC QQQ GBTC QQQ GBTC
Date
2019-01-29 4.02 159.342209 4.02 161.570007 4.07 163.240005 3.93
2019-01-30 4.06 163.395538 4.06 165.679993 4.09 166.279999 4.01
2019-01-31 3.99 165.841370 3.99 168.160004 4.06 168.990005 3.93
2019-02-01 4.02 165.141129 4.02 167.449997 4.07 168.600006 3.93
2019-02-04 3.96 167.192474 3.96 169.529999 4.00 169.529999 3.93
... ... ... ... ... ... ... ...
2019-02-25 4.65 171.127441 4.65 173.520004 4.78 174.660004 4.50
2019-02-26 4.36 171.304947 4.36 173.699997 4.74 174.250000 4.36
2019-02-27 4.30 171.196487 4.30 173.589996 4.50 173.800003 4.30
2019-02-28 4.46 170.802002 4.46 173.190002 4.65 173.809998 4.40
2019-03-01 4.58 171.985443 4.58 174.389999 4.64 174.649994 4.45
Open Volume
QQQ GBTC QQQ GBTC QQQ
Date
2019-01-29 160.990005 3.970 163.199997 975200 30784200
2019-01-30 162.889999 4.035 163.399994 770700 41346500
2019-01-31 166.470001 4.040 166.699997 1108700 37258400
2019-02-01 166.990005 4.000 167.330002 889100 32143700
2019-02-04 167.330002 3.990 167.479996 871800 26718800
... ... ... ... ... ...
2019-02-25 173.399994 4.625 174.210007 2891200 32608800
2019-02-26 172.809998 4.625 173.100006 2000100 21939700
2019-02-27 171.759995 4.400 172.899994 1537000 25162000
2019-02-28 172.699997 4.420 173.050003 1192600 25085500
2019-03-01 173.179993 4.470 174.440002 948500 31431200
[23 rows x 12 columns]
And here's the contents of df2:
Adj Close Close High Low \
GBTC QQQ GBTC QQQ GBTC QQQ GBTC
Date
2019-02-25 4.65 171.127441 4.65 173.520004 4.78 174.660004 4.50
2019-02-26 4.36 171.304947 4.36 173.699997 4.74 174.250000 4.36
2019-02-27 4.30 171.196487 4.30 173.589996 4.50 173.800003 4.30
2019-02-28 4.46 170.802002 4.46 173.190002 4.65 173.809998 4.40
2019-03-01 4.58 171.985443 4.58 174.389999 4.64 174.649994 4.45
... ... ... ... ... ... ... ...
2019-03-28 4.54 176.171432 4.54 178.309998 4.68 178.979996 4.51
2019-03-29 4.78 177.505249 4.78 179.660004 4.83 179.830002 4.55
2019-04-01 4.97 179.856705 4.97 182.039993 5.03 182.259995 4.85
2019-04-02 5.74 180.538437 5.74 182.729996 5.83 182.910004 5.52
2019-04-03 6.19 181.575836 6.19 183.779999 6.59 184.919998 5.93
Open Volume
QQQ GBTC QQQ GBTC QQQ
Date
2019-02-25 173.399994 4.625 174.210007 2891200 32608800
2019-02-26 172.809998 4.625 173.100006 2000100 21939700
2019-02-27 171.759995 4.400 172.899994 1537000 25162000
2019-02-28 172.699997 4.420 173.050003 1192600 25085500
2019-03-01 173.179993 4.470 174.440002 948500 31431200
... ... ... ... ... ...
2019-03-28 177.240005 4.650 178.360001 2104400 30368200
2019-03-29 178.589996 4.710 179.690002 2937400 35205500
2019-04-01 180.770004 4.850 181.509995 2733600 30969500
2019-04-02 181.779999 5.660 182.240005 6062000 22645200
2019-04-03 183.210007 5.930 183.759995 10002400 31633500
[28 rows x 12 columns]
As you can see above, df1 and df2 have overlapping dates.
How can I create a merged DataFrame df that contains dates from 2019-01-29 to 2019-04-03 with no overlapping dates?
I've tried running df = df1.merge(df2, how='outer'). However, this returns a DataFrame with the Date index removed, which is not desirable.
> df
Adj Close Close High Low \
GBTC QQQ GBTC QQQ GBTC QQQ GBTC
0 4.02 159.342209 4.02 161.570007 4.07 163.240005 3.93
1 4.06 163.395538 4.06 165.679993 4.09 166.279999 4.01
2 3.99 165.841370 3.99 168.160004 4.06 168.990005 3.93
3 4.02 165.141129 4.02 167.449997 4.07 168.600006 3.93
4 3.96 167.192474 3.96 169.529999 4.00 169.529999 3.93
.. ... ... ... ... ... ... ...
41 4.54 176.171432 4.54 178.309998 4.68 178.979996 4.51
42 4.78 177.505249 4.78 179.660004 4.83 179.830002 4.55
43 4.97 179.856705 4.97 182.039993 5.03 182.259995 4.85
44 5.74 180.538437 5.74 182.729996 5.83 182.910004 5.52
45 6.19 181.575836 6.19 183.779999 6.59 184.919998 5.93
Open Volume
QQQ GBTC QQQ GBTC QQQ
0 160.990005 3.970 163.199997 975200 30784200
1 162.889999 4.035 163.399994 770700 41346500
2 166.470001 4.040 166.699997 1108700 37258400
3 166.990005 4.000 167.330002 889100 32143700
4 167.330002 3.990 167.479996 871800 26718800
.. ... ... ... ... ...
41 177.240005 4.650 178.360001 2104400 30368200
42 178.589996 4.710 179.690002 2937400 35205500
43 180.770004 4.850 181.509995 2733600 30969500
44 181.779999 5.660 182.240005 6062000 22645200
45 183.210007 5.930 183.759995 10002400 31633500
[46 rows x 12 columns]
It seems that I should find a way to merge df1.index and df2.index, and then attach the merged DatetimeIndex to df.
To reproduce my data for debugging, you can run the following code:
import yfinance as yf
symbols = ['QQQ', 'GBTC']
df1 = yf.download(symbols, start="2019-01-29", end="2019-03-01")
df2 = yf.download(symbols, start="2019-02-25", end="2019-04-03")
Taken from the docs:
The join is done on columns or indexes. If joining columns on columns, the DataFrame indexes will be ignored. Otherwise if joining indexes on indexes or indexes on a column or columns, the index will be passed on. When performing a cross merge, no column specifications to merge on are allowed.
So I believe that if you specify the index column in the merge with on='Date', you should be OK:
df1.merge(df2, how='outer', on='Date')
However, for the problem you are trying to solve, merge is not the correct tool. What you need to do is concatenate the DataFrames and then drop the duplicated days (DataFrame.append has been removed from modern pandas, so use pd.concat):
pd.concat([df1, df2]).drop_duplicates()
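To sketch the concat-and-deduplicate idea on toy data (a minimal sketch, not the yfinance frames above; deduplicating on the index rather than on row values keeps Date intact even if overlapping rows were to differ):

```python
import pandas as pd

# Two toy frames with overlapping DatetimeIndex values, standing in for df1 and df2.
idx1 = pd.date_range('2019-01-29', periods=5, freq='D')
idx2 = pd.date_range('2019-02-01', periods=5, freq='D')
df1 = pd.DataFrame({'Close': range(5)}, index=idx1)
df2 = pd.DataFrame({'Close': range(3, 8)}, index=idx2)

merged = pd.concat([df1, df2]).sort_index()
# Keep the first row for each duplicated date; this deduplicates on the
# index itself, so it works even when two rows hold different values.
merged = merged[~merged.index.duplicated(keep='first')]
```

merged then has a unique DatetimeIndex covering the union of the two date ranges.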
print(df.sample(50)):
match_datetime country league home_team away_team home_odds draw_odds away_odds run_time home_score away_score
72170 2021-10-17 12:30:00 Ukraine Persha Liga Alliance Uzhhorod 1.22 5.62 9.71 2021-10-17 09:22:20.212731 NaN NaN
100398 2021-11-02 14:35:00 Saudi Arabia Division 1 Al Qadisiya Bisha 1.61 3.61 4.94 2021-11-02 09:13:18.768604 2.0 1.0
33929 2021-09-11 23:00:00 Panama LPF Veraguas Plaza Amador 2.75 2.75 2.71 2021-09-10 23:47:54.682982 1.0 1.0
12328 2021-08-15 15:30:00 Poland Ekstraklasa Slask Wroclaw Leczna 1.74 3.74 4.59 2021-08-14 22:44:26.136608 0.0 0.0
81500 2021-10-24 13:00:00 Italy Serie D - Group A Caronnese Saluzzo 1.69 3.60 4.28 2021-10-23 13:37:16.920175 2.0 2.0
143370 2021-12-05 14:00:00 Poland Division 1 Chrobry Glogow Widzew Lodz 3.36 3.17 2.15 2021-11-30 17:40:24.833519 0.0 0.0
175061 2022-01-08 18:00:00 Spain Primera RFEF - Group 1 R. Union Extremadura UD 1.26 4.40 18.00 2022-01-08 17:00:46.662761 0.0 1.0
21293 2021-08-29 16:00:00 Italy Serie B Cittadella Crotone 2.32 3.11 3.31 2021-08-26 18:04:46.221393 4.0 2.0
97427 2021-11-01 17:00:00 Israel Leumit League M. Nazareth Beitar Tel Aviv 1.92 3.26 3.75 2021-10-30 09:40:08.966330 4.0 2.0
177665 2022-01-13 12:30:00 Egypt Division 2 - Group C Said El Mahalla Al Magd 4.12 3.08 1.94 2022-01-12 17:53:33.570126 0.0 0.0
69451 2021-10-17 05:00:00 South Korea K League 1 Gangwon Gwangju FC 2.06 3.38 3.65 2021-10-15 09:55:54.578112 NaN NaN
4742 2021-08-10 20:30:00 Peru Liga 2 Deportivo Coopsol Grau 3.14 3.49 2.06 2021-08-10 18:14:01.996860 0.0 2.0
22266 2021-08-29 13:00:00 France Ligue 1 Angers Rennes 2.93 3.27 2.56 2021-08-27 12:26:34.904374 2.0 0.0
46412 2021-09-26 04:00:00 Japan J2 League Okayama Blaublitz 2.24 2.90 3.63 2021-09-23 09:08:26.979783 1.0 1.0
133207 2021-11-27 21:15:00 Bolivia Division Profesional Palmaflor Blooming 1.51 4.05 5.10 2021-11-25 18:22:28.275844 3.0 0.0
140825 2021-11-28 11:00:00 Spain Tercera RFEF - Group 6 Valencia B Torrellano 1.58 3.56 5.26 2021-11-28 19:54:40.066637 2.0 0.0
226985 2022-03-04 00:30:00 Argentina Copa de la Liga Profesional Central Cordoba Rosario Central 2.36 3.26 2.86 2022-03-02 17:23:10.014424 0.0 1.0
137226 2021-11-28 12:45:00 Greece Super League 2 Apollon Pontou PAOK B 3.37 3.25 2.01 2021-11-27 15:13:05.937815 0.0 3.0
182756 2022-01-22 10:30:00 Turkey 1. Lig Umraniyespor Menemenspor 1.40 4.39 7.07 2022-01-19 17:25:27.128331 2.0 1.0
89895 2021-10-28 16:45:00 Netherlands KNVB Beker Ajax Cambuur 9.10 5.55 1.26 2021-10-27 07:46:56.253996 0.0 5.0
227595 2022-03-06 17:00:00 Israel Ligat ha'Al Ashdod Maccabi Petah Tikva 2.30 3.21 3.05 2022-03-02 17:23:10.014424 NaN NaN
57568 2021-10-02 13:00:00 Estonia Meistriliiga Kalju Legion 1.58 4.10 4.84 2021-10-02 10:55:35.287359 2.0 2.0
227035 2022-03-04 19:00:00 Denmark Superliga FC Copenhagen Randers FC 1.70 3.84 5.06 2022-03-02 17:23:10.014424 NaN NaN
108668 2021-11-07 13:30:00 Germany Oberliga Mittelrhein Duren Freialdenhoven 1.35 5.20 6.35 2021-11-06 17:37:37.629603 2.0 0.0
86270 2021-10-25 18:00:00 Belgium Pro League U21 Lommel SK U21 Lierse K. U21 3.23 3.84 1.92 2021-10-26 01:22:31.111441 0.0 0.0
89437 2021-11-01 02:10:00 Colombia Primera A America De Cali Petrolera 1.86 2.92 4.60 2021-10-27 07:41:24.427246 NaN NaN
13986 2021-08-21 13:00:00 France Ligue 2 Dijon Toulouse 3.92 3.51 1.94 2021-08-16 13:22:02.749887 2.0 4.0
105179 2021-11-06 15:00:00 England NPL Premier Division Atherton South Shields 3.90 3.42 1.82 2021-11-05 10:01:28.567328 1.0 1.0
142821 2021-12-01 12:30:00 Bulgaria Vtora liga Marek Septemvri Simitli 1.79 3.38 4.35 2021-11-30 17:40:24.833519 2.0 2.0
45866 2021-09-24 00:30:00 Venezuela Primera Division Dep. Tachira Portuguesa 1.96 3.60 3.22 2021-09-23 09:08:26.979783 4.0 1.0
76100 2021-10-22 16:30:00 Denmark 1st Division Hvidovre IF Koge 1.91 3.56 3.81 2021-10-21 08:43:12.445245 NaN NaN
115896 2021-11-14 16:00:00 Spain Tercera RFEF - Group 6 Olimpic Xativa Torrellano 2.78 2.89 2.39 2021-11-13 12:21:45.955738 1.0 0.0
156159 2021-12-12 16:00:00 Spain Segunda RFEF - Group 1 Marino de Luanco Coruxo FC 2.19 3.27 3.07 2021-12-10 09:26:45.001977 0.0 0.0
18240 2021-08-21 12:00:00 Germany Regionalliga West Rodinghausen Fortuna Koln 3.25 3.60 2.00 2021-08-21 03:30:43.193978 NaN NaN
184913 2022-01-22 10:00:00 World Club Friendly Zilina B Trinec 3.56 4.14 1.78 2022-01-22 16:44:32.650325 0.0 3.0
16782 2021-08-22 23:05:00 Colombia Primera A Petrolera Dep. Cali 3.01 3.00 2.44 2021-08-19 18:24:24.966505 2.0 3.0
63847 2021-10-10 09:30:00 Spain Tercera RFEF - Group 7 Carabanchel RSD Alcala 4.39 3.42 1.75 2021-10-09 12:03:50.720013 NaN NaN
7254 2021-08-12 16:45:00 Europe Europa Conference League Hammarby Cukaricki 1.72 3.87 4.13 2021-08-11 23:48:31.958394 NaN NaN
82727 2021-10-24 14:00:00 Lithuania I Lyga Zalgiris 2 Neptunas 1.76 3.78 3.35 2021-10-24 12:02:06.306279 1.0 3.0
43074 2021-09-22 18:00:00 Ukraine Super Cup Shakhtar Donetsk Dyn. Kyiv 2.57 3.49 2.59 2021-09-19 09:39:56.624504 NaN NaN
65187 2021-10-11 18:45:00 World World Cup Norway Montenegro 1.56 4.17 6.28 2021-10-11 10:56:09.973470 NaN NaN
120993 2021-11-18 00:00:00 USA NISA Maryland Bobcats California Utd. 2.76 3.23 2.39 2021-11-17 20:36:26.562731 1.0 1.0
201469 2022-02-12 15:00:00 England League One AFC Wimbledon Sunderland 3.30 3.48 2.17 2022-02-10 17:47:36.501159 1.0 1.0
142180 2021-12-01 19:45:00 Scotland Premiership St. Mirren Ross County 2.06 3.25 3.85 2021-11-29 18:28:22.249662 0.0 0.0
4681 2021-08-10 18:30:00 Europe Champions League Young Boys CFR Cluj 1.48 4.29 6.92 2021-08-10 18:14:01.996860 3.0 1.0
67321 2021-10-17 13:00:00 Spain LaLiga Rayo Vallecano Elche 1.78 3.64 4.99 2021-10-13 11:22:34.979378 NaN NaN
27499 2021-09-04 14:00:00 Iceland Inkasso-deildin Kordrengir Fjolnir 2.18 3.66 2.82 2021-09-02 23:28:49.414126 1.0 4.0
48962 2021-09-25 21:00:00 Mexico Liga Premier Serie B Uruapan Lobos Huerta 1.83 3.69 3.70 2021-09-25 13:02:58.238466 NaN NaN
65636 2021-10-16 17:00:00 Switzerland Super League Young Boys Luzern 1.26 6.04 9.43 2021-10-11 10:56:09.973470 NaN NaN
17333 2021-08-21 14:00:00 Finland Kakkonen Group A Atlantis Kiffen 1.57 4.29 4.42 2021-08-20 12:41:03.159846 1.0 1.0
I am trying to get the latest 2 match_datetime values for every run_time, and then filter (join) df to get all the relevant values, as below:
df['match_datetime'] = pd.to_datetime(df['match_datetime'])
s = (df['match_datetime'].dt.normalize()
.groupby([df['run_time']])
.value_counts()
.groupby(level=0)
.head(2))
print(s)
run_time match_datetime
2021-08-07 00:04:36.326391 2021-08-07 255
2021-08-06 188
2021-08-07 10:50:34.574040 2021-08-07 649
2021-08-08 277
2021-08-07 16:56:22.322338 2021-08-07 712
This returns a Series, while I want a DataFrame so I can merge.
To do this:
df_n = df.reset_index().merge(s, how="left",
left_on=["match_datetime", "run_time"],
right_on=["match_datetime", "run_time"])
While I am sure there is a better way to write the expression for s, I am unsure how to do it correctly.
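For reference, a counts Series like s can be turned into a plain DataFrame with reset_index before merging; a minimal sketch on toy data (the column name 'count' is my own choice):

```python
import pandas as pd

# Toy version of the match data: two run_times, three matches.
df = pd.DataFrame({
    'run_time': pd.to_datetime(['2021-08-07 00:04', '2021-08-07 00:04',
                                '2021-08-07 10:50']),
    'match_datetime': pd.to_datetime(['2021-08-07 12:00', '2021-08-06 09:00',
                                      '2021-08-07 15:00']),
})

s = (df['match_datetime'].dt.normalize()
       .groupby(df['run_time'])
       .value_counts()
       .groupby(level=0)
       .head(2))

# Series -> DataFrame with regular columns, ready to merge on.
counts = s.reset_index(name='count')
```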
If I understand correctly, you would like to filter the dataframe to retain, for each run_time, the last two rows (or up to two rows) by match_datetime.
Simplified answer
This can be done relatively easily without any join, using GroupBy.tail(). (Note: my original answer used GroupBy.rank(), but this is simpler, although slower.)
out = df.sort_values(
['run_time', 'match_datetime']
).groupby('run_time').tail(2)
Minimal example
import pandas as pd
import numpy as np
np.random.seed(0)
n = 10
rt = np.random.choice(pd.date_range('2022-01-01', periods=n//2, freq='D'), n)
df = pd.DataFrame({
'run_time': rt,
'match_datetime': rt - pd.to_timedelta(np.random.uniform(size=n), unit='days'),
})
df['match_datetime'] = df['match_datetime'].dt.round('h')
Then:
out = df.sort_values(
['run_time', 'match_datetime']
).groupby('run_time').tail(2)
>>> out
run_time match_datetime
9 2022-01-01 2021-12-31 04:00:00
1 2022-01-01 2021-12-31 15:00:00
5 2022-01-02 2022-01-01 02:00:00
7 2022-01-03 2022-01-02 22:00:00
3 2022-01-04 2022-01-03 11:00:00
6 2022-01-04 2022-01-03 22:00:00
0 2022-01-05 2022-01-04 01:00:00
8 2022-01-05 2022-01-05 00:00:00
On the OP's (extended) data
The output is quite verbose. Here is a sample:
>>> out['run_time match_datetime country draw_odds'.split()].head()
run_time match_datetime country draw_odds
4681 2021-08-10 18:14:01.996860 2021-08-10 18:30:00 Europe 4.29
4742 2021-08-10 18:14:01.996860 2021-08-10 20:30:00 Peru 3.49
7254 2021-08-11 23:48:31.958394 2021-08-12 16:45:00 Europe 3.87
12328 2021-08-14 22:44:26.136608 2021-08-15 15:30:00 Poland 3.74
13986 2021-08-16 13:22:02.749887 2021-08-21 13:00:00 France 3.51
Performance
With several million rows, the timing difference becomes significant, and using rank is faster. Faster still, you can avoid sorting on run_time (the result is the same, but the rows come out in a different order):
np.random.seed(0)
n = 1_000_000
rt = np.random.choice(pd.date_range('2022-01-01', periods=n//2, freq='min'), n)
df = pd.DataFrame({
'run_time': rt,
'match_datetime': rt - pd.to_timedelta(np.random.uniform(size=n), unit='s'),
})
%timeit df.sort_values(['run_time', 'match_datetime']).groupby('run_time').tail(2)
# 981 ms ± 14.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df.loc[df.groupby('run_time')['match_datetime'].rank(method='first', ascending=False) <= 2]
# 355 ms ± 2.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df.sort_values(['match_datetime']).groupby('run_time').tail(2)
# 258 ms ± 846 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
Check solutions:
a = df.sort_values(['run_time', 'match_datetime']).groupby('run_time').tail(2)
b = df.loc[df.groupby('run_time')['match_datetime'].rank(method='first', ascending=False) <= 2]
c = df.sort_values(['match_datetime']).groupby('run_time').tail(2)
by = ['run_time', 'match_datetime']
>>> a.sort_values(by).equals(b.sort_values(by))
True
>>> b.sort_values(by).equals(c.sort_values(by))
True
I have a Dataframe of raw data:
df
Out:
Date_time 10a 10b 10c 40a 40b 40c 100a 100b 100c
120 2019-02-04 16:00:00 26.7 26.9 NaN 26.7 NaN NaN 24.9 NaN NaN
121 2019-02-04 17:00:00 23.4 24.0 23.5 24.3 24.1 24.0 25.1 24.8 25.1
122 2019-02-04 18:00:00 23.1 24.0 23.3 24.3 24.1 24.0 25.1 24.8 25.1
123 2019-02-04 19:00:00 22.8 23.8 22.9 24.3 24.1 24.0 25.1 24.8 25.1
124 2019-02-04 20:00:00 NaN 23.5 22.6 24.3 24.1 24.0 25.1 24.8 25.1
I wish to create a DataFrame containing the 'Date_time' column and several columns of data means. In this instance there will be 3 means for each row, one each for 10, 40, and 100, calculating the mean values for a, b, and c for each of these numbered intervals.
means
Out:
Date_time 10cm 40cm 100cm
120 2019-02-04 16:00:00 26.800000 26.700000 24.9
121 2019-02-04 17:00:00 23.633333 24.133333 25.0
122 2019-02-04 18:00:00 23.466667 24.133333 25.0
123 2019-02-04 19:00:00 23.166667 24.133333 25.0
124 2019-02-04 20:00:00 23.050000 24.133333 25.0
I have tried the following (taken from this answer):
means = df['Date_time'].copy()
means['10cm'] = df.loc[:, '10a':'10c'].mean(axis=1)
But this results in all the mean values being clumped together in one cell at the bottom of the 'Date_time' column with '10cm' being given as the cell's index.
means
Out:
120 2019-02-04 16:00:00
121 2019-02-04 17:00:00
122 2019-02-04 18:00:00
123 2019-02-04 19:00:00
124 2019-02-04 20:00:00
10cm 120 26.800000
121 23.633333
122 23.46...
Name: Date_time, dtype: object
I believe that this is something to do with means being a Series object rather than a DataFrame object when I copy across the 'Date_time' column, but I'm not sure. Any pointers would be greatly appreciated!
It was the Series issue. Turns out writing out the question helped me realise the issue! My solution was altering the initial creation of means using to_frame():
means = df['Date_time'].copy().to_frame()
I'll leave the question up in case anyone else is having a similar issue, to save them having to spend time writing it all up!
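An alternative sketch that avoids the Series pitfall entirely, assuming the column groups are known up front (toy data covering only the 10* and 40* groups; the loop is my own choice):

```python
import numpy as np
import pandas as pd

# Toy version of the raw data above (only the 10* and 40* groups).
df = pd.DataFrame({
    'Date_time': pd.date_range('2019-02-04 16:00', periods=3, freq='h'),
    '10a': [26.7, 23.4, 23.1],
    '10b': [26.9, 24.0, 24.0],
    '10c': [np.nan, 23.5, 23.3],
    '40a': [26.7, 24.3, 24.3],
    '40b': [np.nan, 24.1, 24.1],
})

# Selecting with [['Date_time']] keeps means a DataFrame, not a Series.
means = df[['Date_time']].copy()
for depth, cols in {'10': ['10a', '10b', '10c'], '40': ['40a', '40b']}.items():
    means[depth + 'cm'] = df[cols].mean(axis=1)  # NaNs are skipped by default
```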
I am analyzing data from an Excel file.
I want to create a DataFrame by parsing the Excel data using Python.
The data in my Excel file looks as follows:
The first row, highlighted in yellow, contains the match, which will be one of the columns in the DataFrame I want to create.
In fact, the second and fourth rows are the names of the columns I want in the new DataFrame.
The third and fifth rows are the values of each column.
The sample here is only for one match; I have multiple matches in the Excel file.
I want to create a DataFrame that contains the Match column and all the names in blue in the file.
I have attached the sample file that contains multiple matches.
Download the file here.
My expected data frame is
Match 1-0 2-0 2-1 3-0 3-1 3-2 4-0 4-1 4-2 4-3.......
MOL Vivi -vs- Chelsea 14 42 20 170 85 85 225 225 225 .....
Can anyone advise me how to parse the Excel data and convert it to a DataFrame?
Thanks,
Zep
Use:
import pandas as pd
from datetime import datetime
df = pd.read_excel('test_match.xlsx')
#mask to check for a-z characters in column HOME -vs- AWAY
m1 = df['HOME -vs- AWAY'].str.contains('[a-z]', na=False)
#create index by matches
df.index = df['HOME -vs- AWAY'].where(m1).ffill()
df.index.name = 'Match'
#remove same index and HOME -vs- AWAY column rows
df = df[df.index != df['HOME -vs- AWAY']].copy()
#test if datetime or string
m2 = df['HOME -vs- AWAY'].apply(lambda x: isinstance(x, datetime))
m3 = df['HOME -vs- AWAY'].apply(lambda x: isinstance(x, str))
#select next rows and set new column names
df1 = df[m2.shift().fillna(False)]
df1.columns = df[m2].iloc[0]
#also remove all-NaN columns
df2 = df[m3.shift().fillna(False)].dropna(axis=1, how='all')
df2.columns = df[m3].iloc[0].dropna()
#join together
df = pd.concat([df1, df2], axis=1).astype(float).reset_index().rename_axis(None, axis=1)
print (df.head())
Match 2000-01-01 00:00:00 2000-02-01 00:00:00 \
0 MOL Vidi -vs- Chelsea 14.00 42.00
1 Lazio -vs- Eintracht Frankfurt 8.57 11.55
2 Sevilla -vs- FC Krasnodar 7.87 6.63
3 Villarreal -vs- Spartak Moscow 7.43 7.03
4 Rennes -vs- FC Astana 4.95 6.38
2018-02-01 00:00:00 2000-03-01 00:00:00 2018-03-01 00:00:00 \
0 20.00 170.00 85.00
1 7.87 23.80 15.55
2 7.87 8.72 8.65
3 7.07 10.00 9.43
4 7.33 12.00 13.20
2018-03-02 00:00:00 2000-04-01 00:00:00 2018-04-01 00:00:00 \
0 85.0 225.00 225.00
1 21.3 64.30 42.00
2 25.9 14.80 14.65
3 23.9 19.35 17.65
4 38.1 31.50 34.10
2018-04-02 00:00:00 ... 0-1 0-2 2018-01-02 00:00:00 \
0 225.0 ... 5.6 6.80 7.00
1 55.7 ... 11.0 19.05 10.45
2 38.1 ... 28.0 79.60 29.20
3 38.4 ... 20.9 58.50 22.70
4 81.4 ... 12.9 42.80 22.70
0-3 2018-01-03 00:00:00 2018-02-03 00:00:00 0-4 \
0 12.5 12.0 32.0 30.0
1 48.4 27.4 29.8 167.3
2 223.0 110.0 85.4 227.5
3 203.5 87.6 73.4 225.5
4 201.7 97.6 103.6 225.5
2018-01-04 00:00:00 2018-02-04 00:00:00 2018-03-04 00:00:00
0 29.0 60.0 220.0
1 91.8 102.5 168.3
2 227.5 227.5 227.5
3 225.5 225.5 225.5
4 225.5 225.5 225.5
[5 rows x 27 columns]
I have an Excel sheet with dates and some values, and I want to convert it to a pandas DataFrame and select only the rows between certain dates.
For some reason I cannot select a row by its date index.
Raw Data in Excel file
MCU
Timestamp 50D 10P1 10P2 10P3 10P6 10P9 10P12
12-Feb-15 25.17 5.88 5.92 5.98 6.18 6.23 6.33
11-Feb-15 25.9 6.05 6.09 6.15 6.28 6.31 6.39
10-Feb-15 26.38 5.94 6.05 6.15 6.33 6.39 6.46
Code
xls = pd.ExcelFile('e:/Data.xlsx')
vols = xls.parse(asset.upper()+'VOL',header=1)
vols.set_index('Timestamp',inplace=True)
Data before set_index
Timestamp 50D 10P1 10P2 10P3 10P6 10P9 10P12 25P1 25P2 \
0 2015-02-12 25.17 5.88 5.92 5.98 6.18 6.23 6.33 2.98 3.08
1 2015-02-11 25.90 6.05 6.09 6.15 6.28 6.31 6.39 3.12 3.17
2 2015-02-10 26.38 5.94 6.05 6.15 6.33 6.39 6.46 3.01 3.16
Data after set_index
50D 10P1 10P2 10P3 10P6 10P9 10P12 25P1 25P2 25P3 \
Timestamp
2015-02-12 25.17 5.88 5.92 5.98 6.18 6.23 6.33 2.98 3.08 3.21
2015-02-11 25.90 6.05 6.09 6.15 6.28 6.31 6.39 3.12 3.17 3.32
2015-02-10 26.38 5.94 6.05 6.15 6.33 6.39 6.46 3.01 3.16 3.31
Output
>>> vols.index
<class 'pandas.tseries.index.DatetimeIndex'>
[2015-02-12, ..., NaT]
Length: 1478, Freq: None, Timezone: None
>>> vols[date(2015,2,12)]
*** KeyError: datetime.date(2015, 2, 12)
I would expect this not to fail, and I should also be able to select a range of dates. I have tried many combinations without success.
Using a datetime.date instance to retrieve the row won't work; you just need a string representation of the date, e.g. '2015-02-12' or '2015/02/14'.
Secondly, vols[date(2015,2,12)] actually looks in your DataFrame's column headings, not the index. Use loc to fetch rows by index label instead; for example, vols.loc['2015-02-12'].
This is driving me nuts: I can't plot column 'b'.
It plots only column 'A'.
This is my code; I have no idea what I'm doing wrong, probably something silly.
The DataFrame seems OK. The weird thing is that I can access both df['A'] and df['b'], but only df['A'].plot() works; if I issue df['b'].plot() I get this error:
Traceback (most recent call last):
  File "C:\Python27\lib\site-packages\IPython\core\interactiveshell.py", line 2883, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input>", line 1, in <module>
    df['b'].plot()
  File "C:\Python27\lib\site-packages\pandas\tools\plotting.py", line 2511, in plot_series
    **kwds)
  File "C:\Python27\lib\site-packages\pandas\tools\plotting.py", line 2317, in _plot
    plot_obj.generate()
  File "C:\Python27\lib\site-packages\pandas\tools\plotting.py", line 921, in generate
    self._compute_plot_data()
  File "C:\Python27\lib\site-packages\pandas\tools\plotting.py", line 997, in _compute_plot_data
    'plot'.format(numeric_data.__class__.__name__))
TypeError: Empty 'Series': no numeric data to plot
import sqlalchemy
import pandas as pd
import matplotlib.pyplot as plt
engine = sqlalchemy.create_engine(
'sqlite:///C:/Users/toto/PycharmProjects/my_db.sqlite')
tables = engine.table_names()
dic = {}
for t in tables:
sql = 'SELECT t."weight" FROM "' + t + '" t WHERE t."udl"="IBE SM"'
dic[t] = (pd.read_sql(sql, engine)['weight'][0], pd.read_sql(sql, engine)['weight'][1])
df = pd.DataFrame.from_dict(dic, orient='index').sort_index()
df = df.set_index(pd.DatetimeIndex(df.index))
df.columns = ['A', 'b']
print(df)
print(df.info())
df.plot()
plt.show()
Here is the output of the two print calls:
A b
2014-08-05 1.81 3.39
2014-08-06 1.81 3.39
2014-08-07 1.81 3.39
2014-08-08 1.80 3.37
2014-08-11 1.79 3.35
2014-08-13 1.80 3.36
2014-08-14 1.80 3.35
2014-08-18 1.80 3.35
2014-08-19 1.79 3.34
2014-08-20 1.80 3.35
2014-08-27 1.79 3.35
2014-08-28 1.80 3.35
2014-08-29 1.79 3.35
2014-09-01 1.79 3.35
2014-09-02 1.79 3.35
2014-09-03 1.79 3.36
2014-09-04 1.79 3.37
2014-09-05 1.80 3.38
2014-09-08 1.79 3.36
2014-09-09 1.79 3.35
2014-09-10 1.78 3.35
2014-09-11 1.78 3.34
2014-09-12 1.78 3.34
2014-09-15 1.78 3.35
2014-09-16 1.78 3.35
2014-09-17 1.78 3.35
2014-09-18 1.78 3.34
2014-09-19 1.79 3.35
2014-09-22 1.79 3.36
2014-09-23 1.80 3.37
... ... ...
2014-12-10 1.73 3.29
2014-12-11 1.74 3.27
2014-12-12 1.74 3.25
2014-12-15 1.74 3.24
2014-12-16 1.74 3.27
2014-12-17 1.75 3.28
2014-12-18 1.76 3.29
2014-12-19 1.04 1.39
2014-12-22 1.04 1.39
2014-12-23 1.04 1.4
2014-12-24 1.04 1.39
2014-12-29 1.04 1.39
2014-12-30 1.04 1.4
2015-01-02 1.04 1.4
2015-01-05 1.04 1.4
2015-01-06 1.04 1.4
2015-01-07 NaN 1.39
2015-01-08 NaN 1.39
2015-01-09 NaN 1.39
2015-01-12 NaN 1.38
2015-01-13 NaN 1.38
2015-01-14 NaN 1.38
2015-01-15 NaN 1.38
2015-01-16 NaN 1.38
2015-01-19 NaN 1.39
2015-01-20 NaN 1.38
2015-01-21 NaN 1.39
2015-01-22 NaN 1.4
2015-01-23 NaN 1,4
2015-01-26 NaN 1.41
[107 rows x 2 columns]
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 107 entries, 2014-08-05 00:00:00 to 2015-01-26 00:00:00
Data columns (total 2 columns):
A 93 non-null float64
b 107 non-null object
dtypes: float64(1), object(1)
memory usage: 2.1+ KB
None
Process finished with exit code 0
Just got it: 'b' is of object dtype rather than float64 because of this line, which contains a decimal comma:
2015-01-23 NaN 1,4
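A hedged sketch of one way to repair such a column before plotting: normalize the decimal comma, then coerce to numeric (toy values; the real df comes from the SQL query above):

```python
import pandas as pd

# Toy column mixing '.'-style and ','-style decimals, as in the pasted output.
df = pd.DataFrame({'b': ['3.39', '1,4', None]})

# Replace the decimal comma, then convert; unparseable entries become NaN.
df['b'] = pd.to_numeric(df['b'].str.replace(',', '.', regex=False),
                        errors='coerce')
```

Once 'b' is float64, df['b'].plot() has numeric data to work with.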