How to convert a dictionary to a pandas DataFrame? - Python

I want this output, which is a dict, to be converted into a pandas DataFrame with a few columns of interest. Note that my actual output is longer; I have posted only part of it, but I hope you can understand what I want.
My DataFrame should have the columns `['Date', 'Change in OI', 'Open Interest']`; preferably, Date should be the index.
from datetime import date
from nsepy import get_history  # get_history comes from the nsepy package

strikes = [690, 700, 710]
data = {}
for s in strikes:
    data[s] = get_history(symbol="CIPLA",
                          start=date(2020, 7, 1),
                          end=date(2020, 7, 17),
                          option_type="CE",
                          strike_price=s,
                          expiry_date=date(2020, 7, 30))
OUTPUT
{690: Symbol Expiry Option Type Strike Price Open High Low \
Date
2020-07-01 CIPLA 2020-07-30 CE 690.0 11.50 11.50 7.70
2020-07-02 CIPLA 2020-07-30 CE 690.0 8.90 20.90 8.50
2020-07-03 CIPLA 2020-07-30 CE 690.0 17.75 17.75 12.00
2020-07-06 CIPLA 2020-07-30 CE 690.0 11.30 11.30 9.60
2020-07-07 CIPLA 2020-07-30 CE 690.0 10.70 12.25 10.60
2020-07-08 CIPLA 2020-07-30 CE 690.0 12.95 14.10 11.45
2020-07-09 CIPLA 2020-07-30 CE 690.0 14.00 14.00 11.60
2020-07-10 CIPLA 2020-07-30 CE 690.0 12.50 13.00 10.95
2020-07-13 CIPLA 2020-07-30 CE 690.0 11.10 11.65 9.65
2020-07-14 CIPLA 2020-07-30 CE 690.0 10.65 10.70 8.40
2020-07-15 CIPLA 2020-07-30 CE 690.0 7.55 10.00 7.55
2020-07-16 CIPLA 2020-07-30 CE 690.0 11.20 18.65 7.25
2020-07-17 CIPLA 2020-07-30 CE 690.0 18.85 25.75 14.80
Close Last Settle Price Number of Contracts Turnover \
Date
2020-07-01 8.85 8.85 8.85 66 5.995900e+07
2020-07-02 16.85 20.50 16.85 68 6.235500e+07
2020-07-03 13.00 13.25 13.00 117 1.069840e+08
2020-07-06 10.65 10.75 10.65 76 6.918800e+07
2020-07-07 11.00 11.00 11.00 64 5.836300e+07
2020-07-08 11.95 11.95 11.95 84 7.674300e+07
2020-07-09 12.00 12.00 12.00 25 2.284000e+07
2020-07-10 11.10 11.10 11.10 50 4.564100e+07
2020-07-13 10.05 10.05 10.05 36 3.278000e+07
2020-07-14 8.50 8.40 8.50 39 3.546700e+07
2020-07-15 8.45 8.40 8.45 31 2.816200e+07
2020-07-16 17.20 16.80 17.20 803 7.350000e+08
2020-07-17 20.05 19.30 20.05 1708 1.577693e+09
Premium Turnover Open Interest Change in OI Underlying
Date
2020-07-01 757000.0 119600 5200 NaN
2020-07-02 1359000.0 113100 -6500 646.20
2020-07-03 2035000.0 131300 18200 638.80
2020-07-06 1016000.0 123500 -7800 NaN
2020-07-07 955000.0 128700 5200 636.55
2020-07-08 1395000.0 130000 1300 NaN
2020-07-09 415000.0 130000 0 NaN
2020-07-10 791000.0 130000 0 NaN
2020-07-13 488000.0 123500 -6500 NaN
2020-07-14 484000.0 115700 -7800 NaN
2020-07-15 355000.0 124800 9100 NaN
2020-07-16 14709000.0 302900 178100 NaN
2020-07-17 45617000.0 243100 -59800 689.10 }
The same structure repeats in the output for the strikes 700 and 710.
I already tried pd.DataFrame.from_dict(data), but it did not give me what I want.

You can try the following approaches.
1.
import pandas as pd
df = pd.DataFrame(<<your dictionary>>)  # you can pass the dictionary directly
Alternatively, you can also use the following:
import pandas as pd
cols = [<<list of column names>>]  # to specify different column names
df = pd.DataFrame.from_dict(<<dictionary name>>, orient='index', columns=cols)  # columns= requires orient='index'
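For example, with a small dict (illustrative values only), a quick sketch:
import pandas as pd

d = {'row1': [1, 10], 'row2': [2, 20]}  # illustrative data
df = pd.DataFrame.from_dict(d, orient='index', columns=['x', 'y'])
#       x   y
# row1  1  10
# row2  2  20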

Use your dictionary in this way to convert it into a DataFrame:
import pandas as pd
df = pd.DataFrame(your_dictionary)  # pass the dict object itself

Related

Copying table using pd.read_html

Using pd.read_html in Python, I am trying to copy a table from the following website:
https://finance.naver.com/sise/investorDealTrendDay.nhn?bizdate=215600&sosok=&page=2
import pandas as pd
pg_url = 'https://finance.naver.com/sise/investorDealTrendDay.nhn?bizdate=215600&sosok=&page=2'
df = pd.DataFrame()
df = df.append(pd.read_html(pg_url, header=0)[0], ignore_index=False)
Yet, I can't copy the numbers for some reason...
I'd appreciate your help in figuring out what went wrong.
It works well for me; remove header=0 and then drop the rows that are all NaN:
url ='https://finance.naver.com/sise/investorDealTrendDay.nhn?bizdate=215600&sosok=&page=2'
df = pd.read_html(url)[0].dropna(how='all')
print (df)
날짜 개인 외국인 기관계 기관 \
날짜 개인 외국인 기관계 금융투자 보험 투신(사모) 은행 기타금융기관
0 20.08.06 -850.0 1638.0 -801.0 2247.0 -517.0 -993.0 46.0 -138.0
1 20.08.05 4315.0 -516.0 -3666.0 -1277.0 -441.0 -871.0 -18.0 -30.0
2 20.08.04 1844.0 -583.0 -1488.0 392.0 -493.0 -205.0 14.0 -54.0
3 20.08.03 6237.0 -2687.0 -3795.0 -2841.0 -108.0 -411.0 0.0 -5.0
4 20.07.31 4716.0 -556.0 -3861.0 -2659.0 -129.0 -709.0 -7.0 -4.0
8 20.07.30 64.0 2247.0 -2342.0 423.0 -171.0 -428.0 -3.0 -13.0
9 20.07.29 476.0 2936.0 -3368.0 -1346.0 -296.0 -698.0 -8.0 -92.0
10 20.07.28 -10495.0 13060.0 -2220.0 -1440.0 -526.0 318.0 12.0 -76.0
11 20.07.27 -2996.0 1584.0 1395.0 1968.0 -20.0 161.0 -179.0 -58.0
12 20.07.24 2881.0 876.0 -3678.0 -1173.0 -545.0 -843.0 -43.0 -8.0
기타법인
연기금등 기타법인
0 -1446.0 13.0
1 -1029.0 -133.0
2 -1142.0 227.0
3 -429.0 246.0
4 -352.0 -299.0
8 -2151.0 30.0
9 -929.0 -44.0
10 -507.0 -345.0
11 -476.0 16.0
12 -1066.0 -79.0
If you need the first column as the index, converted to a DatetimeIndex:
url ='https://finance.naver.com/sise/investorDealTrendDay.nhn?bizdate=215600&sosok=&page=2'
df = pd.read_html(url, index_col=0)[0].dropna(how='all')
df.index = pd.to_datetime(df.index, format='%y.%m.%d')
print (df)
날짜 개인 외국인 기관계 기관 \
날짜 개인 외국인 기관계 금융투자 보험 투신(사모) 은행 기타금융기관
2020-08-06 -850.0 1638.0 -801.0 2247.0 -517.0 -993.0 46.0 -138.0
2020-08-05 4315.0 -516.0 -3666.0 -1277.0 -441.0 -871.0 -18.0 -30.0
2020-08-04 1844.0 -583.0 -1488.0 392.0 -493.0 -205.0 14.0 -54.0
2020-08-03 6237.0 -2687.0 -3795.0 -2841.0 -108.0 -411.0 0.0 -5.0
2020-07-31 4716.0 -556.0 -3861.0 -2659.0 -129.0 -709.0 -7.0 -4.0
2020-07-30 64.0 2247.0 -2342.0 423.0 -171.0 -428.0 -3.0 -13.0
2020-07-29 476.0 2936.0 -3368.0 -1346.0 -296.0 -698.0 -8.0 -92.0
2020-07-28 -10495.0 13060.0 -2220.0 -1440.0 -526.0 318.0 12.0 -76.0
2020-07-27 -2996.0 1584.0 1395.0 1968.0 -20.0 161.0 -179.0 -58.0
2020-07-24 2881.0 876.0 -3678.0 -1173.0 -545.0 -843.0 -43.0 -8.0
날짜 기타법인
날짜 연기금등 기타법인
2020-08-06 -1446.0 13.0
2020-08-05 -1029.0 -133.0
2020-08-04 -1142.0 227.0
2020-08-03 -429.0 246.0
2020-07-31 -352.0 -299.0
2020-07-30 -2151.0 30.0
2020-07-29 -929.0 -44.0
2020-07-28 -507.0 -345.0
2020-07-27 -476.0 16.0
2020-07-24 -1066.0 -79.0

Creating a new column based on conditions

I have a dataframe with ID, START and END timestamps, and another reference table with ID, TIME and WEIGHT columns. Now I am trying to assign the weights to df1 based on the times.
If the TIME in df2 falls between START and END of df1, the corresponding weight should be assigned to that record in df1. I could simply use a left join, but the problem is that two or three weights might then be assigned to the same ID.
df1:
ID START END
2591642409 2018-08-20 06:00:00 2018-08-20 16:59:59
2591642409 2018-08-20 17:00:00 2018-08-21 01:59:59
2591642409 2018-08-21 02:00:00 2018-08-21 14:59:59
2591642409 2018-08-21 15:00:00 2018-08-21 15:59:59
2591642409 2018-08-21 15:00:00 2018-08-21 15:59:59
2591642409 2018-08-21 15:00:00 2018-08-21 14:59:59
2591642409 2018-08-21 15:00:00 2018-08-21 14:59:59
2591642409 2018-08-21 16:00:00 2018-08-25 11:59:59
2626784515 2018-09-12 12:41:00 2018-09-12 17:59:59
2626784515 2018-09-12 18:00:00 2018-09-12 22:27:59
2626784515 2018-09-12 22:28:00 2018-09-13 23:32:59
2626784515 2018-09-14 00:00:00 2018-09-13 23:59:59
2631776057 2018-09-16 03:29:00 2018-09-16 12:39:59
2631776057 2018-09-16 12:40:00 2018-09-16 13:33:59
2631776057 2018-09-16 13:34:00 2018-09-16 14:10:59
2694817807 2018-10-31 10:30:00 2018-11-01 15:57:59
2694817807 2018-11-01 15:58:00 2018-11-02 22:59:59
2694817807 2018-11-02 23:00:00 2018-11-02 23:55:59
2694817807 2018-11-02 23:56:00 2018-11-09 00:18:59
2694817807 2018-11-09 00:19:00 2018-11-09 05:55:59
2694817807 2018-11-09 05:56:00 2018-11-09 08:34:59
2694817807 2018-11-09 08:35:00 2018-11-09 16:59:59
2694817807 2018-11-09 17:00:00 2018-11-10 04:29:59
2694817807 2018-11-10 04:30:00 2018-11-10 09:23:59
2694817807 2018-11-10 09:24:00 2018-11-11 03:09:59
2694817807 2018-11-11 03:10:00 2018-11-11 16:54:59
2694817807 2018-11-11 16:55:00 2018-11-11 20:55:59
2694817807 2018-11-11 20:56:00 2018-11-12 19:59:59
2711413129 2018-11-12 20:00:00 2018-11-13 04:20:59
df2:
ID TIME WEIGHT
2591642409 2018-08-15 01:42:13 3.38
2626784515 2018-09-12 14:56:03 3.7
2631776057 2018-09-16 07:05:45 3.7
2694817807 2018-10-31 14:21:54 4.5
2694817807 2018-11-09 05:29:52 4.8
2711413129 2018-11-12 17:14:26 4.8
Expected df:
ID START END WEIGHT
2591642409 2018-08-20 06:00:00 2018-08-20 16:59:59 3.38
2591642409 2018-08-20 17:00:00 2018-08-21 01:59:59 3.38
2591642409 2018-08-21 02:00:00 2018-08-21 14:59:59 3.38
2591642409 2018-08-21 15:00:00 2018-08-21 15:59:59 3.38
2591642409 2018-08-21 15:00:00 2018-08-21 15:59:59 3.38
2591642409 2018-08-21 15:00:00 2018-08-21 14:59:59 3.38
2591642409 2018-08-21 15:00:00 2018-08-21 14:59:59 3.38
2591642409 2018-08-21 16:00:00 2018-08-25 11:59:59 3.38
2626784515 2018-09-12 12:41:00 2018-09-12 17:59:59 3.7
2626784515 2018-09-12 18:00:00 2018-09-12 22:27:59 3.7
2626784515 2018-09-12 22:28:00 2018-09-13 23:32:59 3.7
2626784515 2018-09-14 00:00:00 2018-09-13 23:59:59 3.7
2631776057 2018-09-16 03:29:00 2018-09-16 12:39:59 3.7
2631776057 2018-09-16 12:40:00 2018-09-16 13:33:59 3.7
2631776057 2018-09-16 13:34:00 2018-09-16 14:10:59 3.7
2694817807 2018-10-31 10:30:00 2018-11-01 15:57:59 4.5
2694817807 2018-11-01 15:58:00 2018-11-02 22:59:59 4.5
2694817807 2018-11-02 23:00:00 2018-11-02 23:55:59 4.5
2694817807 2018-11-02 23:56:00 2018-11-09 00:18:59 4.5
2694817807 2018-11-09 00:19:00 2018-11-09 05:55:59 4.5
2694817807 2018-11-09 05:56:00 2018-11-09 08:34:59 4.8
2694817807 2018-11-09 08:35:00 2018-11-09 16:59:59 4.8
2694817807 2018-11-09 17:00:00 2018-11-10 04:29:59 4.8
2694817807 2018-11-10 04:30:00 2018-11-10 09:23:59 4.8
2694817807 2018-11-10 09:24:00 2018-11-11 03:09:59 4.8
2694817807 2018-11-11 03:10:00 2018-11-11 16:54:59 4.8
2694817807 2018-11-11 16:55:00 2018-11-11 20:55:59 4.8
2694817807 2018-11-11 20:56:00 2018-11-12 19:59:59 4.8
2711413129 2018-11-12 20:00:00 2018-11-13 04:20:59 4.8
I am using the following code
mask = (df2['TIME'] > df1['START']) & (df2['TIME'] < df1['END'])
df1['WEIGHTS'] = np.where(mask, df2['WEIGHTS'], '')
but it throws a value error saying
ValueError: Can only compare identically-labeled Series objects
I'd really appreciate if I can get some help.
You can't compare two Series with different labels from different DataFrames in pandas, especially since len(df1) != len(df2). You have to either align them or join the two DataFrames; in this case, I believe a join is the best choice. After joining the DataFrames, you should be able to use your comparison.
You could try joining or merging the two frames first, then apply your filter:
df1.set_index('ID', inplace=True)
df2.set_index('ID', inplace=True)
df = df1.join(df2)
df_filtered = df[(df['TIME'] > df['START']) & (df['TIME'] < df['END'])]
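Alternatively, a minimal sketch of the same join-then-filter idea that also brings the matched weight back onto df1, starting from the original df1 and df2 shown in the question (it assumes the timestamp columns are already datetimes):
import pandas as pd

# Pair every weight of an ID with every interval of that ID, keep only the
# rows where TIME falls inside [START, END], then map the matched weight
# back onto df1 via the original row positions.
merged = df1.reset_index().merge(df2, on='ID', how='left')
inside = merged[(merged['TIME'] >= merged['START']) & (merged['TIME'] <= merged['END'])]
df1['WEIGHT'] = inside.set_index('index')['WEIGHT']
# Rows whose interval contains no TIME from df2 remain NaN with this approach.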

How to combine 3 separate columns of year (2-digit), month and day into a single date column

I am combining 3 separate columns of year, month and day into a single date column in my dataframe. But the year is only 2 digits, which gives an error.
I have tried to_datetime() to do this in a Jupyter notebook.
The dataframe is in this form:
Yr Mo Dy RPT VAL ROS KIL SHA BIR DUB CLA MUL CLO BEL
61 1 1 15.04 14.96 13.17 9.29 NaN 9.87 13.67 10.25 10.83 12.58 18.50
61 1 2 14.71 NaN 10.83 6.50 12.62 7.67 11.50 10.04 9.79 9.67 17.54
61 1 3 18.50 16.88 12.33 10.13 11.17 6.17 11.25 NaN 8.50 7.67 12.75
data.rename(columns={'Yr':'Year','Mo':'Month','Dy':'Day'},inplace=True)
data['Date']=pd.to_datetime(data[['Year','Month','Day']],format='%y%m%d')
The error I am getting is:
cannot assemble the datetimes: time data 610101 does not match format '%Y%m%d' (match)
The problem is that to_datetime with the specified columns ['Year','Month','Day'] needs a 4-digit (YYYY) year, so an alternative solution is needed because the year here is 2-digit (YY) only:
s = data[['Yr','Mo','Dy']].astype(str).apply('-'.join, 1)
data['Date'] = pd.to_datetime(s, format='%y-%m-%d')
print (data)
Yr Mo Dy RPT VAL ROS KIL SHA BIR DUB CLA MUL \
0 61 1 1 15.04 14.96 13.17 9.29 NaN 9.87 13.67 10.25 10.83
1 61 1 2 14.71 NaN 10.83 6.50 12.62 7.67 11.50 10.04 9.79
2 61 1 3 18.50 16.88 12.33 10.13 11.17 6.17 11.25 NaN 8.50
CLO BEL Date
0 12.58 18.50 2061-01-01
1 9.67 17.54 2061-01-02
2 7.67 12.75 2061-01-03
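Note that %y parses 61 as 2061, as the output above shows. If the readings are actually from 1961 (an assumption about the data, not something stated in the question), a small sketch to shift such dates back a century:
import pandas as pd

# Assumed fix: any parsed year that landed in the future is pushed back 100 years.
mask = data['Date'].dt.year > pd.Timestamp.now().year
data.loc[mask, 'Date'] -= pd.DateOffset(years=100)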

Daily total active time in a pandas dataframe

I'm new to Python and my English is not so good, so I'll try to explain my problem with the example below.
In :ds # is my dataframe
Out :DateStarted DateCompleted DayStarted DayCompleted \
1460 2017-06-12 14:03:32 2017-06-12 14:04:07 2017-06-12 2017-06-12
14445 2017-06-13 13:39:16 2017-06-13 13:40:32 2017-06-13 2017-06-13
14109 2017-06-21 10:25:36 2017-06-21 10:32:17 2017-06-21 2017-06-21
16652 2017-06-27 15:44:28 2017-06-27 15:44:41 2017-06-27 2017-06-27
30062 2017-07-05 09:49:01 2017-07-05 10:04:00 2017-07-05 2017-07-05
22357 2017-08-31 09:06:00 2017-08-31 09:10:31 2017-08-31 2017-08-31
39117 2017-09-08 08:43:07 2017-09-08 08:44:51 2017-09-08 2017-09-08
41903 2017-09-15 12:54:40 2017-09-15 14:00:06 2017-09-15 2017-09-15
74633 2017-09-27 12:41:09 2017-09-27 13:16:04 2017-09-27 2017-09-27
69315 2017-10-23 08:25:28 2017-10-23 08:26:09 2017-10-23 2017-10-23
87508 2017-10-30 12:19:19 2017-10-30 12:19:45 2017-10-30 2017-10-30
86828 2017-11-03 12:20:09 2017-11-03 12:24:56 2017-11-03 2017-11-03
89877 2017-11-06 13:52:05 2017-11-06 13:52:50 2017-11-06 2017-11-06
94970 2017-11-07 08:09:53 2017-11-07 08:10:15 2017-11-07 2017-11-07
94866 2017-11-28 14:38:14 2017-11-30 07:51:04 2017-11-28 2017-11-30
DailyTotalActiveTime diff
1460 NaN 35.0
14445 NaN 76.0
14109 NaN 401.0
16652 NaN 13.0
30062 NaN 899.0
22357 NaN 271.0
39117 NaN 104.0
41903 NaN 3926.0
74633 NaN 2095.0
69315 NaN 41.0
87508 NaN 26.0
86828 NaN 287.0
89877 NaN 45.0
94970 NaN 22.0
94866 NaN 148370.0
In the DailyTotalActiveTime column, I want to calculate how much active time each specific day has in total. The diff column is in seconds.
I tried this, but it gave no results:
for i in ds['diff']:
    if i <= 86400:
        ds['DailyTotalActiveTime']==i
    else:
        ds['DailyTotalActiveTime']==86400
        ds['DailyTotalActiveTime']+1 == i-86400
What can I do? Again, sorry for the explanation.
You should try = instead of ==.
To get you halfway there, you could do something like the following (I am sure there must be a simpler way, but I can't see it right now):
df['datestarted'] = pd.to_datetime(df['datestarted'])
df['datecompleted'] = pd.to_datetime(df['datecompleted'])
df['daystarted'] = df['datestarted'].dt.date
df['daycompleted'] = df['datecompleted'].dt.date
df['Date'] = df['daystarted']  # This is the unique date per row.
for row in df.itertuples():
    if (row.daycompleted - row.daystarted) > pd.Timedelta(days=0):
        for i in range(1, (row.daycompleted - row.daystarted).days + 1):
            df2 = pd.DataFrame([row]).drop('Index', axis=1)
            df2['Date'] = df2['Date'] + pd.Timedelta(days=i)
            df = df.append(df2)
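To actually fill DailyTotalActiveTime, a minimal sketch (one possible continuation, not the answer above) that splits each interval at midnight and sums the active seconds per calendar day; it assumes DateStarted/DateCompleted can be parsed as datetimes:
import pandas as pd

ds['DateStarted'] = pd.to_datetime(ds['DateStarted'])
ds['DateCompleted'] = pd.to_datetime(ds['DateCompleted'])

rows = []
for start, end in zip(ds['DateStarted'], ds['DateCompleted']):
    day = start.normalize()  # midnight of the start day
    while day <= end.normalize():
        day_end = day + pd.Timedelta(days=1)
        overlap = min(end, day_end) - max(start, day)
        rows.append({'Date': day.date(), 'seconds': overlap.total_seconds()})
        day = day_end

daily = pd.DataFrame(rows).groupby('Date')['seconds'].sum()
print(daily)  # total active seconds per day (at most 86400)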

Filter a timeseries with some predefined dates in Pandas

I have this code:
close1 = 'Close'; start = '12/18/2015 00:00:00'
end = '3/1/2016 00:00:00'; freq = '1d0h00min'
datefilter = pd.date_range(start=start, end=end, freq=freq).values
close[close['Datetime'].isin(datefilter)]  # Only dates in the range
But, strangely, some columns come back with NaN:
Datetime ENTA KITE BSTC SAGE AGEN MGNX ESPR FPRX
2015-12-18 31.73 63.38 16.34 56.88 12.24 NaN NaN 38.72
2015-12-21 32.04 63.60 16.26 56.75 12.18 NaN NaN 42.52
Just wondering about the reason, and how can we remedy it?
Original:
Datetime ENTA KITE BSTC SAGE AGEN MGNX ESPR FPRX
0 2013-03-21 17.18 29.0 20.75 30.1 11.52 11.52 38.72
1 2013-03-22 16.81 30.53 21.25 30.0 11.64 11.52 39.42
2 2013-03-25 16.83 32.15 20.8 27.59 11.7 11.52 42.52
3 2013-03-26 17.09 29.55 20.6 27.5 11.76 11.52 11.52
EDIT:
It seems to be related to the hh:mm:ss part of the datetime filtering.
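One possible remedy, as a sketch: compare dates only, so the hh:mm:ss part of Datetime cannot prevent a match (this assumes the Datetime column can be parsed by pd.to_datetime):
import pandas as pd

close['Datetime'] = pd.to_datetime(close['Datetime'])
datefilter = pd.date_range(start='12/18/2015', end='3/1/2016', freq='D')

# Normalize to midnight before matching, so intraday timestamps still hit the filter.
mask = close['Datetime'].dt.normalize().isin(datefilter)
print(close[mask])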
