Converting data into pandas DataFrame format - Python

I have the following dummy calculation in Python:
from datetime import datetime
import pandas as pd
result = ### based on some calculation
print(result)
With this I get the result in the following format:
(
(
'date', pywintypes.datetime(2020, 6, 15, 0, 0, tzinfo=TimeZoneInfo('GMT Standard Time', True)), pywintypes.datetime(2020, 7, 15, 0, 0, tzinfo=TimeZoneInfo('GMT Standard Time', True))
),
(
'var1', 200, 340
),
(
'var2', 1200, -340
)
)
I don't understand what this format is exactly. How can I convert this data into a pandas DataFrame for further calculation?
Any pointer will be very helpful.

It seems like a tuple of tuples, but if you run this:
print(type(result))
you will get a better idea.

Given your tuple of tuples format you could use:
import pandas as pd
df = pd.DataFrame(result).set_index(0).T
Output:
date var1 var2
1 2020-06-15 200 1200
2 2020-07-15 340 -340
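One thing to watch after the transpose: every column is still object dtype (the dates are pywintypes/datetime objects and the numbers are boxed Python scalars), which can trip up later arithmetic. A minimal sketch of the dtype cleanup, using a plain tuple of tuples with stdlib datetimes standing in for the COM result:

```python
import pandas as pd
from datetime import datetime

# Stand-in for the reported result: (name, value1, value2) rows
result = (
    ('date', datetime(2020, 6, 15), datetime(2020, 7, 15)),
    ('var1', 200, 340),
    ('var2', 1200, -340),
)

# First column becomes the index, then transpose so names become columns
df = pd.DataFrame(result).set_index(0).T

# Everything is object dtype at this point; convert explicitly
df['date'] = pd.to_datetime(df['date'])
df[['var1', 'var2']] = df[['var1', 'var2']].apply(pd.to_numeric)

print(df.dtypes)
```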

You can also try:
pd.DataFrame(list(result))
Note that this keeps each inner tuple as a row rather than a column.

Related

Problem about using talib to generate MACD dataframe using different period sets

import yfinance as yf
import pandas as pd
import talib

code = '2800'
para_dict = {
    'sample_period_list': [200],
    'fastperiod_list': [12, 16],
    'slowperiod_list': [26, 30],
    'signalperiod_list': [8, 12],
    'higher_percentile_list': [0.8],
    'profit_target': [0.04],
    'stop_loss': [-0.04]
}
start_date = '2020-01-01'
end_date = '2022-10-10'
df_dict = {}

df = yf.Ticker(code + '.HK').history(start=start_date, end=end_date)
df = df[df['Volume'] > 0]
df = df[['Open', 'High', 'Low', 'Close']]
# df['pnl_percentage'] = df['Open'].pct_change()
df = df.reset_index()

for fastperiod in para_dict['fastperiod_list']:
    for slowperiod in para_dict['slowperiod_list']:
        for signalperiod in para_dict['signalperiod_list']:
            macd_key = str(fastperiod) + '_' + str(slowperiod) + '_' + str(signalperiod)
            df['macd'], df['macdsignal'], df['macdhist'] = talib.MACD(df['Close'], fastperiod=fastperiod, slowperiod=slowperiod, signalperiod=signalperiod)
            df_dict[macd_key] = df
print(df_dict)
I can't get the right dataframe for each set of MACD periods; instead, the code above produces the same dataframe for every key. Why?
The reason is that every dict entry points to the same dataframe: if you change one, they all change, so in your example they will all end up equal to the last df.
You can read more about it in these questions:
Modifying one dataframe appears to change another
Why can pandas DataFrames change each other?
As a solution in your case, store a copy of the dataframe rather than the dataframe itself:
df_dict[macd_key] = df.copy()
# instead of df_dict[macd_key] = df
That will solve your issue.
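The aliasing is easy to demonstrate in isolation; a small sketch (the names are illustrative only):

```python
import pandas as pd

df = pd.DataFrame({'Close': [1.0, 2.0, 3.0]})

d = {}
d['a'] = df          # stores a reference to the same object
d['b'] = df.copy()   # stores an independent copy

df['Close'] = 0.0    # mutate the original in place

print(d['a']['Close'].tolist())  # changed along with df
print(d['b']['Close'].tolist())  # unaffected
```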

How to convert a column in a dataframe to an index datetime object?

I have a question about how to convert a 'Timestamp' column into a datetime index, and then drop the column once it has become the index.
df = {'Timestamp':['20/01/2021 01:00:00.12 AM','20/01/2021 01:00:00.21 AM','20/01/2021 01:00:01.34 AM'],
'Value':['14','178','158']}
I tried the following, but it obviously didn't work:
df.Timestamp = pd.to_datetime(df.Timestamp.str[0])
df=df.set_index(['Timestamp'], drop=True)
FYI, the df is actually the product of a lot of text processing, so unfortunately I cannot just do read_csv and parse the datetime there. :( So yes, the df is exactly as described above.
Thank you.
Don't enclose 'Timestamp' in square brackets.
import pandas as pd
df = pd.DataFrame({'Timestamp':['20/01/2021 01:00:00.12 AM','20/01/2021 01:00:00.21 AM','20/01/2021 01:00:01.34 AM'],
'Value':['14','178','158']})
df['Timestamp'] = pd.to_datetime(df['Timestamp'])
df = df.set_index('Timestamp')
print(df)
## Output
                         Value
Timestamp
2021-01-20 01:00:00.120     14
2021-01-20 01:00:00.210    178
2021-01-20 01:00:01.340    158
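If pd.to_datetime has trouble inferring the day-first, 12-hour format on its own, passing an explicit format string is more robust; a sketch (the format string below is an assumption matching the sample strings):

```python
import pandas as pd

df = pd.DataFrame({'Timestamp': ['20/01/2021 01:00:00.12 AM',
                                 '20/01/2021 01:00:00.21 AM',
                                 '20/01/2021 01:00:01.34 AM'],
                   'Value': ['14', '178', '158']})

# %d/%m/%Y = day-first date, %I ... %p = 12-hour clock, %f = fractional seconds
df['Timestamp'] = pd.to_datetime(df['Timestamp'],
                                 format='%d/%m/%Y %I:%M:%S.%f %p')
df = df.set_index('Timestamp')
print(df)
```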

Strange behavior of datetimes when loaded into pd.DataFrame

I'm trying to construct simple DataFrames. Both have a date whereas the first has one additional column:
import pandas as pd
import datetime as dt
import numpy as np
a = pd.DataFrame(np.array([
    [dt.datetime(2018, 1, 10), 5.0]]), columns=['date', 'amount'])
print(a)
#                  date amount
# 0 2018-01-10 00:00:00      5
b = pd.DataFrame(np.array([
    [dt.datetime(2018, 1, 10)]]), columns=['date'])
print(b)
#         date
# 0 2018-01-10
Why are the dates interpreted differently (with and without time)? It gives me problems when I later try to apply merges.
Ok, so here is what happens. I will use the following code:
import pandas as pd
import datetime as dt
import numpy as np
a_val = np.array([[dt.datetime(2018, 1, 10), 5.0]])
a = pd.DataFrame(a_val, columns=['date', 'amount'])
b_val = np.array([[dt.datetime(2018, 1, 10)]])
b = pd.DataFrame(b_val, columns=['date'])
I just split the arrays out of the two DataFrame constructors so we can inspect them separately. First let's print the a_val and b_val variables:
print(a_val, b_val)
# output: [[datetime.datetime(2018, 1, 10, 0, 0) 5.0]] [[datetime.datetime(2018, 1, 10, 0, 0)]]
So far so good: the objects are datetime.datetime.
Now let's access the values of the dataframe with .values:
print(a.values, b.values)
# output: [[datetime.datetime(2018, 1, 10, 0, 0) 5.0]] [['2018-01-10T00:00:00.000000000']]
Things are messed up here. Let's print the type of the date:
print(type(a.values[0][0]), type(b.values[0][0]))
# output: <class 'datetime.datetime'> <class 'numpy.datetime64'>
OK, that's the thing: since the second frame's only column contains nothing but date objects, pandas converts it to a datetime64 column during construction (note that the np.array print above still showed datetime.datetime, so the cast happens in the DataFrame constructor), and numpy.datetime64 has a different formatting. In the first frame the datetime shares its row with a float, so the whole array stays object dtype and the values are left as is.
Short version: if you have a collection of mixed objects like dates, strings, ints etc., pass a list, not a numpy array.
Both columns in a are objects because of the numpy array that's an intermediate (and is of type object). I'd think that not implicitly interpreting mixed objects is probably good behavior.
a = pd.DataFrame([[dt.datetime(2018, 1, 10), 5.0]], columns=['date', 'amount'])
This seems to be more along the lines of what you want.
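If you do end up with the mixed-dtype version of a, normalizing both 'date' columns with pd.to_datetime before merging sidesteps the later merge problems; a sketch:

```python
import pandas as pd
import datetime as dt
import numpy as np

a = pd.DataFrame(np.array([[dt.datetime(2018, 1, 10), 5.0]]),
                 columns=['date', 'amount'])
b = pd.DataFrame(np.array([[dt.datetime(2018, 1, 10)]]), columns=['date'])

# Normalize both 'date' columns to datetime64[ns] so the keys compare equal
a['date'] = pd.to_datetime(a['date'])
b['date'] = pd.to_datetime(b['date'])

merged = a.merge(b, on='date')
print(merged)
```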

Pandas Yahoo Stock API

I am new to pandas (and Python) and trying to work with the Yahoo API for stock prices.
I need to get the data, loop through it and grab the dates and values.
Here is the code:
df = pd.get_data_yahoo(symbols='AAPL',
                       start=datetime(2011, 1, 1),
                       end=datetime(2012, 1, 1),
                       interval='m')
results are:
df
Open High Low Close Volume
Date
2011-01-03 325.640015 348.600006 324.840027 339.320007 140234700
2011-02-01 341.299988 364.899994 337.720001 353.210022 127618700
2011-03-01 355.470001 361.669983 326.259979 348.510010 125874700
I can get the dates but not the month value of each date, because the date is the index(?).
What is the best way to loop through the data for this information? This is about processing the data, not sorting or searching it.
If you need to iterate over the rows in your dataframe, and do some processing, then pandas.DataFrame.apply() works great.
Code:
Some mock processing code...
def process_data(row):
    # the index becomes the name when converted to a series (row)
    print(row.name.month, row.Close)
Test Code:
import datetime as dt
from pandas_datareader import data
df = data.get_data_yahoo(
    'AAPL',
    start=dt.datetime(2011, 1, 1),
    end=dt.datetime(2011, 5, 1),
    interval='m')
print(df)
# process each row
df.apply(process_data, axis=1)
Results:
Open High Low Close Volume \
Date
2011-01-03 325.640015 348.600006 324.840027 339.320007 140234700
2011-02-01 341.299988 364.899994 337.720001 353.210022 127618700
2011-03-01 355.470001 361.669983 326.259979 348.510010 125874700
2011-04-01 351.110016 355.130005 320.160004 350.130005 128252100
Adj Close
Date
2011-01-03 43.962147
2011-02-01 45.761730
2011-03-01 45.152802
2011-04-01 45.362682
1 339.320007
2 353.210022
3 348.51001
4 350.130005
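If you'd rather use a plain loop than apply, itertuples also exposes the index (the date) on each row. A sketch with a small hand-made frame standing in for the Yahoo download:

```python
import pandas as pd

# Small stand-in for the downloaded frame, indexed by date
df = pd.DataFrame(
    {'Close': [339.320007, 353.210022]},
    index=pd.to_datetime(['2011-01-03', '2011-02-01']))
df.index.name = 'Date'

for row in df.itertuples():
    # row.Index is the date from the index; row.Close is the column value
    print(row.Index.month, row.Close)
```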
Here is what made my life groovy when trying to work with the data from Yahoo.
First was getting the date from the dataframe index:
df = df.assign(date=df.index.date)
Here are a few others I found helpful when dealing with the data:
df['diff'] = df['Close'].diff()
df['pct_chg'] = df['Close'].pct_change()
df['hl'] = df['High'] - df['Low']
Pandas is amazing stuff.
I believe this should work for you.
import pandas_datareader.data as web
import datetime
start = datetime.datetime(2013, 1, 1)
end = datetime.datetime(2016, 1, 27)
df = web.DataReader("GOOGL", 'yahoo', start, end)
dates =[]
for x in range(len(df)):
    newdate = str(df.index[x])
    newdate = newdate[0:10]
    dates.append(newdate)
df['dates'] = dates
print(df.head())
print(df.tail())
Also, take a look at the link below for more helpful hints of how to do these kinds of things.
https://pandas-datareader.readthedocs.io/en/latest/remote_data.html#yahoo-finance
from pandas_datareader import data as pdr
from datetime import date
import yfinance as yf
yf.pdr_override()
import pandas as pd
import requests
import json
from os import listdir
from os.path import isfile, join
# Tickers list
tickers_list = ['AAPL', 'GOOGL', 'FB', 'WB', 'MO']
today = date.today()
# Choose the date window for the download
start_date = "2010-01-01"
files = []

def getData(ticker):
    print(ticker)
    data = pdr.get_data_yahoo(ticker, start=start_date, end=today)
    dataname = ticker + '_' + str(today)
    files.append(dataname)
    SaveData(data, dataname)

# Create a 'data' folder first; the csv files are saved into it
def SaveData(df, filename):
    df.to_csv('./data/' + filename + '.csv')

for tik in tickers_list:
    getData(tik)

Python Pandas join dataframes on index

I am trying to join two dataframes on the same column, "Date". The code is as follows:
import pandas as pd
from datetime import datetime
df_train_csv = pd.read_csv('./train.csv',parse_dates=['Date'],index_col='Date')
start = datetime(2010, 2, 5)
end = datetime(2012, 10, 26)
df_train_fly = pd.date_range(start, end, freq="W-FRI")
df_train_fly = pd.DataFrame(pd.Series(df_train_fly), columns=['Date'])
merged = df_train_csv.join(df_train_fly.set_index(['Date']), on = ['Date'], how = 'right', lsuffix='_x')
It complains that dataframe df_train_csv has no column named "Date". I'd like to set "Date" as the index in both dataframes, and I am wondering what the best way is to join two dataframes that have dates as the index.
UPDATE:
That is the sample data
Date,Weekly_Sales
2010-02-05,24924.5
2010-02-12,46039.49
2010-02-19,41595.55
2010-02-26,19403.54
2010-03-05,21827.9
2010-03-12,21043.39
2010-03-19,22136.64
2010-03-26,26229.21
2010-04-02,57258.43
2010-04-09,42960.91
2010-04-16,17596.96
2010-04-23,16145.35
2010-04-30,16555.11
2010-05-07,17413.94
2010-05-14,18926.74
2010-05-21,14773.04
2010-05-28,15580.43
2010-06-04,17558.09
2010-06-11,16637.62
2010-06-18,16216.27
2010-06-25,16328.72
2010-07-02,16333.14
2010-07-09,17688.76
2010-07-16,17150.84
2010-07-23,15360.45
2010-07-30,15381.82
2010-08-06,17508.41
2010-08-13,15536.4
2010-08-20,15740.13
2010-08-27,15793.87
2010-09-03,16241.78
2010-09-10,18194.74
2010-09-17,19354.23
2010-09-24,18122.52
2010-10-01,20094.19
2010-10-08,23388.03
2010-10-15,26978.34
2010-10-22,25543.04
2010-10-29,38640.93
2010-11-05,34238.88
2010-11-12,19549.39
2010-11-19,19552.84
2010-11-26,18820.29
2010-12-03,22517.56
2010-12-10,31497.65
2010-12-17,44912.86
2010-12-24,55931.23
2010-12-31,19124.58
2011-01-07,15984.24
2011-01-14,17359.7
2011-01-21,17341.47
2011-01-28,18461.18
2011-02-04,21665.76
2011-02-11,37887.17
2011-02-18,46845.87
2011-02-25,19363.83
2011-03-04,20327.61
2011-03-11,21280.4
2011-03-18,20334.23
2011-03-25,20881.1
2011-04-01,20398.09
2011-04-08,23873.79
2011-04-15,28762.37
2011-04-22,50510.31
2011-04-29,41512.39
2011-05-06,20138.19
2011-05-13,17235.15
2011-05-20,15136.78
2011-05-27,15741.6
2011-06-03,16434.15
2011-06-10,15883.52
2011-06-17,14978.09
2011-06-24,15682.81
2011-07-01,15363.5
2011-07-08,16148.87
2011-07-15,15654.85
2011-07-22,15766.6
2011-07-29,15922.41
2011-08-05,15295.55
2011-08-12,14539.79
2011-08-19,14689.24
2011-08-26,14537.37
2011-09-02,15277.27
2011-09-09,17746.68
2011-09-16,18535.48
2011-09-23,17859.3
2011-09-30,18337.68
2011-10-07,20797.58
2011-10-14,23077.55
2011-10-21,23351.8
2011-10-28,31579.9
2011-11-04,39886.06
2011-11-11,18689.54
2011-11-18,19050.66
2011-11-25,20911.25
2011-12-02,25293.49
2011-12-09,33305.92
2011-12-16,45773.03
2011-12-23,46788.75
2011-12-30,23350.88
2012-01-06,16567.69
2012-01-13,16894.4
2012-01-20,18365.1
2012-01-27,18378.16
2012-02-03,23510.49
2012-02-10,36988.49
2012-02-17,54060.1
2012-02-24,20124.22
2012-03-02,20113.03
2012-03-09,21140.07
2012-03-16,22366.88
2012-03-23,22107.7
2012-03-30,28952.86
2012-04-06,57592.12
2012-04-13,34684.21
2012-04-20,16976.19
2012-04-27,16347.6
2012-05-04,17147.44
2012-05-11,18164.2
2012-05-18,18517.79
2012-05-25,16963.55
2012-06-01,16065.49
2012-06-08,17666
2012-06-15,17558.82
2012-06-22,16633.41
2012-06-29,15722.82
2012-07-06,17823.37
2012-07-13,16566.18
2012-07-20,16348.06
2012-07-27,15731.18
2012-08-03,16628.31
2012-08-10,16119.92
2012-08-17,17330.7
2012-08-24,16286.4
2012-08-31,16680.24
2012-09-07,18322.37
2012-09-14,19616.22
2012-09-21,19251.5
2012-09-28,18947.81
2012-10-05,21904.47
2012-10-12,22764.01
2012-10-19,24185.27
2012-10-26,27390.81
I read it from a csv file, but sometimes some weeks may be missing. Therefore, I am generating a date range like this:
df_train_fly = pd.date_range(start, end, freq="W-FRI")
The generated range contains all weeks in the period, so I need to merge the two dataframes into one.
If I check df_train_csv['Date'] and df_train_fly['Date'] from the IPython console, they both show as dtype: datetime64[ns].
So let's dissect this:
df_train_csv = pd.read_csv('./train.csv',parse_dates=['Date'],index_col='Date')
OK, the first problem here is that you have specified that the index column should be 'Date'; this means you will not have a 'Date' column anymore.
start = datetime(2010, 2, 5)
end = datetime(2012, 10, 26)
df_train_fly = pd.date_range(start, end, freq="W-FRI")
df_train_fly = pd.DataFrame(pd.Series(df_train_fly), columns=['Date'])
merged = df_train_csv.join(df_train_fly.set_index(['Date']), on = ['Date'], how = 'right', lsuffix='_x')
So the above join will not work, exactly as the error reported. To fix it:
# remove the index_col param so 'Date' stays a regular column
df_train_csv = pd.read_csv('./train.csv', parse_dates=['Date'])
# set 'Date' as the index of df_train_fly and join the left's 'Date' column against it
merged = df_train_csv.join(df_train_fly.set_index('Date'), on='Date', how='right', lsuffix='_x')
OR keep index_col='Date' in read_csv and don't pass the 'on' param:
merged = df_train_csv.join(df_train_fly.set_index('Date'), how='right', lsuffix='_x')
The latter uses the index of both dataframes to join on.
You can also achieve the same result by performing a merge instead (note that merge takes suffixes=, not lsuffix=):
merged = df_train_csv.merge(df_train_fly.set_index('Date'), left_index=True, right_index=True, how='right')
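Since the point of the right join is to surface the missing Fridays, merge with indicator=True makes them easy to pick out; a sketch using a tiny inline frame in place of train.csv:

```python
import pandas as pd
from datetime import datetime

# Stand-in for train.csv, with the 2010-02-19 week deliberately missing
df_train_csv = pd.DataFrame({
    'Date': pd.to_datetime(['2010-02-05', '2010-02-12', '2010-02-26']),
    'Weekly_Sales': [24924.5, 46039.49, 19403.54],
})

# Every Friday in the period, as a one-column frame
all_fridays = pd.DataFrame(
    {'Date': pd.date_range(datetime(2010, 2, 5), datetime(2010, 2, 26),
                           freq='W-FRI')})

merged = df_train_csv.merge(all_fridays, on='Date', how='right',
                            indicator=True)
# Rows present only in the calendar are the missing weeks
missing = merged.loc[merged['_merge'] == 'right_only', 'Date']
print(missing.tolist())
```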
