I'm creating a stock screener based on fundamental metrics using the yahoofinancials module.
The code below returns a nested dictionary that I haven't been able to convert into a DataFrame for further analysis.
import pandas as pd
from yahoofinancials import YahooFinancials
ticker = 'RELIANCE.NS'
yahoo_financials = YahooFinancials(ticker)
income_statement_data_qt = yahoo_financials.get_financial_stmts('quarterly', 'income')
income_statement_data_qt
Output:
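The exact output is long, but it is a nested dictionary shaped roughly like this (abridged sketch; the two figures shown are taken from the answer's table further down, everything else is omitted):
{'incomeStatementHistoryQuarterly':
    {'RELIANCE.NS': [
        {'2021-03-31': {'totalRevenue': 1495750000000,
                        'costOfRevenue': 1034690000000}},  # ...more line items
        {'2020-12-31': {'totalRevenue': 1178600000000,
                        'costOfRevenue': 722490000000}}]}}  # ...more line items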
Ideally, I'd like to have the data flattened into a DataFrame with one row per line item and one column per quarter.
You can use a list comprehension to iterate over the dictionaries for that particular ticker and pandas concat to concatenate the data along the columns axis (axis=1). Then use rename_axis and reset_index to convert the index into a column with the desired name. Finally, create a new column with the ticker name at the first position using insert.
import pandas as pd
from yahoofinancials import YahooFinancials
ticker = 'RELIANCE.NS'
yahoo_financials = YahooFinancials(ticker)
income_statement_data_qt = yahoo_financials.get_financial_stmts('quarterly', 'income')
dict_list = income_statement_data_qt['incomeStatementHistoryQuarterly'][ticker]
df = pd.concat([pd.DataFrame(i) for i in dict_list], axis=1)
df = df.rename_axis('incomeStatementHistoryQuarterly').reset_index()
df.insert(0, 'ticker', ticker)
print(df)
Output from df
ticker incomeStatementHistoryQuarterly ... 2021-03-31 2020-12-31
0 RELIANCE.NS costOfRevenue ... 1.034690e+12 7.224900e+11
1 RELIANCE.NS discontinuedOperations ... NaN NaN
2 RELIANCE.NS ebit ... 1.571800e+11 1.490100e+11
3 RELIANCE.NS effectOfAccountingCharges ... NaN NaN
...
...
18 RELIANCE.NS sellingGeneralAdministrative ... 3.976000e+10 4.244000e+10
19 RELIANCE.NS totalOperatingExpenses ... 1.338570e+12 1.029590e+12
20 RELIANCE.NS totalOtherIncomeExpenseNet ... -1.330000e+09 2.020000e+09
21 RELIANCE.NS totalRevenue ... 1.495750e+12 1.178600e+12
[22 rows x 6 columns]
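Since the goal is a screener over many names, the same approach extends with a loop and one final concat. A minimal sketch (the second ticker is made up for illustration):
import pandas as pd
from yahoofinancials import YahooFinancials

tickers = ['RELIANCE.NS', 'TCS.NS']  # hypothetical screener universe
frames = []
for ticker in tickers:
    stmts = YahooFinancials(ticker).get_financial_stmts('quarterly', 'income')
    dict_list = stmts['incomeStatementHistoryQuarterly'][ticker]
    tdf = pd.concat([pd.DataFrame(i) for i in dict_list], axis=1)
    tdf = tdf.rename_axis('incomeStatementHistoryQuarterly').reset_index()
    tdf.insert(0, 'ticker', ticker)
    frames.append(tdf)
screener_df = pd.concat(frames, ignore_index=True)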
I read and transform the data using the following code:
import matplotlib.pyplot as plt
import pandas as pd
import matplotlib.dates as dates
import numpy as np
df = pd.read_csv('data/C2A2_data/BinnedCsvs_d400/fb441e62df2d58994928907a91895ec62c2c42e6cd075c2700843b89.csv', parse_dates=['Date'])
df.drop('ID', axis='columns', inplace = True)
df_min = df[(df['Date']<='2014-12') & (df['Date']>='2004-01') & (df['Element']=='TMIN')]
df_min.drop('Element', axis='columns', inplace = True)
df_min = df_min.groupby('Date').agg({'Data_Value': 'min'}).reset_index()
giving the following result
Date Data_Value
0 2005-01-01 -56
1 2005-01-02 -56
2 2005-01-03 0
3 2005-01-04 -39
4 2005-01-05 -94
Now I am trying to get the Date as year-month, like so:
Date Data_Value
0 2005-01 -94
1 2005-02 xx
2 2005-03 xx
3 2005-04 xx
4 2005-05 xx
Where xx is the minimum value for that year-month.
How do I have to change the groupby call, or is this not possible with this function?
Use pd.Grouper() to aggregate by yearly/monthly/daily frequency.
Code
df_min["Date"] = pd.to_datetime(df_min["Date"])
df_ans = df_min.groupby(pd.Grouper(key="Date", freq="M")).min()
Result
print(df_ans)
Data_Value
Date
2005-01-31 -94
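If you want the index rendered as year-month (2005-01) rather than the month-end timestamp, one option (assuming df_ans from above) is:
df_ans.index = df_ans.index.strftime('%Y-%m')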
You can first map the Date column to keep only the year and month, and then perform a groupby and take the min of each group:
# import libraries
import pandas as pd
# test data
data = [['2005-01-01', -56],['2005-01-01', -3],['2005-01-01', 6],
['2005-01-01', 26],['2005-01-01', 56],['2005-02-01', -26],
['2005-02-01', -2],['2005-02-01', 6],['2005-02-01', 26],
['2005-03-01', 56],['2005-03-01', -33],['2005-03-01', -5],
['2005-03-01', 6],['2005-03-01', 26],['2005-03-01', 56]]
# create dataframe
df_min = pd.DataFrame(data=data, columns=["Date", "Date_value"])
# convert 'Date' column to datetime datatype
df_min['Date'] = pd.to_datetime(df_min['Date'])
# keep only year and month, zero-padded (e.g. '2005-01')
df_min['Date'] = df_min['Date'].dt.strftime('%Y-%m')
# get min value for each group
df_min = df_min.groupby('Date').min()
After printing df_min, the output is:
         Date_value
Date
2005-01         -56
2005-02         -26
2005-03         -33
I need help reformatting my DataFrame output for stock closing prices.
Currently my output has the stock symbols as headers, where I would like to have them displayed in rows. df_output screenshot: https://i.stack.imgur.com/u4jEk.png
I would like to have it displayed as in the desired-results screenshot: one row per date, symbol, and price.
This is my current df_output code (not sure if this is the reason):
prices_df = pd.DataFrame({
a: {x['formatted_date']: x['adjclose'] for x in data[a]['prices']} for a in assets})
FULL CODE:
import pandas as pd
import numpy as np
import yfinance as yf
from yahoofinancials import YahooFinancials
from datetime import datetime
import time
start_time = time.time()
df = pd.read_excel(r'C:\Users\Ryan\Desktop\Stock Portfolio\My Portfolio.xlsx', sheet_name=0, skiprows=2)
list1 = list(df['Stock Code'])
assets = list1
yahoo_financials = YahooFinancials(assets)
data = yahoo_financials.get_historical_price_data(start_date=str(datetime.now().date().replace(month=1, day=1)),
                                                  end_date=str(datetime.now().date().replace(month=12, day=31)),
                                                  time_interval='daily')
prices_df = pd.DataFrame({
a: {x['formatted_date']: x['adjclose'] for x in data[a]['prices']} for a in assets})
Check the pandas functions for converting between long and wide formats, such as wide_to_long (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.wide_to_long.html) and pivot_table (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.pivot_table.html).
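For instance, a minimal pivot_table sketch going from the long layout back to the wide one (the frame and prices here are made up; the column names match the melt output below):
import pandas as pd

# hypothetical long-format frame with Date/Symbol/Price columns
long_df = pd.DataFrame({
    'Date': ['2020-01-02', '2020-01-02', '2020-01-03', '2020-01-03'],
    'Symbol': ['FB', 'CDNS', 'FB', 'CDNS'],
    'Price': [209.78, 109.15, 208.67, 108.53],
})

# wide again: one column per symbol, dates as the index
wide_df = long_df.pivot_table(index='Date', columns='Symbol', values='Price')
print(wide_df)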
Try this:
prices_df.rename_axis('Date').reset_index().melt('Date', var_name='Symbol', value_name='Price')
Output:
Date Symbol Price
0 2020-01-02 FB 209.779999
1 2020-01-03 FB 208.669998
2 2020-01-06 FB 212.600006
3 2020-01-07 FB 213.059998
4 2020-01-08 FB 215.220001
.. ... ... ...
973 2020-08-18 CDNS 109.150002
974 2020-08-19 CDNS 108.529999
975 2020-08-20 CDNS 111.260002
976 2020-08-21 CDNS 110.570000
977 2020-08-24 CDNS 111.260002
[978 rows x 3 columns]
I'm doing a finance study based on the YouTube video below, and I would like to understand why I got NaN returns instead of the expected calculation. What do I need to change in this script to reach the expected values?
YouTube case: https://www.youtube.com/watch?v=UpbpvP0m5d8
import investpy as env
import numpy as np
import pandas as pd
lt = ['ABEV3','CEAB3','ENBR3','FLRY3','IRBR3','ITSA4','JHSF3','STBP3']
prices = pd.DataFrame()
for i in lt:
    df = env.get_stock_historical_data(stock=i, from_date='01/01/2020', to_date='29/05/2020', country='brazil')
    df['Ativo'] = i
    prices = pd.concat([prices, df], sort=True)
pivoted = prices.pivot(columns='Ativo', values='Close')
e_r = pivoted.resample('Y').last().pct_change().mean()
e_r
Return:
Ativo
ABEV3 NaN
CEAB3 NaN
ENBR3 NaN
FLRY3 NaN
IRBR3 NaN
ITSA4 NaN
JHSF3 NaN
STBP3 NaN
dtype: float64
You need to change 'from_date' so that you have more than one year of data.
Your current script's resample('Y') returns a single yearly row, and .pct_change() on one row of data returns NaN because there is no previous row to compare against.
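A minimal illustration of that behavior:
import pandas as pd

one_year = pd.Series([100.0], index=pd.to_datetime(['2020-12-31']))
print(one_year.pct_change())    # NaN: no previous row to compare against

two_years = pd.Series([100.0, 110.0],
                      index=pd.to_datetime(['2019-12-31', '2020-12-31']))
print(two_years.pct_change())   # first row NaN, second row 0.10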
When I changed from_date to '01/01/2018':
import investpy as env
import numpy as np
import pandas as pd
lt = ['ABEV3','CEAB3','ENBR3','FLRY3','IRBR3','ITSA4','JHSF3','STBP3']
prices = pd.DataFrame()
for i in lt:
    df = env.get_stock_historical_data(stock=i, from_date='01/01/2018', to_date='29/05/2020', country='brazil')
    df['Ativo'] = i
    prices = pd.concat([prices, df], sort=True)
pivoted = prices.pivot(columns='Ativo', values='Close')
e_r = pivoted.resample('Y').last().pct_change().mean()
e_r
I get the following output:
Ativo
ABEV3 -0.043025
CEAB3 -0.464669
ENBR3 0.180655
FLRY3 0.191976
IRBR3 -0.175084
ITSA4 -0.035767
JHSF3 1.283291
STBP3 0.223627
dtype: float64
I have some long-winded code here with an issue when I attempt to join (or merge/concat) two datasets together: I get TypeError: Cannot compare type 'Timestamp' with type 'int'.
The two datasets both come from resampling the same initial dataset. The master_hrs df comes from a change point detection process using the Python package ruptures (pip install ruptures to run the code). The daily_summary df just uses pandas to resample daily mean and sum values. But I get the error when I attempt to combine the datasets. Would anyone have any tips to try?
Making up some fake data generates the same error as my real-world dataset. I think the issue is that I am somehow comparing a datetime to a numpy integer. Any tips greatly appreciated. Thanks
import ruptures as rpt
import calendar
import numpy as np
import pandas as pd
np.random.seed(11)
rows,cols = 50000,2
data = np.random.rand(rows,cols)
tidx = pd.date_range('2019-01-01', periods=rows, freq='H')
df = pd.DataFrame(data, columns=['Temperature','Value'], index=tidx)
def changPointDf(df):
    arr = np.array(df.Value)
    # define binary segmentation search method
    model = "l2"
    algo = rpt.Binseg(model=model).fit(arr)
    my_bkps = algo.predict(n_bkps=5)
    # get the timestamps of the change points
    bkps_timestamps = df.iloc[[0] + my_bkps[:-1] + [-1]].index
    # compute the durations between change points
    durations = bkps_timestamps[1:] - bkps_timestamps[:-1]
    # durations in hours (each group covers a single day)
    d = durations.seconds / 60 / 60
    d_f = pd.DataFrame(d)
    df2 = d_f.T
    return df2
master_hrs = pd.DataFrame()
for idx, days in df.groupby(df.index.date):
    changPoint_df = changPointDf(days)
    values = changPoint_df.values.tolist()
    master_hrs = master_hrs.append(values)
master_hrs.columns = ['overnight_AM_hrs', 'moring_startup_hrs', 'moring_ramp_hrs', 'high_load_hrs', 'evening_shoulder_hrs']
daily_summary = pd.DataFrame()
daily_summary['Temperature'] = df['Temperature'].resample('D').mean()
daily_summary['Value'] = df['Value'].resample('D').sum()
final_df = daily_summary.join(master_hrs)
The issue was the indexes themselves: master_hrs had an int64 index whereas daily_summary had a datetime index. Include this before joining the two dataframes together:
master_hrs.index = pd.to_datetime(master_hrs.index)
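A quick way to spot this kind of mismatch before a join is to compare the index dtypes (a sketch, using the frames from the question):
print(master_hrs.index.dtype)     # int64 before the fix
print(daily_summary.index.dtype)  # datetime64[ns]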
Just for clarity, here's my output of final_df:
Temperature Value ... high_load_hrs evening_shoulder_hrs
2019-01-01 0.417517 12.154527 ... NaN NaN
2019-01-02 0.521131 13.811842 ... NaN NaN
2019-01-03 0.583205 12.568966 ... NaN NaN
2019-01-04 0.448225 14.036136 ... NaN NaN
2019-01-05 0.542870 10.738192 ... NaN NaN
... ... ... ... ...
2024-09-10 0.470421 13.775528 ... NaN NaN
2024-09-11 0.384672 10.473930 ... NaN NaN
2024-09-12 0.527284 14.000231 ... NaN NaN
2024-09-13 0.555646 11.460867 ... NaN NaN
2024-09-14 0.426003 3.763975 ... NaN NaN
[2084 rows x 7 columns]
Hopefully this gets you what you need.
I have a pandas df, and I use between_time(a, b) to clean the data. How do I get the opposite, "not between_time", behavior?
I know I can try something like
df.between_time('00:00:00', a)
df.between_time(b, '23:59:59')
and then combine the results and sort the new df. But it's very inefficient, and it doesn't work for me as I have data between 23:59:59 and 00:00:00.
Thanks
You could find the index locations of the rows with times between a and b, and then use Index.difference to remove them from the index:
import pandas as pd
import io
text = '''\
date,time, val
20120105, 080000, 1
20120105, 080030, 2
20120105, 080100, 3
20120105, 080130, 4
20120105, 080200, 5
20120105, 235959.01, 6
'''
df = pd.read_csv(io.StringIO(text), parse_dates=[[0, 1]], index_col=0)
index = df.index
ivals = index.indexer_between_time('8:01:30','8:02')
print(df.reindex(index.difference(index[ivals])))
yields
val
date_time
2012-01-05 08:00:00 1
2012-01-05 08:00:30 2
2012-01-05 08:01:00 3
2012-01-05 23:59:59.010000 6
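Alternatively, between_time itself wraps around midnight when the start time is later than the end time, so swapping the arguments yields the "not between" rows directly, which also covers data between 23:59:59 and 00:00:00. A sketch using the same df (adjust the boundary-inclusion arguments for your pandas version if the endpoint rows matter):
print(df.between_time('8:02', '8:01:30'))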