How can you append to an existing df from inside a for loop? For example:
import pandas as pd
from pandas_datareader import data as web
stocks = ['amc', 'aapl']
colnames = ['Datetime', 'Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume', 'Name']
df1 = pd.DataFrame(data=None, columns=colnames)
for stock in stocks:
    df = web.DataReader(stock, 'yahoo')
    df['Name'] = stock
What should I do next so that df is appended to df1?
You could try pandas.concat()
df1 = pd.DataFrame(data=None, columns=colnames)
for stock in stocks:
    df = web.DataReader(stock, 'yahoo')
    df['Name'] = stock
    df1 = pd.concat([df1, df], ignore_index=True)
Instead of calling concat on every iteration, you could also append each dataframe to a list and concatenate once after the loop:
dfs = []
for stock in stocks:
    df = web.DataReader(stock, 'yahoo')
    df['Name'] = stock
    dfs.append(df)
df_ = pd.concat(dfs, ignore_index=True)
print(df_)
High Low Open Close Volume Adj Close Name
0 32.049999 31.549999 31.900000 31.549999 1867000.0 24.759203 amc
1 31.799999 30.879999 31.750000 31.000000 1225900.0 24.327585 amc
2 31.000000 30.350000 30.950001 30.799999 932100.0 24.170631 amc
3 30.900000 30.250000 30.700001 30.350000 1099000.0 23.817492 amc
4 30.700001 30.100000 30.549999 30.650000 782500.0 24.052916 amc
... ... ... ... ... ... ... ...
2515 179.009995 176.339996 176.690002 178.960007 100589400.0 178.960007 aapl
2516 179.610001 176.699997 178.550003 177.770004 92633200.0 177.770004 aapl
2517 178.029999 174.399994 177.839996 174.610001 103049300.0 174.610001 aapl
2518 174.880005 171.940002 174.029999 174.309998 78699800.0 174.309998 aapl
2519 174.880005 171.940002 174.029999 174.309998 78751328.0 174.309998 aapl
[2520 rows x 7 columns]
What you're trying to do won't quite work, since the data retrieved by DataReader has several columns and you need that data for several stocks. However, each of those columns is a time series.
So what you probably want is something that looks like this:
Stock amc
Field High Low Open ...
2022-03-30 29.230000 25.350000 ...
2022-03-31 25.920000 23.260000 ...
2022-04-01 25.280001 22.340000 ...
2022-04-01 25.280001 22.340000 ...
...
You'd then be able to access df[('amc', 'Low')] to get a time series for that stock, or df[('amc', 'Low')]['2022-04-01'][0] to get the 'Low' value for 'amc' on April 1st.
This gets you exactly that:
import pandas as pd
from pandas_datareader import data as web
stocks = ['amc', 'aapl']
df = pd.DataFrame()
for stock_name in stocks:
    stock_df = web.DataReader(stock_name, data_source='yahoo')
    for col in stock_df:
        df[(stock_name, col)] = stock_df[col]
df.columns = pd.MultiIndex.from_tuples(df.columns, names=['Stock', 'Field'])
print(f'\nall data:\n{"-"*40}\n', df)
print(f'\none series:\n{"-"*40}\n', df[('aapl', 'Volume')])
print(f'\nsingle value:\n{"-"*40}\n', df[('amc', 'Low')]['2022-04-01'][0])
The solution uses a MultiIndex to achieve what you need. It first loads all the data as retrieved from the API into columns labeled with tuples of stock name and field, and it then converts that into a proper MultiIndex after loading completes.
Output:
all data:
----------------------------------------
Stock amc ... aapl
Field High Low ... Volume Adj Close
Date ...
2017-04-04 32.049999 31.549999 ... 79565600.0 34.171505
2017-04-05 31.799999 30.879999 ... 110871600.0 33.994480
2017-04-06 31.000000 30.350000 ... 84596000.0 33.909496
2017-04-07 30.900000 30.250000 ... 66688800.0 33.833969
2017-04-10 30.700001 30.100000 ... 75733600.0 33.793839
... ... ... ... ... ...
2022-03-29 34.330002 26.410000 ... 100589400.0 178.960007
2022-03-30 29.230000 25.350000 ... 92633200.0 177.770004
2022-03-31 25.920000 23.260000 ... 103049300.0 174.610001
2022-04-01 25.280001 22.340000 ... 78699800.0 174.309998
2022-04-01 25.280001 22.340000 ... 78751328.0 174.309998
[1260 rows x 12 columns]
one series:
----------------------------------------
Date
2017-04-04 79565600.0
2017-04-05 110871600.0
2017-04-06 84596000.0
2017-04-07 66688800.0
2017-04-10 75733600.0
...
2022-03-29 100589400.0
2022-03-30 92633200.0
2022-03-31 103049300.0
2022-04-01 78699800.0
2022-04-01 78751328.0
Name: (aapl, Volume), Length: 1260, dtype: float64
single value:
----------------------------------------
22.34000015258789
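For reference, the same (Stock, Field) column MultiIndex can be built in a single call with pd.concat and a dict of frames; this is a sketch equivalent to the loop above, not part of the original answer:
import pandas as pd
from pandas_datareader import data as web

stocks = ['amc', 'aapl']
# concat turns the dict keys into the outer column level;
# `names` labels the two resulting levels.
frames = {stock: web.DataReader(stock, data_source='yahoo') for stock in stocks}
df = pd.concat(frames, axis=1, names=['Stock', 'Field'])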
My initial Pandas Dataframe is:
Open High Low Close Volume
2021-11-19 09:30:00-05:00 16549.50 16559.25 16516.25 16530.25 20198.0
2021-11-19 09:35:00-05:00 16530.50 16562.00 16525.50 16556.50 11274.0
... ... ... ... ... ...
2021-11-21 09:35:00-05:00 16556.50 16564.00 16544.00 16563.75 7632.0
2021-11-21 09:40:00-05:00 16563.75 16583.25 16555.50 16580.50 10404.0
... ... ... ... ... ...
2021-11-22 09:50:00-05:00 16580.25 16589.00 16571.25 16587.50 7997.0
2021-11-22 09:55:00-05:00 16580.25 16589.00 16571.25 16587.50 7997.0
... ... ... ... ... ...
My desired DataFrame, derived from the one above:
                               Open  Day's Open
2021-11-19 09:30:00-05:00  16549.50    16549.50
2021-11-19 09:35:00-05:00  16530.50    16549.50
...                             ...         ...
2021-11-21 09:35:00-05:00  16556.50    16556.50
2021-11-21 09:40:00-05:00  16563.75    16556.50
...                             ...         ...
2021-11-22 09:50:00-05:00  16580.25    16580.25
2021-11-22 09:55:00-05:00  16580.25    16580.25
...                             ...         ...
I tried the code below, but it only gives me one row per day (the first row of each day), rather than a value aligned with every row:
In [1]: df.groupby([df.index.date], axis=0).first()
Edit:
Here is my solution. Maybe there are better solutions than this one.
import numpy as np
import pandas as pd

# Here df is our initial OHLCV dataframe.
# Group by date and timestamp and take the first 'Open' of each group
df1 = pd.DataFrame(df.groupby([df.index.date, df.index])['Open'].first())
# Turn the timestamp level back into a column
df1.reset_index(level=1, inplace=True)
# Group by date and keep the first row for each date, then set the timestamp column as the index
df1 = df1.groupby(level=[0]).first().set_index('level_1')
# Rename the column 'Open' to "Day's Open"
df1.rename(columns={'Open': "Day's Open"}, inplace=True)
# Clear the index name 'level_1'
df1.index.name = None
# Make a dataframe with the 'Open' column from the initial OHLCV dataframe
df2 = pd.DataFrame(df['Open'])
# Add a "Day's Open" column filled with NaN
df2["Day's Open"] = np.nan
# Update the dataframe with the daily open values
df2.update(df1)
# Fill the remaining NaN values with the last non-NaN value (assign the result)
df2 = df2.ffill()
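A more compact alternative (a sketch, assuming df has a DatetimeIndex and an 'Open' column) is to broadcast each day's first open onto every row of that day with groupby plus transform:
# Broadcast the first 'Open' of each calendar day to all rows of that day.
df["Day's Open"] = df.groupby(df.index.date)['Open'].transform('first')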
I'm using yfinance to download the price history for multiple symbols, which returns a dataframe with a MultiIndex on the columns. For example:
import yfinance as yf
df = yf.download(tickers = ['AAPL', 'MSFT'], period = '2d')
A similar dataframe could be constructed without yfinance like:
import pandas as pd
pd.options.display.float_format = '{:.2f}'.format
import numpy as np
attributes = ['Adj Close', 'Close', 'High', 'Low', 'Open', 'Volume']
symbols = ['AAPL', 'MSFT']
dates = ['2020-07-23', '2020-07-24']
data = [[[371.38, 202.54], [371.38, 202.54], [388.31, 210.92], [368.04, 202.15], [387.99, 207.19], [49251100, 67457000]],
[[370.46, 201.30], [370.46, 201.30], [371.88, 202.86], [356.58, 197.51 ], [363.95, 200.42], [46323800, 39799500]]]
data = np.array(data).reshape(len(dates), len(symbols) * len(attributes))
cols = pd.MultiIndex.from_product([attributes, symbols])
df = pd.DataFrame(data, index=dates, columns=cols)
df
Output:
Adj Close Close High Low Open Volume
AAPL MSFT AAPL MSFT AAPL MSFT AAPL MSFT AAPL MSFT AAPL MSFT
2020-07-23 371.38 202.54 371.38 202.54 388.31 210.92 368.04 202.15 387.99 207.19 49251100.0 67457000.0
2020-07-24 370.46 201.30 370.46 201.30 371.88 202.86 356.58 197.51 363.95 200.42 46323800.0 39799500.0
Once I have this dataframe, I want to restructure it so that I have a row for each symbol and date. I'm currently doing this by looping through a list of symbols, calling the API once per symbol, and appending the results. I'm sure there must be a more efficient way:
df = pd.DataFrame()
symbols = ['AAPL', 'MSFT']
for symbol in symbols:
    result = yf.download(tickers=symbol, start='2020-07-23', end='2020-07-25')
    result.insert(0, 'symbol', symbol)
    df = pd.concat([df, result])
Example of the desired output:
df
symbol Open High Low Close Adj Close Volume
Date
2020-07-23 AAPL 387.989990 388.309998 368.040009 371.380005 371.380005 49251100
2020-07-24 AAPL 363.950012 371.880005 356.579987 370.459991 370.459991 46323800
2020-07-23 MSFT 207.190002 210.919998 202.149994 202.539993 202.539993 67457000
2020-07-24 MSFT 200.419998 202.860001 197.509995 201.300003 201.300003 39799500
This looks like a simple stacking operation. Let's go with
df = yf.download(tickers = ['AAPL', 'MSFT'], period = '2d') # Get your data
df.stack(level=1).rename_axis(['Date', 'symbol']).reset_index(level=1)
Output:
symbol Adj Close ... Open Volume
Date ...
2020-07-23 AAPL 371.380005 ... 387.989990 49251100
2020-07-23 MSFT 202.539993 ... 207.190002 67457000
2020-07-24 AAPL 370.459991 ... 363.950012 46323800
2020-07-24 MSFT 201.300003 ... 200.419998 39799500
[4 rows x 7 columns]
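If you want the columns in exactly the order shown in the desired output, you can reorder them afterwards (a small sketch building on the one-liner above; the column names are the standard yfinance fields):
out = (df.stack(level=1)
         .rename_axis(['Date', 'symbol'])
         .reset_index(level=1))
# Reorder the columns to match the desired layout.
out = out[['symbol', 'Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume']]
print(out)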
I am having a lot of trouble merging dataframes that share the same index and are all created in the same for loop. When I run the code below, it prints two dataframes, one per iteration; I want to do some type of dataframe.merge() so that I get something that looks like this.
# price1 when printed in the for loop
Close tic
Date
2010-05-27 31.33 AAPL
2010-05-28 31.77 AAPL
... ... ...
2020-05-22 318.89 AAPL
2020-05-26 316.73 AAPL
[2516 rows x 2 columns]
Close tic
Date
2010-05-27 38.54 TROW
2010-05-28 37.08 TROW
... ... ...
2020-05-22 115.09 TROW
2020-05-26 120.05 TROW
[2516 rows x 2 columns]
Next is what I want it to look like, with the frames merged on the index so that each new dataframe contributes its columns alongside the existing ones.
#what I want it to look like
Close tic Close tic
Date
2010-05-27 31.33 AAPL 38.54 TROW
2010-05-28 31.77 AAPL 37.08 TROW
...            ...   ...     ...   ...
2020-05-22 318.89 AAPL 115.09 TROW
2020-05-26 316.73 AAPL 120.05 TROW
[2516 rows x 4 columns]
My code is below.
import yfinance as yf
import pandas as pd
import csv
def price(ticker):
    company = yf.Ticker(ticker)
    price = company.history(period="10y")
    price_df = pd.DataFrame(price)
    # Drop every column except 'Close'
    price_df.drop(price_df.columns[[0, 1, 2, 4, 5, 6]], axis=1, inplace=True)
    price_df['tic'] = ticker
    return price_df
l = ["AAPL", "TROW"]
for ticker in l:
    price1 = price(ticker)
    print(price1)
Thanks in advance.
Make sure your "date" column is the dataframe index (note that set_index returns a new frame, so assign the result):
df1 = df1.set_index('date')
df2 = df2.set_index('date')
then concatenate the two frames side by side:
df_merged = pd.concat((df1, df2), axis=1)
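Applied to the question's loop, that might look like the following sketch (price() is the helper defined in the question; column names will repeat, as in the desired output):
# Collect one frame per ticker, then line them up side by side on the
# shared Date index.
frames = [price(ticker) for ticker in ["AAPL", "TROW"]]
df_merged = pd.concat(frames, axis=1)
print(df_merged)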
According to your sample data, simply
price1 = price('AAPL').join(price('TROW'), rsuffix='_TROW')
print(price1)
may work fine (a suffix is needed here because both frames share the column names Close and tic, and join refuses overlapping names without one).
In more complicated cases, the comments from CypherX could be considered.
I want to add a column to the following DataFrame for each stock containing its 5-year (60-month) rolling returns. The following code is used to obtain the financial data over the period 1995 to 2010.
import pandas as pd
import quandl

quandl.ApiConfig.api_key = 'Enter Key'
stocks = ['MSFT', 'AAPL', 'WMT', 'GE', 'KO']
stockdata = quandl.get_table('WIKI/PRICES', ticker=stocks, paginate=True,
                             qopts={'columns': ['date', 'ticker', 'adj_close']},
                             date={'gte': '1995-1-1', 'lte': '2010-12-31'})
# Set date as index, with columns of tickers and adjusted closing price
df = stockdata.pivot(index='date', columns='ticker')
df.index = pd.to_datetime(df.index)
df.resample('1M').mean()
df = df.pct_change()
df.head()
Out[1]:
rets
ticker AAPL BA F GE JNJ KO
date
1995-01-03 NaN NaN NaN NaN NaN NaN
1995-01-04 0.026055 -0.002567 0.026911 0.000000 0.006972 -0.019369
1995-01-05 -0.012697 0.002573 -0.008735 0.002549 -0.002369 -0.004938
1995-01-06 0.080247 0.018824 0.000000 -0.004889 -0.006758 0.000000
1995-01-09 -0.019048 0.000000 0.017624 -0.009827 -0.011585 -0.014887
df.tail()
Out[2]:
rets
ticker AAPL BA F GE JNJ KO
date
2010-12-27 0.003337 -0.004765 0.005364 0.008315 -0.005141 -0.007777
2010-12-28 0.002433 0.001699 -0.008299 0.007147 0.001938 0.004457
2010-12-29 -0.000553 0.002929 0.000598 -0.002729 0.001289 0.001377
2010-12-30 -0.005011 -0.000615 -0.002987 -0.004379 -0.003058 0.000764
2010-12-31 -0.003399 0.003846 0.005992 0.005498 -0.001453 0.004122
Any assistance of how to do this would be awesome!
The problem is the multi-level index on the columns. We can start by selecting the outer level ('rets'), and after that the rolling mean works:
means = df['rets'].rolling(60).mean()
means.tail()
The error you are receiving comes from passing the entire dataframe into the rolling function while your frame uses a MultiIndex on the columns. You can't pass a MultiIndex frame to rolling this way, since rolling operates on one column at a time. You'll probably have to create a for loop and compute the values individually per ticker, as sketched below.
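A minimal sketch of that loop, assuming df['rets'] holds one column of daily returns per ticker as in the question:
# Compute the 60-period rolling mean of returns separately for each ticker.
rolling_means = pd.DataFrame(index=df.index)
for ticker in df['rets'].columns:
    rolling_means[ticker] = df['rets'][ticker].rolling(60).mean()
print(rolling_means.tail())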
I have a dataframe of historical stock trades. The frame has columns like ['ticker', 'date', 'cusip', 'profit', 'security_type']. Initially:
trades['cusip'] = np.nan
trades['security_type'] = np.nan
I have historical config files that I can load into frames that have columns like ['ticker', 'cusip', 'date', 'name', 'security_type', 'primary_exchange'].
I would like to UPDATE the trades frame with the cusip and security_type from config, but only where the ticker and date match.
I thought I could do something like:
pd.merge(trades, config, on=['ticker', 'date'], how='left')
But that doesn't update the columns, it just adds the config columns to trades.
The following works, but I think there has to be a better way. If not, I will probably do it outside of pandas.
for date in trades['date'].unique():
    config = get_config_file_as_df(date)
    ## config['date'] == date
    for ticker in trades['ticker'][trades['date'] == date]:
        trades['cusip'][
            (trades['ticker'] == ticker)
            & (trades['date'] == date)
        ] = config['cusip'][config['ticker'] == ticker].values[0]
        trades['security_type'][
            (trades['ticker'] == ticker)
            & (trades['date'] == date)
        ] = config['security_type'][config['ticker'] == ticker].values[0]
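One loop-free way to get the same effect (a sketch; it assumes the per-date config frames have been concatenated into a single config frame with 'ticker' and 'date' columns) is a left merge followed by fillna:
# Merge config in with suffixed column names, then fill the NaN cells of the
# original columns from the merged-in ones and drop the helper columns.
merged = trades.merge(config[['ticker', 'date', 'cusip', 'security_type']],
                      on=['ticker', 'date'], how='left', suffixes=('', '_cfg'))
for col in ['cusip', 'security_type']:
    merged[col] = merged[col].fillna(merged[col + '_cfg'])
    merged = merged.drop(columns=[col + '_cfg'])
trades = merged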
Suppose you have this setup:
import pandas as pd
import numpy as np
import datetime as DT
nan = np.nan
trades = pd.DataFrame({'ticker': ['IBM', 'MSFT', 'GOOG', 'AAPL'],
                       'date': pd.date_range('1/1/2000', periods=4),
                       'cusip': [nan, nan, 100, nan]})
trades = trades.set_index(['ticker', 'date'])
print(trades)
# cusip
# ticker date
# IBM 2000-01-01 NaN
# MSFT 2000-01-02 NaN
# GOOG 2000-01-03 100 # <-- We do not want to overwrite this
# AAPL 2000-01-04 NaN
config = pd.DataFrame({'ticker': ['IBM', 'MSFT', 'GOOG', 'AAPL'],
                       'date': pd.date_range('1/1/2000', periods=4),
                       'cusip': [1, 2, 3, nan]})
config = config.set_index(['ticker', 'date'])
# Let's permute the index to show `DataFrame.update` correctly matches rows based on the index, not on the order of the rows.
new_index = sorted(config.index)
config = config.reindex(new_index)
print(config)
# cusip
# ticker date
# AAPL 2000-01-04 NaN
# GOOG 2000-01-03 3
# IBM 2000-01-01 1
# MSFT 2000-01-02 2
Then you can update NaN values in trades with values from config using the DataFrame.update method. Note that DataFrame.update matches rows based on indices (which is why set_index was called above).
trades.update(config, join = 'left', overwrite = False)
print(trades)
# cusip
# ticker date
# IBM 2000-01-01 1
# MSFT 2000-01-02 2
# GOOG 2000-01-03 100 # If overwrite = True, then 100 is overwritten by 3.
# AAPL 2000-01-04 NaN
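Applied to the question's frames, the same pattern might look like this (a sketch; it assumes trades and config both carry 'ticker' and 'date' columns as described):
# Index both frames on (ticker, date) so update() can align rows, fill only
# the NaN cells of 'cusip' and 'security_type', then restore the columns.
trades = trades.set_index(['ticker', 'date'])
config = config.set_index(['ticker', 'date'])
trades.update(config[['cusip', 'security_type']], overwrite=False)
trades = trades.reset_index()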