How to rename columns in a DataFrame with pandas in Python

I have five stock portfolios that I have imported from Yahoo! Finance and need to create a DataFrame with the 2016 closing prices of all the stocks. However, I'm struggling to label the columns with the corresponding stock names.
import pandas_datareader.data as web
import pandas as pd
from pandas import Series, DataFrame
import numpy as np
import datetime
start = datetime.datetime(2016, 1, 1)
end = datetime.datetime(2016, 12, 31)
NFLX = web.DataReader("NFLX", 'yahoo', start, end)
AAPL = web.DataReader("AAPL", 'yahoo', start, end)
GOOGL = web.DataReader("GOOGL", 'yahoo', start, end)
FB = web.DataReader("FB", 'yahoo', start, end)
TSLA = web.DataReader("TSLA", 'yahoo', start, end)
df_NFLX = pd.DataFrame(NFLX['Close'])
df_AAPL = pd.DataFrame(AAPL['Close'])
df_GOOGL = pd.DataFrame(GOOGL['Close'])
df_FB = pd.DataFrame(FB['Close'])
df_TSLA = pd.DataFrame(TSLA['Close'])
frames = [df_NFLX, df_AAPL, df_GOOGL, df_FB, df_TSLA]
result = pd.concat(frames, axis = 1)
result = result.rename(columns = {'Two':'N'})
result
My code produces this - and I want to title each column accordingly.
Out[15]:
Close Close Close Close Close
Date
2016-01-04 109.959999 105.349998 759.440002 102.220001 223.410004
2016-01-05 107.660004 102.709999 761.530029 102.730003 223.429993
2016-01-06 117.680000 100.699997 759.330017 102.970001 219.039993
2016-01-07 114.559998 96.449997 741.000000 97.919998 215.649994
2016-01-08 111.389999 96.959999 730.909973 97.330002 211.000000
2016-01-11 114.970001 98.529999 733.070007 97.510002 207.850006
2016-01-12 116.580002 99.959999 745.340027 99.370003 209.970001

A simple way to patch up the code you've written is to just assign a list of names to df.columns.
df.columns = ['NFLX', 'AAPL', 'GOOGL', 'FB', 'TSLA']
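As a quick sanity check of that assignment on a tiny synthetic frame (the numbers here are made up, standing in for the concatenated Close columns):

```python
import pandas as pd

# Two identically-named Close columns, as produced by the concat above
df = pd.DataFrame([[109.96, 105.35], [107.66, 102.71]],
                  columns=['Close', 'Close'])

# Assigning a list to df.columns relabels every column in one shot
df.columns = ['NFLX', 'AAPL']
print(list(df.columns))  # ['NFLX', 'AAPL']
```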
However, there are ways to make large chunks of your code more concise which also allow you to specify the stock names as column names cleanly. I would go back to the beginning and (after defining start and end) start by creating a list of the stock tickers you want to fetch.
start = datetime.datetime(2016, 1, 1)
end = datetime.datetime(2016, 12, 31)
tickers = ['NFLX', 'AAPL', 'GOOGL', 'FB', 'TSLA']
Then you can construct all the data frames in a loop of some kind. If you want only the Close column, you can extract that column immediately, and in fact you can make a dict out of all these columns and then construct a DataFrame directly from that dict.
result = DataFrame({t: web.DataReader(t, 'yahoo', start, end)['Close']
                    for t in tickers})
An alternative would be to put all the stock data in a Panel, which would be useful if you might want to work with other columns.
p = pd.Panel({t: web.DataReader(t, 'yahoo', start, end) for t in tickers})
Then you can extract the Close figures with
result = p[:,:,'Close']
You'll notice it has the proper column labels automatically.
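Note that Panel was deprecated and has since been removed from pandas; pd.concat with a dict of frames gives the same labeled layout. A minimal sketch with synthetic frames standing in for the web.DataReader results:

```python
import pandas as pd

# Made-up stand-ins for web.DataReader output (one frame per ticker)
idx = pd.to_datetime(['2016-01-04', '2016-01-05'])
frames = {
    'NFLX': pd.DataFrame({'Open': [110.00, 108.00], 'Close': [109.96, 107.66]}, index=idx),
    'AAPL': pd.DataFrame({'Open': [106.00, 103.00], 'Close': [105.35, 102.71]}, index=idx),
}

# The dict keys become the outer level of a MultiIndex on the columns
p = pd.concat(frames, axis=1)

# Cross-section pulls every ticker's Close column, labeled by ticker
result = p.xs('Close', axis=1, level=1)
print(sorted(result.columns))  # ['AAPL', 'NFLX']
```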

To rename the columns in the constructed table, you can change this:
df_NFLX = pd.DataFrame(NFLX['Close'])
to this:
df_NFLX = pd.DataFrame(NFLX['Close']).rename(columns={'Close': 'NFLX'})


Loop to import and save each stock with its respective name through Yahoo Finance

I have a question:
I am using Yahoo Finance to import information about some stocks. I have a "ticker" list with the names of the stocks to import, and I use pdr.get_data_yahoo. My code is the following:
import pandas as pd
import datetime as dt
import pandas_datareader as pdr
ticker = ['AAPL', 'AMZN', 'FB']  # stocks
start = dt.datetime(2020, 11, 21)  # from
end = dt.datetime(2020, 12, 4)  # to
AAPL = pdr.get_data_yahoo('AAPL', start = start, end = end) # a)
AMZN = pdr.get_data_yahoo('AMZN', start = start, end = end) # b)
FB = pdr.get_data_yahoo('FB', start = start, end = end) # c)
Instead of importing stock by stock as in a), b) and c), how could I write a loop or iteration to save each stock's information in a variable with its respective name (equal to the ticker string)? That is, I want to be able to modify and/or add stocks to the ticker list and have each one saved under its respective name.
I was trying a loop like
ticker = ['AAPL', 'AMZN', 'FB', 'MSFT', 'AAL']
for ticker in tickers:
    ticker[i] = pdr.get_data_yahoo(ticker, start=start, end=end)
You have two problems:
First: ticker is a list, so you can't index it like a dictionary with ticker[i].
Second: you iterate over tickers (with an s), but you don't have that variable.
You simply mixed up two similar names. I will use the names all_tickers and ticker to show the problem more clearly,
and use a dictionary for the results:
all_tickers = ['AAPL', 'AMZN', 'FB', 'MSFT', 'AAL']
results = {}
for ticker in all_tickers:
    results[ticker] = pdr.get_data_yahoo(ticker, start=start, end=end)
EDIT:
Don't use separate variables AAPL, AMZN and FB; keep the data in the dictionary as results["AAPL"], results["AMZN"], results["FB"].
Usually you will work with all tickers like this:
for ticker in all_tickers:
    print(ticker, results[ticker])
or
max_value = {}
for ticker in all_tickers:
    max_value[ticker] = max(results[ticker])
    #print('max value for', ticker, 'is', max_value[ticker])
for ticker in all_tickers:
    print('max value for', ticker, 'is', max_value[ticker])
or
my_favorite = ['AMZN', 'FB']
for ticker in my_favorite:
    print(ticker, results[ticker])
etc.
If you add a new ticker to all_tickers, the rest of the code will work the same.
Eventually you could put everything into a pandas.DataFrame, and then you could work with all the values even without a for loop.
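That last step might look like this: building one DataFrame out of the results dict, with made-up numbers standing in for the Yahoo data:

```python
import pandas as pd

# Pretend these frames came back from pdr.get_data_yahoo
results = {
    'AAPL': pd.DataFrame({'Close': [117.3, 113.9]}),
    'AMZN': pd.DataFrame({'Close': [3099.4, 3158.0]}),
}

# One Close column per ticker, aligned on the shared index
combined = pd.DataFrame({t: df['Close'] for t, df in results.items()})

# Per-ticker maximum of the Close column, no explicit loop needed
print(combined.max())
```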
You can use a dictionary and save the ticker data with the ticker name in it.
Example:
import pandas as pd
import datetime as dt
import pandas_datareader as pdr
tickers = ['AAPL', 'AMZN', 'FB']  # stocks
ticker_data = {}
start = dt.datetime(2020, 11, 21)  # from
end = dt.datetime(2020, 12, 4)  # to
for ticker in tickers:
    ticker_data[ticker] = pdr.get_data_yahoo(ticker, start=start, end=end)
print(ticker_data)

How to start a for loop for this given DataFrame in Pandas for multiple same name rows?

I need some help; I am working in a .ipynb file to filter data and extract certain things from a DataFrame.
This is the DataFrame I'm working with.
As you can see, this dataframe contains multiple rows with the same SYMBOL.
I need help writing a "for" loop that, for every symbol, finds the highest CHG_IN_OI and extracts the row containing it.
For example, if there are 14 rows with the symbol ACC, I need to find the highest CHG_IN_OI for ACC in the CHG_IN_OI column, get the row containing that highest change, and retain the remaining columns as well.
I have made a list named, Multisymbols which has these symbols:
multisymbols = [
'ACC',
'ADANIENT',
'ADANIPORTS',
'AMARAJABAT',
'AMBUJACEM',
'APOLLOHOSP',
'APOLLOTYRE',
'ASHOKLEY',
'ASIANPAINT',
'AUROPHARMA',
'AXISBANK',
'BAJAJ-AUTO',
'BAJAJFINSV',
'BAJFINANCE',
'BALKRISIND',
'BANDHANBNK',
'BANKBARODA',
'BATAINDIA',
'BEL',
'BERGEPAINT',
'BHARATFORG',
'BHARTIARTL',
'BHEL',
'BIOCON',
'BOSCHLTD',
'BPCL',
'BRITANNIA',
'CADILAHC',
'CANBK',
'CENTURYTEX',
'CHOLAFIN',
'CIPLA',
'COALINDIA',
'COLPAL',
'CONCOR',
'CUMMINSIND',
'DABUR',
'DIVISLAB',
'DLF',
'DRREDDY',
'EICHERMOT',
'EQUITAS',
'ESCORTS',
'EXIDEIND',
'FEDERALBNK',
'GAIL',
'GLENMARK',
'GMRINFRA',
'GODREJCP',
'GODREJPROP',
'GRASIM',
'HAVELLS',
'HCLTECH',
'HDFC',
'HDFCBANK',
'HDFCLIFE',
'HEROMOTOCO',
'HINDALCO',
'HINDPETRO',
'HINDUNILVR',
'IBULHSGFIN',
'ICICIBANK',
'ICICIPRULI',
'IDEA',
'IDFCFIRSTB',
'IGL',
'INDIGO',
'INDUSINDBK',
'INFRATEL',
'INFY',
'IOC',
'ITC',
'JINDALSTEL',
'JSWSTEEL',
'JUBLFOOD',
'KOTAKBANK',
'L&TFH',
'LICHSGFIN',
'LT',
'LUPIN',
'M&M',
'M&MFIN',
'MANAPPURAM',
'MARICO',
'MARUTI',
'MCDOWELL-N',
'MFSL',
'MGL',
'MINDTREE',
'MOTHERSUMI',
'MRF',
'MUTHOOTFIN',
'NATIONALUM',
'NAUKRI',
'NESTLEIND',
'NIITTECH',
'NMDC',
'NTPC',
'ONGC',
'PAGEIND',
'PEL',
'PETRONET',
'PFC',
'PIDILITIND',
'PNB',
'POWERGRID',
'PVR',
'RAMCOCEM',
'RBLBANK',
'RECLTD',
'RELIANCE',
'SAIL',
'SBILIFE',
'SBIN',
'SHREECEM',
'SEIMENS',
'SRF',
'SRTRANSFIN',
'SUNPHARMA',
'SUNTV',
'TATACHEM',
'TATACONSUM',
'TATAMOTORS',
'TATAPOWER',
'TATASTEEL',
'TCS',
'TECHM',
'TITAN',
'TORNTPHARM',
'TORNTPOWER',
'TVSMOTOR',
'UBL',
'UJJIVAN',
'ULTRACEMCO',
'UPL',
'VEDL',
'VOLTAS',
'WIPRO',
'ZEEL'
]
df = df[df['SYMBOL'].isin(multisymbols)]
df
These are all the shares in the NSE. Hope you can understand and help me out. I used .groupby(), which successfully gave me the highest CHG_IN_OI, and .agg() to retain the remaining columns, but the data was not correct. I simply want, for every symbol, the row with the highest CHG_IN_OI.
Thanks in Advance!
Although the data differs from that presented in the question, this answers the same problem using daily equity data as an example.
import pandas as pd
import pandas_datareader.data as web
import datetime
with open('./alpha_vantage_api_key.txt') as f:
    api_key = f.read()
start = datetime.datetime(2019, 1, 1)
end = datetime.datetime(2020, 8, 1)
df_all = pd.DataFrame()
symbol = ['AAPL', 'TSLA']
for i in symbol:
    df = web.DataReader(i, 'av-daily', start, end, api_key=api_key)
    df['symbol'] = i
    df_all = pd.concat([df_all, df], axis=0)
df_all.index = pd.to_datetime(df_all.index)
Aggregating a single column
df_all.groupby('symbol')['volume'].agg('max').reset_index()
symbol volume
0 AAPL 106721200
1 TSLA 60938758
Multi-Column Aggregation
df_all.groupby('symbol')[['high','volume']].agg(high=('high','max'), volume=('volume','max'))
high volume
symbol
AAPL 425.66 106721200
TSLA 1794.99 60938758
Extract the target line
symbol_max = df_all.groupby('symbol').apply(lambda x: x.loc[x['volume'].idxmax()]).reset_index(drop=True)
symbol_max
open high low close volume symbol
0 257.26 278.4100 256.37 273.36 106721200 AAPL
1 882.96 968.9899 833.88 887.06 60938758 TSLA
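The idxmax pattern used for symbol_max works on any grouped frame; here is a self-contained sketch with invented numbers, using column names closer to the question's:

```python
import pandas as pd

# Toy frame: several rows per SYMBOL, values invented
df = pd.DataFrame({
    'SYMBOL':    ['ACC', 'ACC', 'ADANIENT', 'ADANIENT'],
    'CHG_IN_OI': [100,   450,   200,        50],
    'CLOSE':     [1500., 1520., 250.,       245.],
})

# idxmax gives, per symbol, the row label of the largest CHG_IN_OI;
# loc then pulls those full rows, keeping all other columns
top = df.loc[df.groupby('SYMBOL')['CHG_IN_OI'].idxmax()]
print(top['CHG_IN_OI'].tolist())  # [450, 200]
```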

Grabbing per minute stock data from a large time range Python

So I'm trying to grab per-minute stock data over a one-year time range. I know the Google Finance API doesn't work anymore, so I did some digging around and found some code from an old GitHub thread that can fetch a range within 5 days from Yahoo Finance data; however, it does no more than that, and even when I pass a keyword like '1Y' it defaults to 1 day. Here is the code below:
import requests
import pandas as pd
import arrow
import datetime
import os
def get_quote_data(symbol='AAPL', data_range='5d', data_interval='1m'):
    res = requests.get('https://query1.finance.yahoo.com/v8/finance/chart/{symbol}?range={data_range}&interval={data_interval}'.format(**locals()))
    data = res.json()
    body = data['chart']['result'][0]
    dt = pd.Series(map(lambda x: arrow.get(x).datetime.replace(tzinfo=None), body['timestamp']), name='Datetime')
    df = pd.DataFrame(body['indicators']['quote'][0], index=dt)
    df = df.loc[:, ('open', 'high', 'low', 'close', 'volume')]
    df.dropna(inplace=True)  # removing NaN rows
    df.columns = ['OPEN', 'HIGH', 'LOW', 'CLOSE', 'VOLUME']  # renaming columns in pandas
    return df
body['meta']['validRanges'] tells you:
['1d', '5d', '1mo', '3mo', '6mo', '1y', '2y', '5y', '10y', 'ytd', 'max']
You are requesting 1Y instead of 1y. This difference is important.
By the way you can load the timestamps much more easily like this:
pd.to_datetime(body['timestamp'], unit='s')
print('stock ticker: {0}'.format(get_quote_data(symbol='AAPL', data_range='1d', data_interval='1m')))
works
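As a small illustration of that timestamp conversion (the epoch values below are arbitrary):

```python
import pandas as pd

# Yahoo's chart endpoint returns epoch seconds; unit='s' decodes them
ts = pd.to_datetime([1483437600, 1483437660], unit='s')
print(ts[0])  # 2017-01-03 10:00:00
```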

Pandas Yahoo Stock API

I am new to Pandas (and Python) and am trying to work with the Yahoo API for stock prices.
I need to get the data, loop through it, and grab the dates and values.
here is the code
df = pd.get_data_yahoo(symbols='AAPL',
                       start=datetime(2011, 1, 1),
                       end=datetime(2012, 1, 1),
                       interval='m')
results are:
df
Open High Low Close Volume
Date
2011-01-03 325.640015 348.600006 324.840027 339.320007 140234700
2011-02-01 341.299988 364.899994 337.720001 353.210022 127618700
2011-03-01 355.470001 361.669983 326.259979 348.510010 125874700
I can get the dates, but not the month value, because the date is the index(?).
What is the best way to loop through the data for this information? This is about processing the data, not sorting or searching it.
If you need to iterate over the rows in your dataframe, and do some processing, then pandas.DataFrame.apply() works great.
Code:
Some mock processing code...
def process_data(row):
    # the index becomes the name when converted to a series (row)
    print(row.name.month, row.Close)
Test Code:
import datetime as dt
from pandas_datareader import data
df = data.get_data_yahoo(
    'AAPL',
    start=dt.datetime(2011, 1, 1),
    end=dt.datetime(2011, 5, 1),
    interval='m')
print(df)
# process each row
df.apply(process_data, axis=1)
Results:
Open High Low Close Volume \
Date
2011-01-03 325.640015 348.600006 324.840027 339.320007 140234700
2011-02-01 341.299988 364.899994 337.720001 353.210022 127618700
2011-03-01 355.470001 361.669983 326.259979 348.510010 125874700
2011-04-01 351.110016 355.130005 320.160004 350.130005 128252100
Adj Close
Date
2011-01-03 43.962147
2011-02-01 45.761730
2011-03-01 45.152802
2011-04-01 45.362682
1 339.320007
2 353.210022
3 348.51001
4 350.130005
here is what made my life groovy when trying to work with the data from Yahoo.
First was getting the date from the dataframe index.
df = df.assign(date=df.index.date)
here are a few others I found helpful from dealing with the data.
df['diff'] = df['Close'].diff()
df['pct_chg'] = df['Close'].pct_change()
df['hl'] = df['High'] - df['Low']
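A quick check of those helpers on a toy frame (numbers invented):

```python
import pandas as pd

df = pd.DataFrame({'Close': [100.0, 110.0, 99.0],
                   'High':  [101.0, 112.0, 100.0],
                   'Low':   [ 99.0, 108.0,  95.0]})

df['diff'] = df['Close'].diff()           # absolute change vs. previous row
df['pct_chg'] = df['Close'].pct_change()  # relative change vs. previous row
df['hl'] = df['High'] - df['Low']         # high-low range per row
print(df[['diff', 'pct_chg', 'hl']])
```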
Pandas is amazing stuff.
I believe this should work for you.
import pandas_datareader.data as web
import datetime
start = datetime.datetime(2013, 1, 1)
end = datetime.datetime(2016, 1, 27)
df = web.DataReader("GOOGL", 'yahoo', start, end)
dates = []
for x in range(len(df)):
    newdate = str(df.index[x])
    newdate = newdate[0:10]
    dates.append(newdate)
df['dates'] = dates
print(df.head())
print(df.tail())
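The slicing loop works, but the same column can come straight off the DatetimeIndex in one vectorized call; a sketch on a synthetic index:

```python
import pandas as pd

df = pd.DataFrame({'Close': [1.0, 2.0]},
                  index=pd.to_datetime(['2013-01-02', '2013-01-03']))

# strftime formats every index entry at once, no Python loop
df['dates'] = df.index.strftime('%Y-%m-%d')
print(df['dates'].tolist())  # ['2013-01-02', '2013-01-03']
```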
Also, take a look at the link below for more helpful hints of how to do these kinds of things.
https://pandas-datareader.readthedocs.io/en/latest/remote_data.html#yahoo-finance
from pandas_datareader import data as pdr
from datetime import date
import yfinance as yf
yf.pdr_override()
import pandas as pd
import requests
import json
from os import listdir
from os.path import isfile, join
# Tickers List
tickers_list = ['AAPL', 'GOOGL','FB', 'WB' , 'MO']
today = date.today()
# We can choose the date range for the data by giving a date bracket
start_date= "2010-01-01"
files=[]
def getData(ticker):
    print(ticker)
    data = pdr.get_data_yahoo(ticker, start=start_date, end=today)
    dataname = ticker + '_' + str(today)
    files.append(dataname)
    SaveData(data, dataname)

# Create a data folder to save these data files in
def SaveData(df, filename):
    df.to_csv('./data/' + filename + '.csv')

for tik in tickers_list:
    getData(tik)

Python Pandas join dataframes on index

I am trying to join two dataframes on the same column "Date". The code is as follows:
import pandas as pd
from datetime import datetime
df_train_csv = pd.read_csv('./train.csv',parse_dates=['Date'],index_col='Date')
start = datetime(2010, 2, 5)
end = datetime(2012, 10, 26)
df_train_fly = pd.date_range(start, end, freq="W-FRI")
df_train_fly = pd.DataFrame(pd.Series(df_train_fly), columns=['Date'])
merged = df_train_csv.join(df_train_fly.set_index(['Date']), on = ['Date'], how = 'right', lsuffix='_x')
It complains that dataframe df_train_csv has no column named "Date". I'd like to set "Date" as the index in both dataframes, and I am wondering what the best way is to join dataframes with the date as the index.
UPDATE:
That is the sample data
Date,Weekly_Sales
2010-02-05,24924.5
2010-02-12,46039.49
2010-02-19,41595.55
2010-02-26,19403.54
2010-03-05,21827.9
2010-03-12,21043.39
2010-03-19,22136.64
2010-03-26,26229.21
2010-04-02,57258.43
2010-04-09,42960.91
2010-04-16,17596.96
2010-04-23,16145.35
2010-04-30,16555.11
2010-05-07,17413.94
2010-05-14,18926.74
2010-05-21,14773.04
2010-05-28,15580.43
2010-06-04,17558.09
2010-06-11,16637.62
2010-06-18,16216.27
2010-06-25,16328.72
2010-07-02,16333.14
2010-07-09,17688.76
2010-07-16,17150.84
2010-07-23,15360.45
2010-07-30,15381.82
2010-08-06,17508.41
2010-08-13,15536.4
2010-08-20,15740.13
2010-08-27,15793.87
2010-09-03,16241.78
2010-09-10,18194.74
2010-09-17,19354.23
2010-09-24,18122.52
2010-10-01,20094.19
2010-10-08,23388.03
2010-10-15,26978.34
2010-10-22,25543.04
2010-10-29,38640.93
2010-11-05,34238.88
2010-11-12,19549.39
2010-11-19,19552.84
2010-11-26,18820.29
2010-12-03,22517.56
2010-12-10,31497.65
2010-12-17,44912.86
2010-12-24,55931.23
2010-12-31,19124.58
2011-01-07,15984.24
2011-01-14,17359.7
2011-01-21,17341.47
2011-01-28,18461.18
2011-02-04,21665.76
2011-02-11,37887.17
2011-02-18,46845.87
2011-02-25,19363.83
2011-03-04,20327.61
2011-03-11,21280.4
2011-03-18,20334.23
2011-03-25,20881.1
2011-04-01,20398.09
2011-04-08,23873.79
2011-04-15,28762.37
2011-04-22,50510.31
2011-04-29,41512.39
2011-05-06,20138.19
2011-05-13,17235.15
2011-05-20,15136.78
2011-05-27,15741.6
2011-06-03,16434.15
2011-06-10,15883.52
2011-06-17,14978.09
2011-06-24,15682.81
2011-07-01,15363.5
2011-07-08,16148.87
2011-07-15,15654.85
2011-07-22,15766.6
2011-07-29,15922.41
2011-08-05,15295.55
2011-08-12,14539.79
2011-08-19,14689.24
2011-08-26,14537.37
2011-09-02,15277.27
2011-09-09,17746.68
2011-09-16,18535.48
2011-09-23,17859.3
2011-09-30,18337.68
2011-10-07,20797.58
2011-10-14,23077.55
2011-10-21,23351.8
2011-10-28,31579.9
2011-11-04,39886.06
2011-11-11,18689.54
2011-11-18,19050.66
2011-11-25,20911.25
2011-12-02,25293.49
2011-12-09,33305.92
2011-12-16,45773.03
2011-12-23,46788.75
2011-12-30,23350.88
2012-01-06,16567.69
2012-01-13,16894.4
2012-01-20,18365.1
2012-01-27,18378.16
2012-02-03,23510.49
2012-02-10,36988.49
2012-02-17,54060.1
2012-02-24,20124.22
2012-03-02,20113.03
2012-03-09,21140.07
2012-03-16,22366.88
2012-03-23,22107.7
2012-03-30,28952.86
2012-04-06,57592.12
2012-04-13,34684.21
2012-04-20,16976.19
2012-04-27,16347.6
2012-05-04,17147.44
2012-05-11,18164.2
2012-05-18,18517.79
2012-05-25,16963.55
2012-06-01,16065.49
2012-06-08,17666
2012-06-15,17558.82
2012-06-22,16633.41
2012-06-29,15722.82
2012-07-06,17823.37
2012-07-13,16566.18
2012-07-20,16348.06
2012-07-27,15731.18
2012-08-03,16628.31
2012-08-10,16119.92
2012-08-17,17330.7
2012-08-24,16286.4
2012-08-31,16680.24
2012-09-07,18322.37
2012-09-14,19616.22
2012-09-21,19251.5
2012-09-28,18947.81
2012-10-05,21904.47
2012-10-12,22764.01
2012-10-19,24185.27
2012-10-26,27390.81
I will read it from a csv file. But sometimes, some weeks may be missing. Therefore, I am trying to generate a date range like this:
df_train_fly = pd.date_range(start, end, freq="W-FRI")
This generated dataframe contains all the weeks in the range, so I need to merge those two dataframes into one.
If I check df_train_csv['Date'] and df_train_fly['Date'] from the iPython console, they both showed as dtype: datetime64[ns]
So let's dissect this:
df_train_csv = pd.read_csv('./train.csv',parse_dates=['Date'],index_col='Date')
OK, the first problem here is that you have specified that the index column should be 'Date'; this means that you will no longer have a 'Date' column.
start = datetime(2010, 2, 5)
end = datetime(2012, 10, 26)
df_train_fly = pd.date_range(start, end, freq="W-FRI")
df_train_fly = pd.DataFrame(pd.Series(df_train_fly), columns=['Date'])
merged = df_train_csv.join(df_train_fly.set_index(['Date']), on = ['Date'], how = 'right', lsuffix='_x')
So the above join will not work as the error reported so in order to fix this:
# remove the index_col param
df_train_csv = pd.read_csv('./train.csv',parse_dates=['Date'])
# don't set the index on df_train_fly
merged = df_train_csv.join(df_train_fly, on = ['Date'], how = 'right', lsuffix='_x')
OR don't set the 'on' param:
merged = df_train_csv.join(df_train_fly, how = 'right', lsuffix='_x')
the above will use the index of both df's to join on
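On two tiny frames, that index-on-index right join looks like this (the sales values are taken from the sample data above; the point is that missing Fridays come back as NaN):

```python
import pandas as pd

fridays = pd.to_datetime(['2010-02-05', '2010-02-12', '2010-02-19'])

# Sales data with one Friday missing
sales = pd.DataFrame({'Weekly_Sales': [24924.5, 46039.49]}, index=fridays[:2])

# Every Friday in the range; no columns needed, only the index matters
all_fridays = pd.DataFrame(index=fridays)

# Right join keeps every Friday; weeks absent from sales become NaN
merged = sales.join(all_fridays, how='right')
print(merged)
```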
You can also achieve the same result by performing a merge instead:
merged = df_train_csv.merge(df_train_fly.set_index(['Date']), left_index=True, right_index=True, how='right')
