How to pick out individual columns of numerical values from a pandas DataReader result?


import pandas.io.data as web
import datetime
import matplotlib.pyplot as plt
start = datetime.datetime.strptime('2/10/2016', '%m/%d/%Y')
end = datetime.datetime.strptime('2/24/2016', '%m/%d/%Y')
f = web.DataReader(['GOOG','AAPL'], 'yahoo', start, end)
#print 'Volume'
wha = f[['Adj Close']] #pick out Adj Close
x=wha[0,:]
print x.shape
ax = f['Adj Close'].plot(grid=True, fontsize=10, rot=45.)
ax.set_ylabel('Adjusted Closing Price ($)')
plt.legend(loc='upper center', ncol=2, bbox_to_anchor=(0.5,1.1), shadow=True, fancybox=True, prop={'size':10})
#plt.show()
As you can see above, I'm trying to pick out numerical values of individual stock prices for data manipulation.
With
#print wha[1,:]
x=wha[0,:]
print x.shape
I could get it down to a 9x2 matrix with two columns, one for GOOG and one for AAPL, and 9 prices each.
I tried
print type(x)
and see that it's
<class 'pandas.core.frame.DataFrame'>
and by means of
wha2=x.values.tolist()
I was able to pick out the stock prices.
Is there an easy way for me to now plot the prices of one stock (AAPL alone, for example) against the dates?

What is more tractable for data manipulation than a pandas DataFrame?
>>> f['Adj Close'].iloc[:8, :2]
AAPL GOOG
Date
2016-02-10 94.269997 684.119995
2016-02-11 93.699997 683.109985
2016-02-12 93.989998 682.400024
2016-02-16 96.639999 691.000000
2016-02-17 98.120003 708.400024
2016-02-18 96.260002 697.349976
2016-02-19 96.040001 700.909973
2016-02-22 96.879997 706.460022
From your panel data, I first select the Adj Close item. I then use iloc for position-based filtering, selecting the first eight rows (positions 0-7) and the first two columns (positions 0-1).
To just get adj close for Apple:
>>> f['Adj Close'].loc[:, 'AAPL']
Date
2016-02-10 94.269997
2016-02-11 93.699997
2016-02-12 93.989998
2016-02-16 96.639999
2016-02-17 98.120003
2016-02-18 96.260002
2016-02-19 96.040001
2016-02-22 96.879997
2016-02-23 94.690002
Name: AAPL, dtype: float64
Here is a link to indexing in the documentation.
http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-and-selecting-data
>>> f['Adj Close'].corr()
AAPL GOOG
AAPL 1.00000 0.87332
GOOG 0.87332 1.00000
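The selections above can be tried without a network call. This is a minimal sketch, assuming the downloaded data arrives as a DataFrame whose columns are an (attribute, ticker) MultiIndex; the prices are random stand-ins, not real quotes, and the variable names are only illustrative.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the downloaded data (made-up values, not real quotes).
dates = pd.bdate_range("2016-02-10", periods=9, name="Date")
columns = pd.MultiIndex.from_product([["Adj Close", "Volume"], ["AAPL", "GOOG"]])
rng = np.random.default_rng(0)
f = pd.DataFrame(rng.uniform(90, 700, size=(9, 4)), index=dates, columns=columns)

adj_close = f["Adj Close"]            # DataFrame with one column per ticker
first_rows = adj_close.iloc[:8, :2]   # positions 0-7 of the rows, 0-1 of the columns
aapl = adj_close.loc[:, "AAPL"]       # Series of AAPL prices indexed by Date
aapl_values = aapl.to_numpy()         # plain NumPy array, if raw numbers are needed
aapl_dates = aapl.index               # the matching DatetimeIndex

print(first_rows.shape)
print(aapl.head())
```

Because the Series keeps its Date index, calling aapl.plot() should draw the prices of that one stock against the dates without any conversion to lists.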

Related

How to plot a variable dataframe

I have a dataframe with a variable number of stock prices. In other words, I have to be able to plot the entire DataFrame, because I may encounter anywhere from 1 to 10 stock prices.
The x axis holds the dates and the y axis holds the stock prices. Here is a sample of my df:
df = pd.DataFrame(all_Assets)
df2 = df.transpose()
print(df2)
Close Close Close
Date
2018-12-12 00:00:00-05:00 40.802803 24.440001 104.500526
2018-12-13 00:00:00-05:00 41.249191 25.119333 104.854965
2018-12-14 00:00:00-05:00 39.929325 24.380667 101.578560
2018-12-17 00:00:00-05:00 39.557732 23.228001 98.570381
2018-12-18 00:00:00-05:00 40.071678 22.468666 99.605057
This is not working
fig = go.Figure(data=go.Scatter(df2, mode='lines'),)
I need to plot this entire dataframe on a single chart, with 3 different lines. But the code has to adapt automatically if there is a fourth stock, a fifth stock, etc. By the way, I want it to be a logarithmic plot.

The Plotly reference includes a sample stock dataset, so let's graph it four ways: in wide and long format with Plotly Express, and in wide and long format with graph objects. You can choose whichever of these four fits what you need.
express wide format
df.head()
date GOOG AAPL AMZN FB NFLX MSFT
0 2018-01-01 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
1 2018-01-08 1.018172 1.011943 1.061881 0.959968 1.053526 1.015988
2 2018-01-15 1.032008 1.019771 1.053240 0.970243 1.049860 1.020524
3 2018-01-22 1.066783 0.980057 1.140676 1.016858 1.307681 1.066561
4 2018-01-29 1.008773 0.917143 1.163374 1.018357 1.273537 1.040708
import plotly.express as px
df = px.data.stocks()
fig = px.line(df, x='date', y=df.columns[1:])
fig.show()
express long format
df_long = df.melt(id_vars='date', value_vars=df.columns[1:],var_name='ticker')
px.line(df_long, x='date', y='value', color='ticker')
graph_objects wide format
import plotly.graph_objects as go
fig = go.Figure()
for ticker in df.columns[1:]:
    fig.add_trace(go.Scatter(x=df['date'], y=df[ticker], name=ticker))
fig.show()
graph_objects long format
fig = go.Figure()
for ticker in df_long.ticker.unique():
    # @ticker refers to the Python loop variable inside the query string
    dff = df_long.query('ticker == @ticker')
    fig.add_trace(go.Scatter(x=dff['date'], y=dff['value'], name=ticker))
fig.show()

I recommend using pandas.DataFrame.plot. A minimal working example for your case should be just
df2.plot()
Then experiment with the arguments of plot() and your df2 dataframe to get exactly the output you need.

What is the most efficient and pythonic way to calculate (TI) over a large data set?

I have tested a few different ways to calculate technical indicators for a large dataframe and am unsure how to determine the most efficient and pythonic way to go about it. The data is stock data (date, price, volume). The goal is to iterate through the dataframe, per ticker, calculating multiple technical indicators, and then sending the result back into the source (SQL db).
The data contains about 4,200 stock symbols with daily price data from 2000 to date (roughly 13m rows x 8 columns).
For testing, I've limited the data to just 2021 date range.
Here is a sample of the data:
Date Open High Low Close Adj_close Volume Tick
529326 2021-01-04 3270.00 3272.00 3144.02 3186.63 3186.63 4411400 AMZN
521846 2021-01-05 3166.01 3223.38 3165.06 3218.51 3218.51 2655500 AMZN
521691 2021-01-06 3146.48 3197.51 3131.16 3138.38 3138.38 4394800 AMZN
514195 2021-01-07 3157.00 3208.54 3155.00 3162.16 3162.16 3514500 AMZN
514038 2021-01-08 3180.00 3190.64 3142.20 3182.70 3182.70 3537700 AMZN
506535 2021-01-11 3148.01 3156.38 3110.00 3114.21 3114.21 3683400 AMZN
506376 2021-01-12 3120.00 3142.14 3086.00 3120.83 3120.83 3514600 AMZN
498871 2021-01-13 3128.44 3189.95 3122.08 3165.89 3165.89 3321200 AMZN
498706 2021-01-14 3167.52 3178.00 3120.59 3127.47 3127.47 3070900 AMZN
491194 2021-01-15 3123.02 3142.55 3095.17 3104.25 3104.25 4244000 AMZN
491037 2021-01-19 3107.00 3145.00 3096.00 3120.76 3120.76 3305100 AMZN
483504 2021-01-20 3181.99 3279.80 3175.00 3263.38 3263.38 5309800 AMZN
483351 2021-01-21 3293.00 3348.55 3289.57 3306.99 3306.99 4936100 AMZN
475802 2021-01-22 3304.31 3321.91 3283.16 3292.23 3292.23 2821900 AMZN
475649 2021-01-25 3328.50 3363.89 3243.15 3294.00 3294.00 3749800 AMZN
468087 2021-01-26 3296.36 3338.00 3282.87 3326.13 3326.13 2955200 AMZN
467939 2021-01-27 3341.49 3346.52 3207.08 3232.58 3232.58 4660200 AMZN
460368 2021-01-28 3235.04 3301.68 3228.69 3237.62 3237.62 3149200 AMZN
460219 2021-01-29 3230.00 3236.99 3184.55 3206.20 3206.20 4293600 AMZN
452618 2021-02-01 3242.36 3350.26 3235.03 3342.88 3342.88 4160200 AMZN
I'm not sure how to fully code dummy data, but here are two snippets (they should only need NumPy) that create random price and ticker data. I am just unsure how to merge them into a dataframe. To simulate the same dataframe, there would be 4,200 symbols and 134 days of data.
letters = ('a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z',)
x=np.random.randint(500, size=(134)) # <<< generates random price
y=''.join(np.random.choice(letters) for i in range(4)) # <<< generate random 4 character string
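One possible way to merge the two snippets above into a single frame is sketched below. It is only an illustration: the function name make_dummy_prices, the start date, and the reduced column set (Date, Tick, Adj_close) are assumptions, and the prices are random numbers rather than realistic quotes.

```python
import string

import numpy as np
import pandas as pd


def make_dummy_prices(n_ticks=4200, n_days=134, seed=0):
    """Build a long-format frame with one random price per (Date, Tick) pair."""
    rng = np.random.default_rng(seed)
    letters = list(string.ascii_lowercase)

    # Draw random 4-letter symbols until there are n_ticks distinct ones.
    ticks = set()
    while len(ticks) < n_ticks:
        ticks.add("".join(rng.choice(letters, size=4)))
    ticks = sorted(ticks)

    # Business days only; the start date is an arbitrary assumption.
    dates = pd.bdate_range("2021-01-04", periods=n_days)

    # Every date paired with every symbol, then one random price per pair.
    index = pd.MultiIndex.from_product([dates, ticks], names=["Date", "Tick"])
    prices = rng.integers(1, 500, size=len(index)).astype(float)
    return pd.DataFrame({"Adj_close": prices}, index=index).reset_index()


# Small sizes for a quick check; the defaults match the sizes in the question.
sc_dummy = make_dummy_prices(n_ticks=10, n_days=20)
print(sc_dummy.head())
```

The result is sorted by Date and then Tick, which mirrors the layout of the 'sc' sample shown in the question.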
Here are all the imports being used:
#imports
from datetime import datetime, timedelta, date
import time
import sqlalchemy as sa
import pandas as pd
import numpy as np
import yfinance as yf
import pyodbc
import pandas_ta as ta
import talib
Dataframe 'sc' is referenced, with data in the following format:
Date Open High Low Close Adj_close Volume Tick
529377 2021-01-04 38.68 38.69 37.18 37.88 37.88 647700 ACIW
526834 2021-01-04 29.72 29.94 28.68 29.10 29.10 1527600 GOOS
526833 2021-01-04 15.35 15.40 14.92 15.01 14.39 421400 ETV
526832 2021-01-04 42.22 42.36 41.13 41.46 40.84 204000 HMN
526831 2021-01-04 13.94 15.72 13.75 15.38 15.38 880500 GATO
Then I want to iterate through the dataframe (which is sorted by Date) for each ticker and calculate a number of technical indicators. For now I am starting with just two moving average calculations. I've tried 3 different methods and compared the times below.
Talib package: 10 minutes
start_time = time.time()
ticks = pd.unique(sc['Tick'].tolist()) # <<< 4,200 unique tickers
ndf = [] # <<< initialize df
for tick in ticks:
    # store ID(symbol), Date, Close(adj_close), and two indicators (SMA,EMA) in variables to
    # concat into a temporary df, and then append outside of loop. Not sure if this is most
    # efficient/pythonic way to do this.
    ID = sc[sc["Tick"]==tick]["Tick"]
    DATE = sc[sc["Tick"]==tick]["Date"]
    CLOSE = sc[sc["Tick"]==tick]["Adj_close"]
    SMA = round(talib.SMA(sc[sc["Tick"]==tick]['Adj_close']),2)
    EMA = round(talib.EMA(sc[sc["Tick"]==tick]['Adj_close']),2)
    #concat into one df
    tempdf = pd.concat([ID, DATE, CLOSE, SMA, EMA], axis=1)
    #append into main df outside of loop
    ndf.append(tempdf)
    print("Completed Indicators for "+tick)
# Concat everything in -ndf into a flattened df (-df)
df = pd.concat(ndf)
df['t_id'] = df['Tick']+'-'+df['Date']
df.rename(columns={'Adj_close':'Close', 0: "SMA", 1: "EMA"},inplace=True)
df=df.sort_values(by='Date')
print("--- %s seconds ---" % (time.time() - start_time))
df.tail(20)
Pandas TA: 13.6 minutes
#Swapped the "talib" lines for Pandas-TA package
SMA5 = ta.sma(sc[sc["Tick"]==tick]['Adj_close'], length=5)
SMA15 = ta.sma(sc[sc["Tick"]==tick]['Adj_close'], length=15)
Rolling method / Pandas: 14 minutes
# Swapped the "talib" lines for Rolling():
SMA5 = sc[sc["Tick"]==tick]['Adj_close'].rolling(5,min_periods=1).mean()
SMA15 = sc[sc["Tick"]==tick]['Adj_close'].rolling(15,min_periods=1).mean()
I am unsure how to gauge what an efficient time would be (is 10 minutes generally good, bad, or does it just depend on personal requirements?) and whether looping through each ticker, storing each indicator separately, concatenating, and finally appending back into a master dataframe is an appropriately pythonic approach. The final code will insert the complete dataframe back into a SQL table.
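For comparison, the per-ticker loop can be expressed with a single groupby, which avoids re-filtering the whole frame with sc[sc["Tick"]==tick] several times per ticker. The sketch below uses a tiny made-up frame and pandas' own rolling and ewm methods. It is not a drop-in replacement for the TA-Lib calls: the window length of 5 is an assumption, and TA-Lib handles the warm-up period and EMA seeding differently, so the early values may not match.

```python
import pandas as pd

# Tiny made-up frame in the same long layout as 'sc' (two tickers, six days).
sc = pd.DataFrame({
    "Date": list(pd.bdate_range("2021-01-04", periods=6)) * 2,
    "Tick": ["AAA"] * 6 + ["BBB"] * 6,
    "Adj_close": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
})

# Rolling windows depend on row order, so sort within each ticker first.
sc = sc.sort_values(["Tick", "Date"])

# One groupby; each transform runs per ticker and returns a Series aligned
# with the original rows, so windows never leak across tickers.
grouped = sc.groupby("Tick")["Adj_close"]
sma5 = grouped.transform(lambda s: s.rolling(5, min_periods=1).mean()).round(2)
ema5 = grouped.transform(lambda s: s.ewm(span=5, adjust=False).mean()).round(2)

sc["SMA5"] = sma5
sc["EMA5"] = ema5
print(sc)
```

Whether this is fast enough on the full 13m rows is something to measure with the same time.time() pattern used above; the sketch only shows the shape of the approach.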

Matplotlib bar chart on datetime index values

I'm having trouble getting the following code to display a bar chart properly. The plot has very thin lines which are not visible until you zoom in, but even then it's not clear. I've tried to control with the width option to plt.bar() but it's not doing anything (e.g. tried 0.1, 1, 365).
Any pointers on what I'm doing wrong would be appreciated.
Many thanks
import pandas as pd
import numpy as np
import matplotlib.pylab as plt
import matplotlib.dates as mdates
plt.close('all')
mydateparser2 = lambda x: pd.datetime.strptime(x, "%m/%d/%Y")
colnames2=['Date','Net sales', 'Cost of sales']
df2 = pd.read_csv(r'account-test.csv', parse_dates = ['Date'] , date_parser = mydateparser2, index_col='Date')
df2= df2.filter(items=colnames2)
df2 = df2.sort_values('Date')
print (df2.info())
print (df2)
fig = plt.figure()
plt.bar(df2.index.values, df2['Net sales'], color='red', label='Net sales' )
plt.ylim(500000,2800000)
plt.show()
plt.legend(loc=4)
Resulting output (to show data types)
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 15 entries, 2005-12-31 to 2019-12-31
Data columns (total 2 columns):
Net sales 15 non-null int64
Cost of sales 15 non-null int64
dtypes: int64(2)
memory usage: 360.0 bytes
None
Net sales Cost of sales
Date
2005-12-31 1161400 907200
2006-12-31 1193100 928300
2007-12-31 1171100 888100
2008-12-31 1324900 1035700
2009-12-31 1108300 859800
2010-12-31 1173600 891000
2011-12-31 1392400 1050300
2012-12-31 1578200 1171500
2013-12-31 1678200 1224200
2014-12-31 1855500 1346700
2015-12-31 1861200 1328400
2016-12-31 2004300 1439700
2017-12-31 1973300 1421500
2018-12-31 2189100 1608300
2019-12-31 2355700 1715300

Maybe you are trying to plot too many bars on a small plot. Try fig = plt.figure(figsize=(12,6)) to have a bigger plot. You can also pass width=0.9 to your bar command:
fig, ax = plt.subplots(figsize=(12,6))
df.plot.bar(y='Net sales', width=0.9, ax=ax) # modify width to your liking

Pandas dataframe groupby plot

I have a dataframe which is structured as:
Date ticker adj_close
0 2016-11-21 AAPL 111.730
1 2016-11-22 AAPL 111.800
2 2016-11-23 AAPL 111.230
3 2016-11-25 AAPL 111.790
4 2016-11-28 AAPL 111.570
...
8 2016-11-21 ACN 119.680
9 2016-11-22 ACN 119.480
10 2016-11-23 ACN 119.820
11 2016-11-25 ACN 120.740
...
How can I plot based on the ticker the adj_close versus Date?

For a simple plot, you can use:
df.plot(x='Date',y='adj_close')
Or you can set the index to be Date beforehand, then it's easy to plot the column you want:
df.set_index('Date', inplace=True)
df['adj_close'].plot()
If you want a single chart with one series per ticker, you need to group by ticker first:
df.set_index('Date', inplace=True)
df.groupby('ticker')['adj_close'].plot(legend=True)
If you want a chart with individual subplots:
grouped = df.groupby('ticker')
ncols=2
nrows = int(np.ceil(grouped.ngroups/ncols))
fig, axes = plt.subplots(nrows=nrows, ncols=ncols, figsize=(12,4), sharey=True)
for (key, ax) in zip(grouped.groups.keys(), axes.flatten()):
    grouped.get_group(key).plot(ax=ax)
    ax.legend()
plt.show()

Similar to Julien's answer above, I had success with the following:
fig, ax = plt.subplots(figsize=(10,4))
for key, grp in df.groupby('ticker'):
    ax.plot(grp['Date'], grp['adj_close'], label=key)
ax.legend()
plt.show()
This solution might be more relevant if you want more control in matplotlib.
Solution inspired by: https://stackoverflow.com/a/52526454/10521959

The question is How can I plot based on the ticker the adj_close versus Date?
This can be accomplished by reshaping the dataframe to a wide format with .pivot or .groupby, or by plotting the existing long form dataframe directly with seaborn.
In the following sample data, the 'Date' column has a datetime64[ns] Dtype.
Convert the Dtype with pandas.to_datetime if needed.
Tested in python 3.10, pandas 1.4.2, matplotlib 3.5.1, seaborn 0.11.2
Imports and Sample Data
import pandas as pd
import pandas_datareader as web # for sample data; this can be installed with conda if using Anaconda, otherwise pip
import seaborn as sns
import matplotlib.pyplot as plt
# sample stock data, where .iloc[:, [5, 6]] selects only the 'Adj Close' and 'ticker' columns
# and .reset_index() turns the 'Date' index into a regular column, as in the output below
tickers = ['aapl', 'acn']
df = pd.concat((web.DataReader(ticker, data_source='yahoo', start='2020-01-01', end='2022-06-21')
                .assign(ticker=ticker) for ticker in tickers)).iloc[:, [5, 6]].reset_index()
# display(df.head())
Date Adj Close ticker
0 2020-01-02 73.785904 aapl
1 2020-01-03 73.068573 aapl
2 2020-01-06 73.650795 aapl
3 2020-01-07 73.304420 aapl
4 2020-01-08 74.483604 aapl
# display(df.tail())
Date Adj Close ticker
1239 2022-06-14 275.119995 acn
1240 2022-06-15 281.190002 acn
1241 2022-06-16 270.899994 acn
1242 2022-06-17 275.380005 acn
1243 2022-06-21 282.730011 acn
pandas.DataFrame.pivot & pandas.DataFrame.plot
pandas plots with matplotlib as the default backend.
Reshaping the dataframe with pandas.DataFrame.pivot converts from long to wide form, and puts the dataframe into the correct format to plot.
.pivot does not aggregate data, so if there is more than 1 observation per index, per ticker, then use .pivot_table
Adding subplots=True will produce a figure with two subplots.
# reshape the long form data into a wide form
dfp = df.pivot(index='Date', columns='ticker', values='Adj Close')
# display(dfp.head())
ticker aapl acn
Date
2020-01-02 73.785904 203.171112
2020-01-03 73.068573 202.832764
2020-01-06 73.650795 201.508224
2020-01-07 73.304420 197.157654
2020-01-08 74.483604 197.544434
# plot
ax = dfp.plot(figsize=(11, 6))
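The difference between .pivot and .pivot_table mentioned above can be seen on a small made-up frame. This is only a sketch: the prices are invented, and the choice of aggfunc="mean" for the duplicated observation is an assumption, since the right aggregation depends on the data.

```python
import pandas as pd

# Long-form data where 'acn' has two observations on 2020-01-03 (made-up values).
long_df = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2020-01-02", "2020-01-02", "2020-01-03", "2020-01-03", "2020-01-03"]
    ),
    "ticker": ["aapl", "acn", "aapl", "acn", "acn"],
    "Adj Close": [73.0, 203.0, 74.0, 202.0, 204.0],
})

# .pivot cannot reshape when a (Date, ticker) pair occurs more than once.
try:
    long_df.pivot(index="Date", columns="ticker", values="Adj Close")
    pivot_failed = False
except ValueError as err:
    pivot_failed = True
    print("pivot raised:", err)

# .pivot_table aggregates the duplicates instead, here by taking the mean.
wide = long_df.pivot_table(
    index="Date", columns="ticker", values="Adj Close", aggfunc="mean"
)
print(wide)
```

The resulting wide frame has one column per ticker and can be passed to .plot() in the same way as dfp.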
Use seaborn, which accepts long form data, so reshaping the dataframe to a wide form isn't necessary.
seaborn is a high-level api for matplotlib
sns.lineplot: axes-level plot
fig, ax = plt.subplots(figsize=(11, 6))
sns.lineplot(data=df, x='Date', y='Adj Close', hue='ticker', ax=ax)
sns.relplot: figure-level plot
Adding row='ticker', or col='ticker', will generate a figure with two subplots.
g = sns.relplot(kind='line', data=df, x='Date', y='Adj Close', hue='ticker', aspect=1.75)

Is there a way to plot a pandas series in ggplot?

I'm experimenting with pandas and non-matplotlib plotting. Good suggestions are here. This question regards yhat's ggplot and I am running into two issues.
Plotting a series in pandas is easy.
frequ.plot()
I don't see how to do this in the ggplot docs. Instead I end up creating a dataframe:
cheese = DataFrame({'time': frequ.index, 'count' : frequ.values})
ggplot(cheese, aes(x='time', y='count')) + geom_line()
I would expect ggplot -- a project that has "tight integration with pandas" -- to have a way to plot a simple series.
Second issue is I can't get stat_smooth() to display when the x axis is time of day. Seems like it could be related to this post, but I don't have the rep to post there. My code is:
frequ = values.sampler.resample("1Min", how="count")
cheese = DataFrame({'time': frequ.index, 'count' : frequ.values})
ggplot(cheese, aes(x='time', y='count')) + geom_line() + stat_smooth()
Any help regarding non-matplotlib plotting would be appreciated. Thanks!
(I'm using ggplot 0.5.8)

I run into this problem frequently in Python's ggplot when working with multiple stock prices and economic timeseries. The key to remember with ggplot is that data is best organized in long format to avoid any issues. I use a quick two step process as a workaround. First let's grab some stock data:
import pandas.io.data as web
import pandas as pd
import time
from ggplot import *
stocks = [ 'GOOG', 'MSFT', 'LNKD', 'YHOO', 'FB', 'GOOGL','HPQ','AMZN'] # stock list
# get stock price function #
def get_px(stock, start, end):
    return web.get_data_yahoo(stock, start, end)['Adj Close']
# dataframe of equity prices
px = pd.DataFrame({n: get_px(n, '1/1/2014', date_today) for n in stocks})
px.head()
AMZN FB GOOG GOOGL HPQ LNKD MSFT YHOO
Date
2014-01-02 397.97 54.71 NaN 557.12 27.40 207.64 36.63 39.59
2014-01-03 396.44 54.56 NaN 553.05 28.07 207.42 36.38 40.12
2014-01-06 393.63 57.20 NaN 559.22 28.02 203.92 35.61 39.93
2014-01-07 398.03 57.92 NaN 570.00 27.91 209.64 35.89 40.92
2014-01-08 401.92 58.23 NaN 571.19 27.19 209.06 35.25 41.02
First understand that ggplot needs the datetime index to be a column in the pandas dataframe in order to plot correctly when switching from wide to long format. I wrote a function to address this particular point. It simply creates a 'Date' column of type=datetime from the pandas series index.
def dateConvert(df):
    df['Date'] = df.index
    df.reset_index(drop=True)
    return df
From there run the function on the df. Use the result as the object in pandas pd.melt using the 'Date' as the id_vars. The returned df is now ready to be plotted using the standard ggplot() format.
px_returns = px.pct_change() # common stock transformation
cumRet = (1+px_returns).cumprod() - 1 # transform daily returns to cumulative
cumRet_dateConverted = dateConvert(cumRet) # run the function here see the result below#
cumRet_dateConverted.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 118 entries, 2014-01-02 00:00:00 to 2014-06-20 00:00:00
Data columns (total 9 columns):
AMZN 117 non-null float64
FB 117 non-null float64
GOOG 59 non-null float64
GOOGL 117 non-null float64
HPQ 117 non-null float64
LNKD 117 non-null float64
MSFT 117 non-null float64
YHOO 117 non-null float64
Date 118 non-null datetime64[ns]
dtypes: datetime64[ns](1), float64(8)
data = pd.melt(cumRet_dateConverted, id_vars='Date').dropna() # Here is the method I use to format the data in the long format. Please note the use of 'Date' as the id_vars.
data = data.rename(columns = {'Date':'Date','variable':'Stocks','value':'Returns'}) # common to rename these columns
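The wide-to-long steps above can also be written as one chain. The following is a minimal sketch on made-up prices, assuming the index is named 'Date'; reset_index() takes the place of the helper function, and melt names the new columns directly instead of renaming them afterwards.

```python
import pandas as pd

# Made-up wide price table: one column per stock, dates in the index.
px_wide = pd.DataFrame(
    {"AAA": [100.0, 110.0, 121.0], "BBB": [50.0, 45.0, 54.0]},
    index=pd.to_datetime(["2014-01-02", "2014-01-03", "2014-01-06"]).rename("Date"),
)

# Daily returns compounded into cumulative returns (first row is NaN).
cum_ret = (1 + px_wide.pct_change()).cumprod() - 1

# Move the Date index into a column, then reshape from wide to long.
long_ret = (
    cum_ret.reset_index()
    .melt(id_vars="Date", var_name="Stocks", value_name="Returns")
    .dropna()
)
print(long_ret)
```

The long frame has exactly the three columns Date, Stocks and Returns, which is the layout the plotting call below expects.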
From here you can now plot your data however you want. A common plot I use is the following:
retPlot_YTD = ggplot(data, aes('Date','Returns',color='Stocks')) \
+ geom_line(size=2.) \
+ geom_hline(yintercept=0, color='black', size=1.7, linetype='-.') \
+ scale_y_continuous(labels='percent') \
+ scale_x_date(labels='%b %d %y',breaks=date_breaks('week') ) \
+ theme_seaborn(style='whitegrid') \
+ ggtitle(('%s Cumulative Daily Return vs Peers_YTD') % key_Stock)
fig = retPlot_YTD.draw()
ax = fig.axes[0]
offbox = ax.artists[0]
offbox.set_bbox_to_anchor((1, 0.5), ax.transAxes)
fig.show()

This is more of a workaround but you can use qplot for quick, shorthand plots using series.
from ggplot import *
qplot(meat.beef)
