Pulling a specific value from a Pandas DataReader dataframe? - python

Here is the code I am running:
def competitor_stock_data_report():
    import datetime
    import pandas_datareader.data as web

    date_time = datetime.datetime.now()
    date = date_time.date()
    stocklist = ['LAZ','AMG','BEN','LM','EVR','GHL','HLI','MC','PJT','MS','GS','JPM','AB']
    start = datetime.datetime(date.year-1, date.month, date.day-1)
    end = datetime.datetime(date.year, date.month, date.day-1)
    for x in stocklist:
        df = web.DataReader(x, 'google', start, end)
        print(df)
        print(df.loc[df['Date'] == start]['Close'].values)
The problem is in the last line: how do I pull the 'Close' value for the specified date?
Open High Low Close Volume
Date
2016-08-02 35.22 35.25 33.66 33.75 861111
2016-08-03 33.57 34.72 33.42 34.25 921401
2016-08-04 33.89 34.22 33.77 34.07 587016
2016-08-05 34.55 34.94 34.31 34.35 463317
2016-08-08 34.54 34.75 34.31 34.74 958230
2016-08-09 34.68 35.12 34.64 34.87 732959
I would like to get 33.75, for example, but the date changes dynamically.
Any suggestions?

IMO the easiest way to get a column's value in the first row:
In [40]: df
Out[40]:
Open High Low Close Volume
Date
2016-08-03 767.18 773.21 766.82 773.18 1287421
2016-08-04 772.22 774.07 768.80 771.61 1140254
2016-08-05 773.78 783.04 772.34 782.22 1801205
2016-08-08 782.00 782.63 778.09 781.76 1107857
2016-08-09 781.10 788.94 780.57 784.26 1318894
... ... ... ... ... ...
2017-07-27 951.78 951.78 920.00 934.09 3212996
2017-07-28 929.40 943.83 927.50 941.53 1846351
2017-07-31 941.89 943.59 926.04 930.50 1970095
2017-08-01 932.38 937.45 929.26 930.83 1277734
2017-08-02 928.61 932.60 916.68 930.39 1824448
[252 rows x 5 columns]
In [41]: df.iat[0, df.columns.get_loc('Close')]
Out[41]: 773.17999999999995
Last row:
In [42]: df.iat[-1, df.columns.get_loc('Close')]
Out[42]: 930.38999999999999

Recommended
df.at[df.index[-1], 'Close']
df.iat[-1, df.columns.get_loc('Close')]
df.loc[df.index[-1], 'Close']
df.iloc[-1, df.columns.get_loc('Close')]
Not intended as public API, but works (get_value was deprecated in later pandas versions)
df.get_value(df.index[-1], 'Close')
df.get_value(-1, df.columns.get_loc('Close'), takeable=True)
Not recommended: chained indexing. There could be more, but do I really need to add them?
df.iloc[-1].at['Close']
df.loc[:, 'Close'].iat[-1]
All yield
34.869999999999997
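All of the above address the first or last row; for the original question's date-specific lookup, a DatetimeIndex lets you index by date directly. A minimal sketch on a toy frame shaped like the DataReader output above (the 'google' source itself has since been retired from pandas-datareader, so any DataFrame with a DatetimeIndex and a 'Close' column works the same way):

import pandas as pd

# Sample frame with a DatetimeIndex, mimicking the DataReader output above
df = pd.DataFrame(
    {'Close': [33.75, 34.25, 34.07]},
    index=pd.to_datetime(['2016-08-02', '2016-08-03', '2016-08-04']),
)
df.index.name = 'Date'

target = pd.Timestamp('2016-08-02')
print(df.at[target, 'Close'])  # 33.75 -- requires an exact index match

# If the target date may fall on a non-trading day, take the nearest
# earlier row instead of risking a KeyError:
pos = df.index.get_indexer([target], method='ffill')[0]
print(df['Close'].iloc[pos])   # 33.75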

Related

Slicing pandas dataframe via index datetime

I am trying to slice a pandas dataframe that was read from a CSV file, with the index set from the first column of dates.
IN:
df = pd.read_csv(r'E:\...\^d.csv')
df["Date"] = pd.to_datetime(df["Date"])
OUT:
Date Open High Low Close Volume
0 1920-01-02 9.52 9.52 9.52 9.52 NaN
1 1920-01-03 9.62 9.62 9.62 9.62 NaN
2 1920-01-05 9.57 9.57 9.57 9.57 NaN
3 1920-01-06 9.46 9.46 9.46 9.46 NaN
4 1920-01-07 9.47 9.47 9.47 9.47 NaN
Date Open High Low Close Volume
26798 2020-10-26 3441.42 3441.42 3364.86 3400.97 2.435787e+09
26799 2020-10-27 3403.15 3409.51 3388.71 3390.68 2.395102e+09
26800 2020-10-28 3342.48 3342.48 3268.89 3271.03 3.147944e+09
26801 2020-10-29 3277.17 3341.05 3259.82 3310.11 2.752626e+09
26802 2020-10-30 3293.59 3304.93 3233.94 3269.96 3.002804e+09
IN:
df = df.set_index(['Date'])
print("my index type is ")
print(df.index.dtype)
print(type(df.index)) #type of index
OUT:
Open High Low Close Volume
Date
2007-01-03 1418.03 1429.42 1407.86 1416.60 1.905089e+09
2007-01-04 1416.95 1421.84 1408.22 1418.34 1.669144e+09
2007-01-05 1418.34 1418.34 1405.75 1409.71 1.621889e+09
2007-01-08 1409.22 1414.98 1403.97 1412.84 1.535189e+09
2007-01-09 1412.85 1415.61 1405.42 1412.11 1.687989e+09
... ... ... ... ...
2009-12-24 1120.59 1126.48 1120.59 1126.48 7.042833e+08
2009-12-28 1126.48 1130.38 1123.51 1127.78 1.509111e+09
2009-12-29 1127.78 1130.38 1126.08 1126.19 1.383900e+09
2009-12-30 1126.19 1126.42 1121.94 1126.42 1.265167e+09
2009-12-31 1126.42 1127.64 1114.81 1115.10 1.153883e+09
my index type is
datetime64[ns]
<class 'pandas.core.indexes.datetimes.DatetimeIndex'>
I try to slice for Mondays using
monday_dow = df["Date"].dt.dayofweek==0
OUT (Spyder returns):
KeyError: 'Date'
I've read a lot of similar answers on Stack Overflow, but could not fix this. I understand I am doing something wrong with the index; should it be referenced another way?
You need to filter by DatetimeIndex.dayofweek (drop the .dt accessor, which is used only for columns):
monday_dow = df.index.dayofweek==0
So to get all matching rows:
df1 = df[monday_dow]
It is also possible to simplify the code by setting the DatetimeIndex directly in read_csv:
df = pd.read_csv(r'E:\...\^d.csv', index_col=['Date'], parse_dates=['Date'])
monday_dow = df.index.dayofweek==0
df1 = df[monday_dow]
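A minimal runnable sketch of the same fix, using generated data in place of the CSV path elided above:

import pandas as pd

# Small frame with a DatetimeIndex covering two business weeks
idx = pd.date_range('2020-10-19', '2020-10-30', freq='B')
df = pd.DataFrame({'Close': range(len(idx))}, index=idx)
df.index.name = 'Date'

# dayofweek: Monday == 0 ... Sunday == 6
monday_dow = df.index.dayofweek == 0
df1 = df[monday_dow]
print(df1)  # keeps only 2020-10-19 and 2020-10-26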

Pandas DataReader: normalizing dates

I use the pandas-datareader package to pull economic time series from websites like FRED and Yahoo Finance. I have pulled the US recession (USREC) series from FRED and historical S&P 500 (^GSPC) data from Yahoo Finance.
Historical US recession:
web.DataReader("USREC", "fred", start, end)
Output:
2017-08-01 0
2017-09-01 0
2017-10-01 0
2017-11-01 0
S&P500 returns
web.DataReader("^GSPC",'yahoo',start,end)['Close'].to_frame().resample('M').mean().round()
Output:
2017-08-31 2456.0
2017-09-30 2493.0
2017-10-31 2557.0
2017-11-30 2594.0
I want to merge the two data frames, but one has the beginning-of-month date and the other the end-of-month date. How do I a) make the date column yyyy-mm, or b) make the date columns of both frames either month-beginning or month-end?
Thanks for the help!
You can use MS to resample by start of month:
web.DataReader("^GSPC",'yahoo',start,end)['Close'].to_frame().resample('MS').mean().round()
Or it is possible to use to_period for a monthly PeriodIndex:
df1 = df1.to_period('M')
df2 = df2.to_period('M')
print (df1)
Close
2017-08 0
2017-09 0
2017-10 0
2017-11 0
print (df2)
Close
2017-08 2456.0
2017-09 2493.0
2017-10 2557.0
2017-11 2594.0
print (df1.index)
PeriodIndex(['2017-08', '2017-09', '2017-10', '2017-11'], dtype='period[M]', freq='M')
print (df2.index)
PeriodIndex(['2017-08', '2017-09', '2017-10', '2017-11'], dtype='period[M]', freq='M')
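Once both frames share a monthly PeriodIndex, the merge itself is a plain join. A minimal sketch with df1 and df2 rebuilt from the output above (column names are illustrative):

import pandas as pd

df1 = pd.DataFrame({'USREC': [0, 0, 0, 0]},
                   index=pd.period_range('2017-08', periods=4, freq='M'))
df2 = pd.DataFrame({'Close': [2456.0, 2493.0, 2557.0, 2594.0]},
                   index=pd.period_range('2017-08', periods=4, freq='M'))

merged = df1.join(df2)  # aligns on the shared PeriodIndex
print(merged)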

pick month start and end data in python

I have stock data downloaded from Yahoo Finance. I want to pick the rows corresponding to the start and end of each month using a pandas dataframe, but I have not found the right method to get them. I would be grateful if somebody could help me solve this.
Please note that if the 1st of the month is a holiday and there is no data for it, I need to pick the 2nd day's data instead. The same rule applies to the last day of the month. Thanks in advance.
Example data is
2016-01-05,222.80,222.80,217.00,217.75,15074800,217.75
2016-01-04,226.95,226.95,220.05,220.70,14092000,220.70
2015-12-31,225.95,226.55,224.00,224.45,11558300,224.45
2015-12-30,229.00,229.70,224.85,225.80,11702800,225.80
2015-12-29,228.85,229.95,227.50,228.20,7263200,228.20
2015-12-28,229.05,229.95,228.00,228.90,8756800,228.90
........
........
2015-12-04,240.00,242.15,238.05,241.10,11115100,241.10
2015-12-03,244.15,244.50,240.40,241.10,7155600,241.10
2015-12-02,250.55,250.65,243.75,244.60,10881700,244.60
2015-11-30,249.65,253.00,245.00,250.20,12865400,250.20
2015-11-27,243.00,250.50,242.80,249.70,15149900,249.70
2015-11-26,241.95,244.90,241.00,242.50,13629800,242.50
First, convert your date column to datetime format, then group by month, sort each group by date, and take the first/last row using the head/tail methods, like so:
In [37]: df
Out[37]:
0 1 2 3 4 5 6
0 2016-01-05 222.80 222.80 217.00 217.75 15074800 217.75
1 2016-01-04 226.95 226.95 220.05 220.70 14092000 220.70
2 2015-12-31 225.95 226.55 224.00 224.45 11558300 224.45
3 2015-12-30 229.00 229.70 224.85 225.80 11702800 225.80
4 2015-12-29 228.85 229.95 227.50 228.20 7263200 228.20
5 2015-12-28 229.05 229.95 228.00 228.90 8756800 228.90
In [25]: import datetime
In [29]: df[0] = df[0].apply(lambda x: datetime.datetime.strptime(x, '%Y-%m-%d'))
In [36]: df.groupby(df[0].apply(lambda x: x.month)).apply(lambda x: x.sort_values(0).head(1))
Out[36]:
0 1 2 3 4 5 6
0
1 1 2016-01-04 226.95 226.95 220.05 220.7 14092000 220.7
12 5 2015-12-28 229.05 229.95 228.00 228.9 8756800 228.9
In [38]: df.groupby(df[0].apply(lambda x: x.month)).apply(lambda x: x.sort_values(0).tail(1))
Out[38]:
0 1 2 3 4 5 6
0
1 0 2016-01-05 222.80 222.80 217.0 217.75 15074800 217.75
12 2 2015-12-31 225.95 226.55 224.0 224.45 11558300 224.45
You can combine the resulting dataframes using pd.concat().
For the first / last day of each month, you can use .resample() with 'BMS' and 'BM' for Business Month (Start) like so (using pandas 0.18 syntax):
df.resample('BMS').first()
df.resample('BM').last()
This assumes that your data has a DatetimeIndex, as is usual when downloaded from Yahoo using pandas_datareader:
from datetime import datetime
from pandas_datareader.data import DataReader
df = DataReader('FB', 'yahoo', datetime(2015, 1, 1), datetime(2015, 3, 31))['Open']
df.head()
Date
2015-01-02 78.580002
2015-01-05 77.980003
2015-01-06 77.230003
2015-01-07 76.760002
2015-01-08 76.739998
Name: Open, dtype: float64
df.tail()
Date
2015-03-25 85.500000
2015-03-26 82.720001
2015-03-27 83.379997
2015-03-30 83.809998
2015-03-31 82.900002
Name: Open, dtype: float64
do:
df.resample('BMS').first()
Date
2015-01-01 78.580002
2015-02-02 76.110001
2015-03-02 79.000000
Freq: BMS, Name: Open, dtype: float64
and
df.resample('BM').last()
to get:
Date
2015-01-30 78.000000
2015-02-27 80.680000
2015-03-31 82.900002
Freq: BM, Name: Open, dtype: float64
Assuming you have downloaded data from Yahoo:
> import pandas_datareader.data as web  # formerly pandas.io.data, which was removed from pandas
> import datetime
> start = datetime.datetime(2016,1,1)
> end = datetime.datetime(2016,5,1)
> df = web.DataReader("AAPL", "yahoo", start, end)
You simply pick the month end and start rows with:
df[df.index.is_month_end]
df[df.index.is_month_start]
If you want to access a specific row, e.g. the first of the selected month-start rows, you simply do:
df[df.index.is_month_start].iloc[0]
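Note that is_month_start and is_month_end match calendar month boundaries only, so they skip months whose first or last calendar day has no trading data (the exact case the question raises). A small sketch of an alternative that always keeps the first and last available row per month, assuming an ascending DatetimeIndex:

import pandas as pd

# Toy frame with an ascending DatetimeIndex of trading days
idx = pd.to_datetime(['2015-12-28', '2015-12-31', '2016-01-04', '2016-01-05'])
df = pd.DataFrame({'Close': [228.90, 224.45, 220.70, 217.75]}, index=idx)

# Group by (year, month) so Decembers of different years stay separate
by_month = df.groupby([df.index.year, df.index.month])
print(by_month.head(1))  # first available trading day of each month
print(by_month.tail(1))  # last available trading day of each month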

Pandas: decompress date range to individual dates

Dataset: I have a 1GB dataset of stocks, each with values over date ranges. There is no overlap in the date ranges, and the dataset is sorted on (ticker, start_date).
>>> df.head()
start_date end_date val
ticker
AAPL 2014-05-01 2014-05-01 10.0000000000
AAPL 2014-06-05 2014-06-10 20.0000000000
GOOG 2014-06-01 2014-06-15 50.0000000000
MSFT 2014-06-16 2014-06-16 None
TWTR 2014-01-17 2014-05-17 10.0000000000
Goal: I want to decompress the dataframe so that I have individual dates instead of date ranges. For example, the AAPL rows would go from being only 2 rows to 7 rows:
>>> AAPL_decompressed.head()
val
date
2014-05-01 10.0000000000
2014-06-05 20.0000000000
2014-06-06 20.0000000000
2014-06-07 20.0000000000
2014-06-08 20.0000000000
I'm hoping there's a nice optimized method from pandas like resample that can do this in a couple lines.
A bit more than a few lines, but I think it produces what you asked for:
Starting with your dataframe:
In [70]: df
Out[70]:
start_date end_date val row
ticker
AAPL 2014-05-01 2014-05-01 10 0
AAPL 2014-06-05 2014-06-10 20 1
GOOG 2014-06-01 2014-06-15 50 2
MSFT 2014-06-16 2014-06-16 NaN 3
TWTR 2014-01-17 2014-05-17 10 4
First I reshape this dataframe into one with a single date column, so every row is repeated twice, once for start_date and once for end_date (I also add a counter column called row):
In [60]: df['row'] = range(len(df))
In [61]: starts = df[['start_date', 'val', 'row']].rename(columns={'start_date': 'date'})
In [62]: ends = df[['end_date', 'val', 'row']].rename(columns={'end_date':'date'})
In [63]: df_decomp = pd.concat([starts, ends])
In [64]: df_decomp = df_decomp.set_index('row', append=True)
In [65]: df_decomp.sort_index()
Out[65]:
date val
ticker row
AAPL 0 2014-05-01 10
0 2014-05-01 10
1 2014-06-05 20
1 2014-06-10 20
GOOG 2 2014-06-01 50
2 2014-06-15 50
MSFT 3 2014-06-16 NaN
3 2014-06-16 NaN
TWTR 4 2014-01-17 10
4 2014-05-17 10
Based on this new dataframe, I can group it by ticker and row, apply a daily resample to each of these groups, and forward-fill the NaNs (fillna with method 'pad'):
In [66]: df_decomp = df_decomp.groupby(level=[0,1]).apply(lambda x: x.set_index('date').resample('D').fillna(method='pad'))
In [67]: df_decomp = df_decomp.reset_index(level=1, drop=True)
The last command drops the now-superfluous row index level.
When we access the AAPL rows, it gives your desired output:
In [69]: df_decomp.loc['AAPL']
Out[69]:
val
date
2014-05-01 10
2014-06-05 20
2014-06-06 20
2014-06-07 20
2014-06-08 20
2014-06-09 20
2014-06-10 20
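For reference, on recent pandas versions (0.25+, where DataFrame.explode was introduced) the same decompression can be written more compactly by building a date list per row and exploding it. A minimal sketch on invented data, not the author's method:

import pandas as pd

df = pd.DataFrame({
    'ticker': ['AAPL', 'AAPL', 'GOOG'],
    'start_date': pd.to_datetime(['2014-05-01', '2014-06-05', '2014-06-01']),
    'end_date': pd.to_datetime(['2014-05-01', '2014-06-10', '2014-06-15']),
    'val': [10.0, 20.0, 50.0],
})

# One row per calendar day in each [start_date, end_date] range
df['date'] = [pd.date_range(s, e).tolist()
              for s, e in zip(df['start_date'], df['end_date'])]
decompressed = (df.explode('date')
                  .drop(columns=['start_date', 'end_date'])
                  .set_index(['ticker', 'date']))
print(decompressed.loc['AAPL'])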
I think you can do this in five steps:
1) filter the ticker column to find the stock you want
2) use pandas.bdate_range to build a list of date ranges between start and end
3) flatten this list using reduce
4) reindex your new filtered dataframe
5) fill NaNs using the pad method
Here is the code:
>>> import pandas as pd
>>> import datetime
>>> data = [('AAPL', datetime.date(2014, 4, 28), datetime.date(2014, 5, 2), 90),
('AAPL', datetime.date(2014, 5, 5), datetime.date(2014, 5, 9), 80),
('MSFT', datetime.date(2014, 5, 5), datetime.date(2014, 5, 9), 150),
('AAPL', datetime.date(2014, 5, 12), datetime.date(2014, 5, 16), 85)]
>>> df = pd.DataFrame(data=data, columns=['ticker', 'start', 'end', 'val'])
>>> df_new = df[df['ticker'] == 'AAPL']
>>> df_new.name = 'AAPL'
>>> df_new.index = df_new['start']
>>> df_new.index.name = 'date'
>>> df_new.index = pd.to_datetime(df_new.index)
>>> from functools import reduce #for py3k only
>>> new_index = [pd.bdate_range(**d) for d in df_new[['start','end']].to_dict('records')]
>>> new_index_flat = reduce(pd.DatetimeIndex.append, new_index)
>>> df_new = df_new.reindex(new_index_flat)
>>> df_new = df_new.fillna(method='pad')
>>> df_new
ticker start end val
2014-04-28 AAPL 2014-04-28 2014-05-02 90
2014-04-29 AAPL 2014-04-28 2014-05-02 90
2014-04-30 AAPL 2014-04-28 2014-05-02 90
2014-05-01 AAPL 2014-04-28 2014-05-02 90
2014-05-02 AAPL 2014-04-28 2014-05-02 90
2014-05-05 AAPL 2014-05-05 2014-05-09 80
2014-05-06 AAPL 2014-05-05 2014-05-09 80
2014-05-07 AAPL 2014-05-05 2014-05-09 80
2014-05-08 AAPL 2014-05-05 2014-05-09 80
2014-05-09 AAPL 2014-05-05 2014-05-09 80
2014-05-12 AAPL 2014-05-12 2014-05-16 85
2014-05-13 AAPL 2014-05-12 2014-05-16 85
2014-05-14 AAPL 2014-05-12 2014-05-16 85
2014-05-15 AAPL 2014-05-12 2014-05-16 85
2014-05-16 AAPL 2014-05-12 2014-05-16 85
[15 rows x 4 columns]
Hope it helps!
Here is a hacky way to do it. I post this poor answer (remember, I can't code :-) ) because I am new to pandas and would not mind someone improving it.
This reads the file with the data originally posted, then creates a MultiIndex from the stock_id and the end_date. The lookup below takes the entire frame, a ticker such as 'AAPL', and a date, and uses index.searchsorted, which behaves like map::upper_bound in C++: it finds the position where the date would be inserted, i.e. the end_date closest to, but after, the date in question. That row holds the value we want, and we return it.
Then I grab a cross section from a Series with this MultiIndex, based on the stock_id 'AAPL'. I then build a list to flatten the tuples of dates from the MultiIndex under the 'AAPL' key; these dates become the index and values of a Series, which I map through the lookup to grab the desired stock price.
I know this is probably wrong... but I'm happy to learn.
I would not be surprised to find out that there is an easy way to inflate a data frame like this using some forward-fill interpolation method...
stocks = pd.read_csv('stocks2.csv', parse_dates=['start_date', 'end_date'], index_col='ticker')
mi = list(zip(stocks.index,
              pd.Series(list(zip(stocks['start_date'], stocks['end_date'])))
                .map(lambda z: tuple(pd.date_range(start=z[0], end=z[1])))))
mi = pd.MultiIndex.from_tuples(mi)
ticker = 'AAPL'
s = pd.Series(index=mi, data=0)
s = list(s.xs(key=ticker).index)
l = []
for x in s:  # flatten the tuples of dates (map() is lazy in Python 3)
    l.extend(x)
s = pd.Series(index=l, data=l)
stocks_byticker = stocks[stocks.index == ticker].set_index('end_date')
print(s.map(lambda x: stocks_byticker['val'].iloc[stocks_byticker.index.searchsorted(x)]))
2014-05-01 10
2014-06-05 20
2014-06-06 20
2014-06-07 20
2014-06-08 20
2014-06-09 20
2014-06-10 20
Here's a slightly more general way to do this that expands on joris's good answer, but allows this to work with any number of additional columns:
import pandas as pd

df['join_id'] = range(len(df))
starts = df[['start_date', 'join_id']].rename(columns={'start_date': 'date'})
ends = df[['end_date', 'join_id']].rename(columns={'end_date': 'date'})
start_end = pd.concat([starts, ends]).set_index('date')
# Group on join_id (the counter added above), resample daily, forward-fill
fact_table = start_end.groupby('join_id').apply(lambda x: x.resample('D').fillna(method='pad'))
del fact_table['join_id']  # drop the padded duplicate; the key now lives in the index
fact_table = fact_table.reset_index()
final = fact_table.merge(df, on='join_id', how='left')

Finding the min date in a Pandas DF row and creating a new column

I have a table with a number of date columns (some dates will be NaN) and I need to find the oldest date.
A row may have DATE_MODIFIED, WITHDRAWN_DATE, SOLD_DATE, STATUS_DATE, etc.
For each row there will be a date in one or more of the fields; I want to find the oldest of those and put it in a new column in the dataframe.
Something like this: if I use just one column, e.g. DATE_MODIFIED, I get a result, but when I add the second as below
table['END_DATE']=min([table['DATE_MODIFIED']],[table['SOLD_DATE']])
I get:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
For that matter, will this construct work to find the min date, assuming I create correct date columns initially?
Just apply the min function along axis=1.
In [1]: import pandas as pd
In [2]: df = pd.read_csv('test.cvs', parse_dates=['d1', 'd2', 'd3'])
In [3]: df.loc[2, 'd1'] = None
In [4]: df.loc[1, 'd2'] = None
In [5]: df.loc[4, 'd3'] = None
In [6]: df
Out[6]:
d1 d2 d3
0 2013-02-07 00:00:00 2013-03-08 00:00:00 2013-05-21 00:00:00
1 2013-02-07 00:00:00 NaT 2013-05-21 00:00:00
2 NaT 2013-03-02 00:00:00 2013-05-21 00:00:00
3 2013-02-04 00:00:00 2013-03-08 00:00:00 2013-01-04 00:00:00
4 2013-02-01 00:00:00 2013-03-06 00:00:00 NaT
In [7]: df.min(axis=1)
Out[7]:
0 2013-02-07 00:00:00
1 2013-02-07 00:00:00
2 2013-03-02 00:00:00
3 2013-01-04 00:00:00
4 2013-02-01 00:00:00
dtype: datetime64[ns]
If table is your DataFrame, then use its min method on the relevant columns:
table['END_DATE'] = table[['DATE_MODIFIED','SOLD_DATE']].min(axis=1)
A slight variation on Felix Zumstein's answer:
table['END_DATE'] = table[['DATE_MODIFIED','SOLD_DATE']].min(axis=1).astype('datetime64[ns]')
The astype('datetime64[ns]') is necessary in the current version of pandas (July 2015) to avoid getting a float64 representation of the dates.
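A minimal, self-contained sketch tying the thread together (column names from the question; the data is invented for illustration):

import pandas as pd

table = pd.DataFrame({
    'DATE_MODIFIED': pd.to_datetime(['2013-02-07', None, '2013-02-04']),
    'SOLD_DATE': pd.to_datetime(['2013-03-08', '2013-03-02', None]),
})

# Row-wise minimum; NaT entries are skipped automatically
table['END_DATE'] = table[['DATE_MODIFIED', 'SOLD_DATE']].min(axis=1)
print(table)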
