I have a pd.DataFrame which has a Datetime index. When I select a particular Datetime and there are two matches, it correctly keeps Date as the index:
a = df.loc[date]
a
Symbol Order Shares
Date
2011-01-10 AAPL BUY 1500
2011-01-10 AAPL SELL 1500
However, when there is only one match for that date, it returns a Series with the columns as the index:
orders
Symbol GOOG
Order BUY
Shares 1000
Name: 2011-01-26 00:00:00, dtype: object
How can I force it to take Date as the index all the time? I don't have a name for the index.
You can use a one-element list:
a = df.loc[[date]]
Sample:
print (df)
Symbol Order Shares
Date
2011-01-09 AAPL BUY 1500
2011-01-10 AAPL BUY 1500
2011-01-10 AAPL SELL 1500
date = '2011-01-09'
a = df.loc[[date]]
print (a)
Symbol Order Shares
Date
2011-01-09 AAPL BUY 1500
date = '2011-01-10'
a = df.loc[[date]]
print (a)
Symbol Order Shares
Date
2011-01-10 AAPL BUY 1500
2011-01-10 AAPL SELL 1500
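An alternative worth mentioning (a sketch, not part of the answer above) is a label slice with the same date on both ends; slicing a sorted DatetimeIndex always returns a DataFrame, even when only a single row matches:
# a slice keeps the DataFrame shape even for one matching row
a = df.loc[date:date]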
I have a sales dataset (simplified) with sales from existing customers (first_order = 0):
import pandas as pd
import datetime as dt
df = pd.DataFrame({'Date':['2020-06-30 00:00:00','2020-05-05 00:00:00','2020-04-10 00:00:00','2020-02-26 00:00:00'],
'email':['1#abc.de','2#abc.de','3#abc.de','1#abc.de'],
'first_order':[1,1,1,1],
'Last_Order_Date':['2020-06-30 00:00:00','2020-05-05 00:00:00','2020-04-10 00:00:00','2020-02-26 00:00:00']
})
I would like to analyze how many existing customers we lose per month.
My idea is to:
group (count) by month, and
then count how many made their last purchase in the following months, which gives me a churn cross table where I can see that, e.g., we had 300 purchases in January, and 10 of them bought for the last time in February.
Like this:
Column B is the total number of repeat customers, and column C onward is the last month they bought something.
E.g. we had 2400 customers in January; 677 of them made their last purchase that month, 203 more followed in February, etc.
I guess I could first group the total number of sales per month and then group a second dataset by Last_Order_Date and filter by month,
but I guess there is a handier Python way?! :)
Any ideas?
Thanks!
The code below helps you identify how many purchases are made in each month.
df['Date'] = pd.to_datetime(df['Date'])
df = df.set_index('Date')
df.groupby(pd.Grouper(freq="M")).size()
Output:
Date
2020-02-29 1
2020-03-31 0
2020-04-30 1
2020-05-31 1
2020-06-30 1
Freq: M, dtype: int64
I couldn't clearly work out the required output from the data and explanation given, but this could be your starting point. Please let me know if it helped you in any way.
Update-1:
df.pivot_table(index='Date', columns='email', values='Last_Order_Date', aggfunc='count')
Output:
email 1#abc.de 2#abc.de 3#abc.de
Date
2020-02-26 00:00:00 1.0 NaN NaN
2020-04-10 00:00:00 NaN NaN 1.0
2020-05-05 00:00:00 NaN 1.0 NaN
2020-06-30 00:00:00 1.0 NaN NaN
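For the churn cross table itself, here is a minimal sketch of one possible approach (my suggestion, not part of the answer above, and assuming Last_Order_Date already holds each customer's final order date): cross-tabulate the order month against the last-order month.
df['Date'] = pd.to_datetime(df['Date'])
df['Last_Order_Date'] = pd.to_datetime(df['Last_Order_Date'])
# rows: month of the purchase, columns: month of that customer's last order
churn = pd.crosstab(df['Date'].dt.to_period('M'),
                    df['Last_Order_Date'].dt.to_period('M'))
Each row then shows, for the purchases of a given month, in which later month those customers bought for the last time.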
I use the pandas-datareader package to pull economic time series from websites like FRED and Yahoo Finance. I have pulled the US recession (USREC) series from FRED and the historical S&P 500 (^GSPC) from Yahoo Finance.
Historical US recession:
web.DataReader("USREC", "fred", start, end)
Output:
2017-08-01 0
2017-09-01 0
2017-10-01 0
2017-11-01 0
S&P 500 returns:
web.DataReader("^GSPC",'yahoo',start,end)['Close'].to_frame().resample('M').mean().round()
Output:
2017-08-31 2456.0
2017-09-30 2493.0
2017-10-31 2557.0
2017-11-30 2594.0
I want to merge the two data frames, but one has the beginning date of the month and the other has the ending date of the month. How do I either a) make the date column yyyy-mm, or b) make the date columns of both frames month beginning or month end?
Thanks for the help!
You can use MS to resample by start of month:
web.DataReader("^GSPC",'yahoo',start,end)['Close'].to_frame().resample('MS').mean().round()
Or it is possible to use to_period for a monthly PeriodIndex:
df1 = df1.to_period('M')
df2 = df2.to_period('M')
print (df1)
USREC
2017-08 0
2017-09 0
2017-10 0
2017-11 0
print (df2)
Close
2017-08 2456.0
2017-09 2493.0
2017-10 2557.0
2017-11 2594.0
print (df1.index)
PeriodIndex(['2017-08', '2017-09', '2017-10', '2017-11'], dtype='period[M]', freq='M')
print (df2.index)
PeriodIndex(['2017-08', '2017-09', '2017-10', '2017-11'], dtype='period[M]', freq='M')
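Once both frames share a monthly PeriodIndex, joining them is straightforward. A sketch (the suffixes are only a guard in case both frames use the same column name):
# align the two frames on the shared monthly PeriodIndex
merged = df1.join(df2, lsuffix='_usrec', rsuffix='_sp500')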
I have the following code:
import pandas as pd
import pandas_datareader.data as web
pdata = pd.Panel(dict((stk, web.get_data_yahoo(stk, '1/1/2009', '6/1/2012'))
for stk in ['AAPL', 'GOOG', 'MSFT']))
pdata
<class 'pandas.core.panel.Panel'>
Dimensions: 6 (items) x 861 (major_axis) x 3 (minor_axis)
Items axis: Open to Volume
Major_axis axis: 2009-01-02 00:00:00 to 2012-06-01 00:00:00
Minor_axis axis: AAPL to MSFT
AAPL GOOG MSFT
Date minor
2009-01-02 Open 12.268572 153.302917 19.530001
High 13.005714 159.870193 20.400000
Low 12.165714 151.762924 19.370001
Close 12.964286 159.621811 20.330000
Adj Close 11.621618 159.621811 16.140903
Normally the following method would give me what I need:
pdata = pdata.swapaxes('items', 'minor')
And I get the following warning:
Panel is deprecated and will be removed in a future version.
The recommended way to represent these types of 3-dimensional data are
with a MultiIndex on a DataFrame, via the Panel.to_frame() method
My objective is to have a DataFrame in the form of a Panel, using the date and stock ticker as major and minor row indices, and the Open price, etc. as the columns, like this:
minor Open High Low Close Adj Close
Date
2009-01-02 AAPL 12.268572 13.005714 12.165714 12.964286 11.621618
GOOG 153.302917 ... ... ... ...
MSFT 19.530001 ... ... ... ...
I did convert the Panel object into a DataFrame and tried to use the pivot_table or set_index methods, but I can't get the stock tickers to be the inner row index. When I use the swapaxes method on the DataFrame, the Date is also swapped to the columns. Is there any easy way I can get the format I need?
Option 1
unstack + swaplevel + sort_index
pdata.to_frame().unstack(0).T\
.swaplevel(0, 1).sort_index(level=[0]).head(6)
minor Open High Low Close Adj Close \
Date
2009-01-02 AAPL 12.268572 13.005714 12.165714 12.964286 11.621618
GOOG 153.302917 159.870193 151.762924 159.621811 159.621811
MSFT 19.530001 20.400000 19.370001 20.330000 16.140903
2009-01-05 AAPL 13.310000 13.740000 13.244286 13.511429 12.112095
GOOG 159.462845 164.549759 156.482239 162.965073 162.965073
MSFT 20.200001 20.670000 20.059999 20.520000 16.291746
minor Volume
Date
2009-01-02 AAPL 186503800.0
GOOG 7267900.0
MSFT 50084000.0
2009-01-05 AAPL 295402100.0
GOOG 9841400.0
MSFT 61475200.0
Option 2
Wen's wonderful stack equivalent.
pdata.to_frame().stack().unstack(-2).head(6)
minor Open High Low Close Adj Close \
Date
2009-01-02 AAPL 12.268572 13.005714 12.165714 12.964286 11.621618
GOOG 153.302917 159.870193 151.762924 159.621811 159.621811
MSFT 19.530001 20.400000 19.370001 20.330000 16.140903
2009-01-05 AAPL 13.310000 13.740000 13.244286 13.511429 12.112095
GOOG 159.462845 164.549759 156.482239 162.965073 162.965073
MSFT 20.200001 20.670000 20.059999 20.520000 16.291746
minor Volume
Date
2009-01-02 AAPL 186503800.0
GOOG 7267900.0
MSFT 50084000.0
2009-01-05 AAPL 295402100.0
GOOG 9841400.0
MSFT 61475200.0
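Since Panel is deprecated, a Panel-free alternative (a sketch of the same result, not from the answers above) is to build the MultiIndex frame directly with pd.concat:
import pandas as pd
import pandas_datareader.data as web
# concatenating a dict of frames puts the tickers into an outer index level
frames = {stk: web.get_data_yahoo(stk, '1/1/2009', '6/1/2012')
          for stk in ['AAPL', 'GOOG', 'MSFT']}
df = pd.concat(frames)
df.index.names = ['minor', 'Date']
# swap to (Date, ticker) so the tickers become the inner row index
df = df.swaplevel(0, 1).sort_index(level=0)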
I have the dataframe below, which holds stock values for about 200 companies. I am trying to find a way to loop over it and build a new dataframe that includes each company's yearly features.
Date Symbol Open High Low Close Volume Daily Return
2016-01-04 AAPL 102.61 105.37 102.00 105.35 67281190 0.025703
2016-01-05 AAPL 105.75 105.85 102.41 102.71 55790992 0.019960
2016-12-28 AMZN 776.25 780.00 770.50 772.13 3301025 0.009122
2016-12-29 AMZN 772.40 773.40 760.85 765.15 3158299 0.020377
I have tried different ways; the closest I have come is:
stocks_features = pd.DataFrame(data=stocks_data.Symbol.unique(), columns = ['Symbol'])
stocks_features['Max_Yearly_Price'] = stocks_data['High'].max()
stocks_features['Min_Yearly_Price'] = stocks_data['Low'].min()
stocks_features
But it gives me the same values for all stocks:
Symbol Max_Yearly_Price Min_Yearly_Price
AAPL 847.21 89.47
AMZN 847.21 89.47
What am I doing wrong, and how can I accomplish this?
By using groupby and agg:
df.groupby('Symbol').agg({'High':'max','Low':'min'}).\
rename(columns={'High':'Max_Yearly_Price','Low':'Min_Yearly_Price'})
Out[861]:
Max_Yearly_Price Min_Yearly_Price
Symbol
AAPL 105.85 102.00
AMZN 780.00 760.85
Wen's answer is great as well. I had a different way of solving it. I'll explain as I go along:
# creates a dictionary of all the symbols and their max values
value_maps = dict(stocks_features.loc[stocks_features.\
groupby('Symbol').High.agg('idxmax')][['Symbol', 'High']].values)
# sets Max_Yearly_Price equal to the symbol
stocks_features['Max_Yearly_Price'] = stocks_features['Symbol']
# replaces the symbol with the corresponding value from the dictionary
stocks_features['Max_Yearly_Price'] = stocks_features['Max_Yearly_Price'].map(value_maps)
# output
Date Symbol Open High Low Close Volume Daily Return Max_Yearly_Price
0 2016-01-04 AAPL 102.61 105.37 102.00 105.35 67281190 0.025703 105.85
1 2016-01-05 AAPL 105.75 105.85 102.41 102.71 55790992 0.019960 105.85
2 2016-12-28 AMZN 776.25 780.00 770.50 772.13 3301025 0.009122 780.00
3 2016-12-29 AMZN 772.40 773.40 760.85 765.15 3158299 0.020377 780.00
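For what it's worth, the dictionary-and-map detour can also be collapsed into a single groupby transform (a sketch of the same idea, not part of the answer above):
# broadcast each symbol's max/min back onto its own rows
stocks_features['Max_Yearly_Price'] = stocks_features.groupby('Symbol')['High'].transform('max')
stocks_features['Min_Yearly_Price'] = stocks_features.groupby('Symbol')['Low'].transform('min')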
I have some time series data (financial stock trading data):
TIMESTAMP PRICE VOLUME
1294311545 24990 1500000000
1294317813 25499 5000000000
1294318449 25499 100000000
I need to convert them to OHLC values (a JSON list) based on the price column, i.e. (open, high, low, close), and show that as an OHLC graph with the Highstock JS framework.
The output should be as following:
[{'time':'2013-09-01','open':24999,'high':25499,'low':24999,'close':25000,'volume':15000000},
{'time':'2013-09-02','open':24900,'high':25600,'low':24800,'close':25010,'volume':16000000},
{...}]
For example, if my sample has 10 rows for day 2013-09-01, the output will have one object for that day, where high is the highest price of all 10 rows, low is the lowest price, open is the first price of the day, close is the last price of that day, and volume is the TOTAL volume of all 10 rows.
I know the Python library pandas can probably do that, but I still could not work it out.
Updated: As suggested, I used resample() as:
df['VOLUME'].resample('H', how='sum')
df['PRICE'].resample('H', how='ohlc')
But how do I merge the results?
At the moment you can only perform ohlc on a column/Series (will be fixed in 0.13).
First, coerce the TIMESTAMP column to pandas Timestamps:
In [11]: df.TIMESTAMP = pd.to_datetime(df.TIMESTAMP, unit='s')
In [12]: df.set_index('TIMESTAMP', inplace=True)
In [13]: df
Out[13]:
PRICE VOLUME
TIMESTAMP
2011-01-06 10:59:05 24990 1500000000
2011-01-06 12:43:33 25499 5000000000
2011-01-06 12:54:09 25499 100000000
Then resample via ohlc (here I've resampled by hour):
In [14]: df['VOLUME'].resample('H', how='ohlc')
Out[14]:
open high low close
TIMESTAMP
2011-01-06 10:00:00 1500000000 1500000000 1500000000 1500000000
2011-01-06 11:00:00 NaN NaN NaN NaN
2011-01-06 12:00:00 5000000000 5000000000 100000000 100000000
In [15]: df['PRICE'].resample('H', how='ohlc')
Out[15]:
open high low close
TIMESTAMP
2011-01-06 10:00:00 24990 24990 24990 24990
2011-01-06 11:00:00 NaN NaN NaN NaN
2011-01-06 12:00:00 25499 25499 25499 25499
You can apply to_json to any DataFrame:
In [16]: df['PRICE'].resample('H', how='ohlc').to_json()
Out[16]: '{"open":{"1294308000000000000":24990.0,"1294311600000000000":null,"1294315200000000000":25499.0},"high":{"1294308000000000000":24990.0,"1294311600000000000":null,"1294315200000000000":25499.0},"low":{"1294308000000000000":24990.0,"1294311600000000000":null,"1294315200000000000":25499.0},"close":{"1294308000000000000":24990.0,"1294311600000000000":null,"1294315200000000000":25499.0}}'
*This would probably be a straightforward enhancement for a DataFrame; at the moment it's NotImplemented.
Updated: your desired output (or at least something very close to it) can be achieved as follows:
In [21]: price = df['PRICE'].resample('D', how='ohlc').reset_index()
In [22]: price
Out[22]:
TIMESTAMP open high low close
0 2011-01-06 00:00:00 24990 25499 24990 25499
Use the records orientation and the iso date_format:
In [23]: price.to_json(date_format='iso', orient='records')
Out[23]: '[{"TIMESTAMP":"2011-01-06T00:00:00.000Z","open":24990,"high":25499,"low":24990,"close":25499}]'
In [24]: price.to_json('foo.json', date_format='iso', orient='records') # save as json file
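To also answer the "how do I merge the results?" part of the question, a sketch on modern pandas (where the how= keyword is gone and the aggregations are called as methods) could look like this:
# resample price to OHLC and volume to a daily sum, then glue them together
ohlc = df['PRICE'].resample('D').ohlc()
vol = df['VOLUME'].resample('D').sum().rename('volume')
out = pd.concat([ohlc, vol], axis=1).reset_index()
out.to_json(date_format='iso', orient='records')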