set all matching elements in a multiindex dataframe to a series - python

Previously I have been using a pandas.Panel to store multiple dataframes, one per date in a list of dates.
Since the deprecation of panels, I am trying to convert to using a multiindex dataframe.
As an example, I have the following data:
dates = pandas.date_range('20180101', periods=3)
stocks = ['AAPL', 'GOOG', 'MSFT', 'AMZN', 'FB']
Before the deprecation, I could create a panel as follows:
pnl = pandas.Panel(items=dates, major_axis=stocks, minor_axis=stocks, dtype=float)
I now have 1 dataframe per date, for example, selecting the first:
pnl['2018-01-01']
returns a 5x5 dataframe of NaNs, with the stocks as both its index and its columns.
Now, however, as per the advice in the deprecation warning, I am creating a multiindex dataframe:
tuples = list(itertools.product(dates, stocks))
index = pandas.MultiIndex.from_tuples(tuples, names=['date', 'stock'])
df = pandas.DataFrame(index=index, columns=stocks, dtype=float)
The resulting dataframe is indexed by (date, stock) pairs and has one column per stock, with all values NaN.
So far so good...
Populating the dataframe:
I have a pandas.Series of data for a given stock pair, with one entry per date.
For example:
data = pandas.Series([1.3, 7.4, 8.2], index=dates)
The series looks like this:
2018-01-01 1.3
2018-01-02 7.4
2018-01-03 8.2
Freq: D, dtype: float64
Say, for example, this data is for stock pair ['GOOG','MSFT'].
I would like to set all ['GOOG','MSFT'] entries.
With my panel, I could very easily do this using the following terse syntax:
pnl.loc[:,'GOOG','MSFT'] = data
What is the easiest way to select all ['GOOG','MSFT'] elements from my multiindex dataframe, and set them to my pandas.Series object (i.e. date by date)?

Using pd.DataFrame.loc & pd.IndexSlice:
df.loc[pd.IndexSlice[data.index, 'GOOG'], 'MSFT'] = data.values
If you have many pairs of data, put them in a dictionary like this:
pairs = {('GOOG', 'MSFT'): data}
Then iterate through the pairs, setting the value using loc & pd.IndexSlice.
for k, v in pairs.items():
    df.loc[pd.IndexSlice[v.index, k[0]], k[1]] = v.values
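For instance, with a second, hypothetical series data2 (made-up values) holding data for the ('AAPL', 'FB') pair over the same dates:
data2 = pd.Series([2.1, 4.0, 5.5], index=dates)
pairs = {('GOOG', 'MSFT'): data, ('AAPL', 'FB'): data2}
The loop then writes data into the 'MSFT' column of the 'GOOG' rows, and data2 into the 'FB' column of the 'AAPL' rows.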
As an alternative to IndexSlice, you can build a boolean mask on the multiindex using the index method get_level_values:
df.loc[(df.index.get_level_values(1) == 'GOOG') &
       (df.index.get_level_values(0).isin(data.index)),
       'MSFT'] = data.values
All of the above would produce the following output:
AAPL GOOG MSFT AMZN FB
date stock
2018-01-01 AAPL NaN NaN NaN NaN NaN
GOOG NaN NaN 1.3 NaN NaN
MSFT NaN NaN NaN NaN NaN
AMZN NaN NaN NaN NaN NaN
FB NaN NaN NaN NaN NaN
2018-01-02 AAPL NaN NaN NaN NaN NaN
GOOG NaN NaN 7.4 NaN NaN
MSFT NaN NaN NaN NaN NaN
AMZN NaN NaN NaN NaN NaN
FB NaN NaN NaN NaN NaN
2018-01-03 AAPL NaN NaN NaN NaN NaN
GOOG NaN NaN 8.2 NaN NaN
MSFT NaN NaN NaN NaN NaN
AMZN NaN NaN NaN NaN NaN
FB NaN NaN NaN NaN NaN
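For reference, a minimal self-contained sketch tying the question's setup to the first assignment above (nothing new here beyond the imports):
import itertools
import pandas as pd
dates = pd.date_range('20180101', periods=3)
stocks = ['AAPL', 'GOOG', 'MSFT', 'AMZN', 'FB']
index = pd.MultiIndex.from_tuples(list(itertools.product(dates, stocks)),
                                  names=['date', 'stock'])
df = pd.DataFrame(index=index, columns=stocks, dtype=float)
data = pd.Series([1.3, 7.4, 8.2], index=dates)
# write the series into the 'MSFT' column of every (date, 'GOOG') row
df.loc[pd.IndexSlice[data.index, 'GOOG'], 'MSFT'] = data.values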

Related

Strip index as Pandas column

I have a table with a single index column looking like this:
1 2 3
Monday_0 NaN NaN NaN
Monday_1 NaN NaN NaN
Tuesday_2 NaN NaN NaN
Tuesday_3 NaN NaN NaN
I want to keep the index, but want the first part of the index into a new column. In other words, it should look like this:
1 2 3 Day
Monday_0 NaN NaN NaN Monday
Monday_1 NaN NaN NaN Monday
Tuesday_2 NaN NaN NaN Tuesday
Tuesday_3 NaN NaN NaN Tuesday
So I have tried a number of different solutions:
df = df.reset_index()
df['Day'] = str(df['index']).split('_')
This gives me the whole series in every row.
df['Day'] = str(df.index.split('_')[0])
Doesn't work as index does not have a split function
df['Day'] = df.index.as_type('str').split('_')[0]
Doesn't work as index does not have an as_type function
df.index.set_levels(df.index.get_level_values(level=1).str.split('_')[0],
                    level=1, inplace=True)
Doesn't work as 'Index' object has no attribute 'set_levels'. I guess it only works with a MultiIndex?
And with that I am all out of ideas
Try str.split:
df['Day'] = df.index.str.split('_').str[0]
df
Out[219]:
1 2 3 Day
Monday_0 NaN NaN NaN Monday
Monday_1 NaN NaN NaN Monday
Tuesday_2 NaN NaN NaN Tuesday
Tuesday_3 NaN NaN NaN Tuesday
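If the suffix is always an underscore followed by digits (an assumption about the data), a regex-based alternative also works:
df['Day'] = df.index.str.replace(r'_\d+$', '', regex=True)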

Forward fill column one year after last observation

I forward fill values in the following df using:
df = (df.resample('d')  # ensure data is daily time series
        .ffill()
        .sort_index(ascending=True))
df before forward fill
id a b c d
datadate
1980-01-31 NaN NaN NaN NaN
1980-02-29 NaN 2 NaN NaN
1980-03-31 NaN NaN NaN NaN
1980-04-30 1 NaN 3 4
1980-05-31 NaN NaN NaN NaN
... ... ... ...
2019-08-31 NaN NaN NaN NaN
2019-09-30 NaN NaN NaN NaN
2019-10-31 NaN NaN NaN NaN
2019-11-30 NaN NaN NaN NaN
2019-12-31 NaN NaN 20 33
However, I wish to only forward fill one year after the last observation (the index is a datetime), with the remaining rows left as NaN. I am not sure of the best way to introduce this criterion. Any help would be super!
Thanks
If I understand you correctly, you want to forward-fill the values on Dec 31, 2019 to the next year. Try this:
end_date = df.index.max()
new_end_date = end_date + pd.offsets.DateOffset(years=1)
new_index = df.index.append(pd.date_range(end_date, new_end_date, closed='right'))
df = df.reindex(new_index)
df.loc[end_date:, :] = df.loc[end_date:, :].ffill()
Result:
a b c d
1980-01-31 NaN NaN NaN NaN
1980-02-29 NaN 2.0 NaN NaN
1980-03-31 NaN NaN NaN NaN
1980-04-30 1.0 NaN 3.0 4.0
1980-05-31 NaN NaN NaN NaN
2019-08-31 NaN NaN NaN NaN
2019-09-30 NaN NaN NaN NaN
2019-10-31 NaN NaN NaN NaN
2019-11-30 NaN NaN NaN NaN
2019-12-31 NaN NaN 20.0 33.0
2020-01-01 NaN NaN 20.0 33.0
2020-01-02 NaN NaN 20.0 33.0
...
2020-12-31 NaN NaN 20.0 33.0
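Note that the closed= argument of pd.date_range was deprecated in pandas 1.4 and removed in 2.0; on newer versions the same line would read:
new_index = df.index.append(pd.date_range(end_date, new_end_date, inclusive='right'))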
One solution is to forward fill using a limit parameter, but this won't handle leap years:
df.fillna(method='ffill', limit=365)
The second solution is to define a more robust function to do the forward fill in the 1-year window:
from pandas.tseries.offsets import DateOffset
def fun(serie_df):
    serie = serie_df.copy()
    indexes = serie[~serie.isnull()].index
    for idx in indexes:
        mask = (serie.index >= idx) & (serie.index < idx + DateOffset(years=1))
        serie.loc[mask] = serie[mask].fillna(method='ffill')
    return serie
df_filled = df.apply(fun, axis=0)
If a column has multiple non-NaN values in the same 1-year window, the first solution's fill will stop once the next value is encountered, while the second solution treats the consecutive values as if they were independent.
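As a quick sanity check of the second solution, a toy example (the daily index and the single observed value are made up):
import pandas as pd
s = pd.Series([1.0] + [None] * 399,
              index=pd.date_range('2019-01-01', periods=400, freq='D'))
toy = s.to_frame('a')
filled = toy.apply(fun, axis=0)
# column 'a' is filled up to and including 2019-12-31 (the 1-year window);
# the remaining 35 rows stay NaN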

My DataFrame is all NaN except the last column

I'm trying to loop over multiple JSON responses and, for each value in the list, add it to the DataFrame. For each JSON response, I create a column header. I always seem to get data only for the last column, so I believe there is clearly something wrong with the way I append the data.
from pycoingecko import CoinGeckoAPI
cg = CoinGeckoAPI()
df = pd.DataFrame()
timePeriod = 120
for x in range(10):
    try:
        data = cg.get_coin_market_chart_by_id(id=geckoList[x],
                                              vs_currency='btc', days='timePeriod')
        for y in range(timePeriod):
            df = df.append({geckoList[x]: data['prices'][y][1]},
                           ignore_index=True)
        print(geckoList[x])
    except:
        pass
geckoList example:
['bitcoin',
'ethereum',
'xrp',
'bitcoin-cash',
'litecoin',
'binance-coin']
Example JSON for one of the coins:
'prices': [[1565176840078, 0.029035263522626625],
[1565177102060, 0.029079747150763842],
[1565177434439, 0.029128983083947863],
[1565177700686, 0.029136960678700433],
[1565178005716, 0.0290826667213779],
[1565178303855, 0.029173025688296675],
[1565178602640, 0.029204331218623796],
[1565178911561, 0.029211943928343167],
The expected result would be a DataFrame with columns and rows of data for each crypto coin. Right now only the last column shows data.
Currently, it looks like this:
bitcoin ethereum bitcoin-cash
0 NaN NaN 0.33
1 NaN NaN 0.32
2 NaN NaN 0.21
3 NaN NaN 0.22
4 NaN NaN 0.25
5 NaN NaN 0.26
6 NaN NaN 0.22
7 NaN NaN 0.22
OK, I think I found the issue.
The problem is that you append data structures containing only one column to the frame, row by row, so all the other columns get filled with NaN. What I think you want is to join the columns by their timestamp, which is what I did in the example below. Let me know if this is what you need:
from pycoingecko import CoinGeckoAPI
import pandas as pd
cg = CoinGeckoAPI()
timePeriod = 120
gecko_list = ['bitcoin',
              'ethereum',
              'xrp',
              'bitcoin-cash',
              'litecoin',
              'binance-coin']
data = {}
for coin in gecko_list:
    try:
        nested_lists = cg.get_coin_market_chart_by_id(
            id=coin, vs_currency='btc', days=timePeriod)['prices']
        data[coin] = {}
        data[coin]['timestamps'], data[coin]['values'] = zip(*nested_lists)
    except Exception as e:
        print(e)
        print('coin: ' + coin)
frame_list = [pd.DataFrame(data[coin]['values'],
                           index=data[coin]['timestamps'],
                           columns=[coin])
              for coin in gecko_list
              if coin in data]
df = pd.concat(frame_list, axis=1).sort_index()
df.index = pd.to_datetime(df.index, unit='ms')
print(df)
This gets me the output
bitcoin ethereum bitcoin-cash litecoin
2019-08-07 12:20:14.490 NaN NaN 0.029068 NaN
2019-08-07 12:20:17.420 NaN NaN NaN 0.007890
2019-08-07 12:20:21.532 1.0 NaN NaN NaN
2019-08-07 12:20:27.730 NaN 0.019424 NaN NaN
2019-08-07 12:24:45.309 NaN NaN 0.029021 NaN
... ... ... ... ...
2019-08-08 12:15:47.548 NaN NaN NaN 0.007578
2019-08-08 12:18:41.000 NaN 0.018965 NaN NaN
2019-08-08 12:18:44.000 1.0 NaN NaN NaN
2019-08-08 12:18:54.000 NaN NaN NaN 0.007577
2019-08-08 12:18:59.000 NaN NaN 0.028144 NaN
[1153 rows x 4 columns]
This is the data I get if I switch days to 180.
To get daily data, use the groupby function:
df = df.groupby(pd.Grouper(freq='D')).mean()
On a data frame of 5 days, this gives me:
bitcoin ethereum bitcoin-cash litecoin
2019-08-03 1.0 0.020525 0.031274 0.008765
2019-08-04 1.0 0.020395 0.031029 0.008583
2019-08-05 1.0 0.019792 0.029805 0.008360
2019-08-06 1.0 0.019511 0.029196 0.008082
2019-08-07 1.0 0.019319 0.028837 0.007854
2019-08-08 1.0 0.018949 0.028227 0.007593
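As a side note (not part of the original answer), the daily Grouper is equivalent to resampling here:
df = df.resample('D').mean()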

How to set a cell value in a multi header/multi index pandas Dataframe

I have a Dataframe that looks like that:
SPY
Open High Low Close
Bid Ask Bid Ask Bid Ask Bid Ask
Date
2010-01-01 NaN NaN NaN NaN NaN NaN NaN NaN
2010-01-02 NaN NaN NaN NaN NaN NaN NaN NaN
2010-01-03 NaN NaN NaN NaN NaN NaN NaN NaN
2010-01-04 NaN NaN NaN NaN NaN NaN NaN NaN
I want to set a specific cell value, for example open bid for date 2010-01-04 so I tried this:
df.ix['2010-01-04', 'SPY']['Open']['Bid'] = 10
but nothing happened to the dataframe. When I remove ['Bid'] at the end, values for both bid and ask change, but I don't know how to change only one value at a time.
Use a tuple to get at the MultiIndex element
.ix has been deprecated. In this case, use .at; .loc would also work, but .at is more efficient when getting or setting a single cell.
df.at['2010-01-04', ('SPY', 'Open', 'Bid')] = 10
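A minimal runnable sketch, where the dates and the three-level column layout simply mirror the frame shown in the question:
import pandas as pd
cols = pd.MultiIndex.from_product([['SPY'],
                                   ['Open', 'High', 'Low', 'Close'],
                                   ['Bid', 'Ask']])
idx = pd.Index(['2010-01-01', '2010-01-02', '2010-01-03', '2010-01-04'], name='Date')
df = pd.DataFrame(index=idx, columns=cols, dtype=float)
df.at['2010-01-04', ('SPY', 'Open', 'Bid')] = 10
print(df.loc['2010-01-04', ('SPY', 'Open')])  # Bid is 10.0, Ask remains NaN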

Pandas Pivot Table Subsetting

My pivot table looks like this:
Symbol DIA QQQ SPY XLE DIA QQQ SPY XLE DIA QQQ \
Open Open Open Open High High High High Low Low
Date
19930129 NaN NaN 29.083294 NaN NaN NaN 29.083294 NaN NaN NaN
19930201 NaN NaN 29.083294 NaN NaN NaN 29.269328 NaN NaN NaN
19930202 NaN NaN 29.248658 NaN NaN NaN 29.352010 NaN NaN NaN
19930203 NaN NaN 29.372680 NaN NaN NaN 29.662066 NaN NaN NaN
19930204 NaN NaN 29.744748 NaN NaN NaN 29.827430 NaN NaN NaN
Symbol SPY XLE DIA QQQ SPY XLE DIA \
Low Low Close Close Close Close Total Volume
Date
19930129 28.938601 NaN NaN NaN 29.062624 NaN NaN
19930201 29.083294 NaN NaN NaN 29.269328 NaN NaN
19930202 29.186647 NaN NaN NaN 29.331340 NaN NaN
19930203 29.352010 NaN NaN NaN 29.641396 NaN NaN
19930204 29.414021 NaN NaN NaN 29.765419 NaN NaN
Symbol QQQ SPY XLE
Total Volume Total Volume Total Volume
Date
19930129 NaN 15167 NaN
19930201 NaN 7264 NaN
19930202 NaN 3043 NaN
19930203 NaN 8004 NaN
19930204 NaN 8035 NaN
How does one go about subsetting for a particular day and for a particular column value, say Closing prices for all symbols?
19930129 NaN NaN 29.062624 NaN
I tried pt['Close'], but it didn't seem to work. Only pt['SPY'] gives me the whole table of values for symbol SPY.
An alternative is to use xs, "cross-section":
In [21]: df.xs(axis=1, level=1, key="Open")
Out[21]:
Symbol DIA QQQ SPY XLE
Date
19930129 NaN NaN 29.083294 NaN
19930201 NaN NaN 29.083294 NaN
19930202 NaN NaN 29.248658 NaN
19930203 NaN NaN 29.372680 NaN
19930204 NaN NaN 29.744748 NaN
In [22]: df.xs(axis=1, level=1, key="Open").loc[19930129]
Out[22]:
Symbol
DIA NaN
QQQ NaN
SPY 29.083294
XLE NaN
Name: 19930129, dtype: float64
This is somewhat less powerful than unutbu's answer (using IndexSlice).
You could use pd.IndexSlice:
pt = pt.sort_index(axis=1)
pt.loc['19930129', pd.IndexSlice[:, 'Close']]
Using IndexSlice requires that the selection axes be fully lexsorted, hence the call to sort_index (sortlevel in older pandas).
Alternatively, slice(None) could also be used to select everything from the first column index level:
pt = pt.sort_index(axis=1)
pt.loc['19930129', (slice(None), 'Close')]
To select the ith row, but select the columns by label, you could use
pt.loc[pt.index[i], (slice(None), 'Close')]
Or, you could use pt.ix as Andy Hayden suggests, but be aware that if pt has an integer-valued index, then pt.ix performs label-based row indexing, not ordinal indexing.
So as long as 19930129 (and the other index values) are not integers -- i.e. pt.index is not an Int64Index -- you could use
pt.ix[i, (slice(None), 'Close')]
Note that chained indexing, such as
pt.iloc[i].loc[(slice(None), 'Close')]
should be avoided when performing assignments, since assignment with chained indexing may fail to modify pt.
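For example, performing the assignment through a single .loc call (the 0.0 below is just a placeholder value) is guaranteed to write into pt itself:
pt.loc[pt.index[i], (slice(None), 'Close')] = 0.0
whereas pt.iloc[i].loc[(slice(None), 'Close')] = 0.0 may silently modify a temporary copy instead.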
