I have a dataframe of historical stock trades. The frame has columns like ['ticker', 'date', 'cusip', 'profit', 'security_type']. Initially:
trades['cusip'] = np.nan
trades['security_type'] = np.nan
I have historical config files that I can load into frames that have columns like ['ticker', 'cusip', 'date', 'name', 'security_type', 'primary_exchange'].
I would like to UPDATE the trades frame with the cusip and security_type from config, but only where the ticker and date match.
I thought I could do something like:
pd.merge(trades, config, on=['ticker', 'date'], how='left')
But that doesn't update the columns, it just adds the config columns to trades.
The following works, but I think there has to be a better way. If not, I will probably do it outside of pandas.
for date in trades['date'].unique():
config = get_config_file_as_df(date)
## config['date'] == date
for ticker in trades['ticker'][trades['date'] == date]:
trades['cusip'][
(trades['ticker'] == ticker)
& (trades['date'] == date)
] \
= config['cusip'][config['ticker'] == ticker].values[0]
trades['security_type'][
(trades['ticker'] == ticker)
& (trades['date'] == date)
] \
= config['security_type'][config['ticker'] == ticker].values[0]
Suppose you have this setup:
import pandas as pd
import numpy as np
import datetime as DT
nan = np.nan
trades = pd.DataFrame({'ticker' : ['IBM', 'MSFT', 'GOOG', 'AAPL'],
'date' : pd.date_range('1/1/2000', periods = 4),
'cusip' : [nan, nan, 100, nan]
})
trades = trades.set_index(['ticker', 'date'])
print(trades)
# cusip
# ticker date
# IBM 2000-01-01 NaN
# MSFT 2000-01-02 NaN
# GOOG 2000-01-03 100 # <-- We do not want to overwrite this
# AAPL 2000-01-04 NaN
config = pd.DataFrame({'ticker' : ['IBM', 'MSFT', 'GOOG', 'AAPL'],
'date' : pd.date_range('1/1/2000', periods = 4),
'cusip' : [1,2,3,nan]})
config = config.set_index(['ticker', 'date'])
# Let's permute the index to show `DataFrame.update` correctly matches rows based on the index, not on the order of the rows.
new_index = sorted(config.index)
config = config.reindex(new_index)
print(config)
# cusip
# ticker date
# AAPL 2000-01-04 NaN
# GOOG 2000-01-03 3
# IBM 2000-01-01 1
# MSFT 2000-01-02 2
Then you can update NaN values in trades with values from config using the DataFrame.update method. Note that DataFrame.update matches rows based on indices (which is why set_index was called above).
trades.update(config, join = 'left', overwrite = False)
print(trades)
# cusip
# ticker date
# IBM 2000-01-01 1
# MSFT 2000-01-02 2
# GOOG 2000-01-03 100 # If overwrite = True, then 100 is overwritten by 3.
# AAPL 2000-01-04 NaN
Related
I've been using pandas for a while but I am having a really strange issue with simple filtering in pandas.
OS: Mac OS
IDE: VSCode
Pandas version: 1.4.2
I am fetching the latest data from crypto exchanges(via ccxt api) and append them into a dataframe.
limit = 28
timeframe = '1h'
futures_exchange = ccxt.kucoinfutures({
'apiKey' : MY_API_KEY,
'secret' : MY_API_SECRET,
'enableRateLimit': True,
'password': MY_KUCOIN_PASS_PHRASE,
})
all_future_list = [ 'BTC/USDT:USDT', 'BTC/USD:BTC', 'ETH/USDT:USDT', 'BCH/USDT:USDT', 'BSV/USDT:USDT', ]
attempts = 0
while attempts<= 20:
alldf = pd.DataFrame()
try:
for i in all_future_list:
df = futures_exchange.fetchOHLCV(i, limit=limit, timeframe=timeframe)
df = pd.DataFrame(df, columns=[['timestamp', 'open', 'high', 'low', 'close', 'volume']])
df['ticker'] = i
df['timestamp'] = df['timestamp'].astype('datetime64[ms]')
df = df[['timestamp', 'close', 'volume', 'ticker']]
df = df.tail(1)
alldf = pd.concat([alldf, df], ignore_index=True)
time.sleep(0.1)
except:
print(traceback.format_exc())
print(' ___ Network Error, restart fetching data _____ ')
attempts += 1
time.sleep(8)
continue
break
So far so good. Dataframe looks like...
timestamp close volume ticker
0 2022-09-17 10:00:00 19836.00 789336.0 BTC/USDT:USDT
1 2022-09-17 10:00:00 1411.2 982840 ETH/USDT:USDT
2 2022-09-17 10:00:00 120.95 55564.0 BCH/USDT:USDT
However, from there if I want to remove/filter a ticker from dataframe
new_df = alldf[alldf['ticker']!= 'BTC/USDT:USDT']
print(new_df)
Erroneous output:
timestamp close volume ticker
0 NaT NaN NaN NaN
1 NaT NaN NaN ETH/USDT:USDT
2 NaT NaN NaN BCH/USDT:USDT
3 NaT NaN NaN BSV/USDT:USDT
This should be a simple remove/filter but I dont understand why 'timestamp', 'close' and 'volume' columns are NaN
It seems if I write dataframe to a csv and then read it then I can get what I want.
alldf.to_csv('alldf.csv')
alldf = pd.read_csv('alldf.csv',index_col=0)
new_df = alldf[alldf['ticker']!= 'BTC/USDT:USDT']
print(new_df)
Output:
timestamp close volume ticker
1 2022-09-17 10:00:00 1411.2 982840 ETH/USDT:USDT
0 2022-09-17 10:00:00 120.95 55564.0 BCH/USDT:USDT
2 2022-09-17 11:00:00 51.95 3628.0 BSV/USDT:USDT
However, I dont want to write dataframe to csv and read it again just to filter out some tickers.
Can anyone help me ? Not sure whats going on here.
I guess this happens because that this piece of code
df = df[['timestamp', 'close', 'volume', 'ticker']]
df = df.tail(1)
subsets existing dataframe without actually copying the data. Therefore, target dataframe would get reference to original data, and when its changed - it gets broken.
Try this:
df = df[['timestamp', 'close', 'volume', 'ticker']]
df = df.tail(1).copy()
BTW, AFAIK, it's more efficient to keep new records in a dictionary and convert it to a dataframe at the end, rather then merging them one by one in a loop.
I try to build a simply stock screener. The screener should download the volume and avg. volume and should give me all stocks with the condition volume > avg. volume.
The Problem is, that something went wrong and i don't know how to fix the code. Its a overall problem in the code and i hope i can get some help. Often i get the message, that the data is empty. But i think there is something wrong with the tables and the conditions...
The first for loop gives me this and this is good, i think there is no mistake in the code
Ticker volume avg. volume
aapl 31 20
ayx 20 32
nflx 25 28
The second for loop is to check my condition ( volume > avg. volume)
but it gives me the following output:
ticker1 ... Stock
0 NaN ... ticker1
1 NaN ... Volume_1
2 NaN ... Average_Volume
[3 rows x 4 columns]
Process finished with exit code 0
normaly only apple fullfills the conditions and it must look like:
Stock volume avg. volume
aapl 31 20
Thats my code:
from yahoo_fin.stock_info import get_analysts_info, get_stats, get_live_price, get_quote_table
import pandas as pd
import ssl
ssl._create_default_https_context = ssl._create_unverified_context
tickers = ['aapl', 'ayx' , 'nflx']
exportList = pd.DataFrame(columns = ['ticker1', 'Volume_1', 'Average_Volume'])
for ticker in tickers:
df = get_stats(ticker)
df['ticker'] = ticker
df = df.pivot(index = 'ticker', columns = 'Attribute', values = 'Value')
df['Volume_1'] = get_quote_table(ticker)['Volume']
df['Average_Volume'] = get_quote_table(ticker)['Avg. Volume']
df = df[['Volume_1', 'Average_Volume']]
df = df.reset_index()
df.columns = ('ticker', 'Volume_1', 'Average_Volume')
rs_df = pd.DataFrame (columns = ['ticker1', 'Volume_1', 'Average_Volume'])
for stock in rs_df:
Volume_1 = df["Volume_1"]
Average_Volume = df["Average_Volume"]
if float(Volume_1) < float(Average_Volume):
exportList = exportList.append({'Stock': stock, "Volume_1": Volume_1, "Average_Volume": Average_Volume},
ignore_index=True)
print('\n', exportList)
I have found a ticker "NAN" (NAN:NUVEEN NEW YORK QUALITY MUNICIPAL INCOME FUND) but when I am trying to insert it into my data frame it's converting into None. I even tried to insert as str(ticker) with the value of ticker being 'NAN'. I am lost - how do I do that. Every stock ticker is working except the ticker 'NAN'
Exact code:
from previous code execution
ticker = 'NAN'
cusip = '67066X107'
cusipdf['ticker'] = np.where(( cusipdf['cusip'] == cusip ), str(ticker), cusipdf['ticker'] )
I did the following and I am able to replace the value of REPLACE to NAN. See below.
import pandas as pd
import numpy as np
c = ['ticker','cusip', 'value']
d = d = [['AMZN','51123X145',123.4567],
['REPLACE','62343X145',223.1237],
['AAPL','56789X225',312.5767],
['GOOG','42154X638',331.8793]]
import pandas as pd
df = pd.DataFrame(data=d,columns=c)
print (df)
t = 'NAN'
df['ticker'] = np.where((df['cusip'] == '62343X145' ), str(t), df['ticker'] )
print (df)
ticker cusip value
0 AMZN 51123X145 123.4567
1 REPLACE 62343X145 223.1237
2 AAPL 56789X225 312.5767
3 GOOG 42154X638 331.8793
ticker cusip value
0 AMZN 51123X145 123.4567
1 NAN 62343X145 223.1237
2 AAPL 56789X225 312.5767
3 GOOG 42154X638 331.8793
I'm doing a finance study based on the youtube link below and I would like to understand why I got the NaN return instead of the expected calculation. What do I need to do in this script to reach the expected value?
YouTube case: https://www.youtube.com/watch?v=UpbpvP0m5d8
import investpy as env
import numpy as np
import pandas as pd
lt = ['ABEV3','CEAB3','ENBR3','FLRY3','IRBR3','ITSA4','JHSF3','STBP3']
prices = pd.DataFrame()
for i in lt:
df = env.get_stock_historical_data(stock=i, from_date='01/01/2020', to_date='29/05/2020', country='brazil')
df['Ativo'] = i
prices = pd.concat([prices, df], sort=True)
pivoted = prices.pivot(columns='Ativo', values='Close')
e_r = pivoted.resample('Y').last().pct_change().mean()
e_r
Return:
Ativo
ABEV3 NaN
CEAB3 NaN
ENBR3 NaN
FLRY3 NaN
IRBR3 NaN
ITSA4 NaN
JHSF3 NaN
STBP3 NaN
dtype: float64
You need to change the 'from_date' to have more than one year of data.
You current script returns one row and .pct_change() on one row of data returns NaN, because there is no previous row to compare against.
When I changed from_date to '01/01/2018'
import investpy as env
import numpy as np
import pandas as pd
lt = ['ABEV3','CEAB3','ENBR3','FLRY3','IRBR3','ITSA4','JHSF3','STBP3']
prices = pd.DataFrame()
for i in lt:
df = env.get_stock_historical_data(stock=i, from_date='01/01/2018', to_date='29/05/2020', country='brazil')
df['Ativo'] = i
prices = pd.concat([prices, df], sort=True)
pivoted = prices.pivot(columns='Ativo', values='Close')
e_r = pivoted.resample('Y').last().pct_change().mean()
e_r
I get the following output:
Ativo
ABEV3 -0.043025
CEAB3 -0.464669
ENBR3 0.180655
FLRY3 0.191976
IRBR3 -0.175084
ITSA4 -0.035767
JHSF3 1.283291
STBP3 0.223627
dtype: float64
I want to add columns to the following Dataframe for each stock of 5 year (60 month) rolling returns. The following code is used to obtain the financial data over the period 1995 to 2010.
quandl.ApiConfig.api_key = 'Enter Key'
stocks = ['MSFT', 'AAPL', 'WMT', 'GE', 'KO']
stockdata = quandl.get_table('WIKI/PRICES', ticker = stocks, paginate=True,
qopts = { 'columns': ['date', 'ticker', 'adj_close'] },
date = { 'gte': '1995-1-1', 'lte': '2010-12-31' })
# Setting date as index with columns of tickers and adjusted closing price
df = stockdata.pivot(index = 'date',columns='ticker')
df.index = pd.to_datetime(df.index)
df.resample('1M').mean()
df = df.pct_change()
df.head()
Out[1]:
rets
ticker AAPL BA F GE JNJ KO
date
1995-01-03 NaN NaN NaN NaN NaN NaN
1995-01-04 0.026055 -0.002567 0.026911 0.000000 0.006972 -0.019369
1995-01-05 -0.012697 0.002573 -0.008735 0.002549 -0.002369 -0.004938
1995-01-06 0.080247 0.018824 0.000000 -0.004889 -0.006758 0.000000
1995-01-09 -0.019048 0.000000 0.017624 -0.009827 -0.011585 -0.014887
df.tail()
Out[2]:
rets
ticker AAPL BA F GE JNJ KO
date
2010-12-27 0.003337 -0.004765 0.005364 0.008315 -0.005141 -0.007777
2010-12-28 0.002433 0.001699 -0.008299 0.007147 0.001938 0.004457
2010-12-29 -0.000553 0.002929 0.000598 -0.002729 0.001289 0.001377
2010-12-30 -0.005011 -0.000615 -0.002987 -0.004379 -0.003058 0.000764
2010-12-31 -0.003399 0.003846 0.005992 0.005498 -0.001453 0.004122
Any assistance of how to do this would be awesome!
The problem is in the multi-level index in the columns. We can start by selecting the second level index, and after that the rolling mean works:
means = df['rets'].rolling(60).mean()
means.tail()
Gives:
The error you are receiving is due to you passing the entire dataframe into the rolling function since your frame uses a multi index. You cant pass a multi index frame to a rolling function since rolling only accepts numpy arrays of 1 column. You’ll have to probably create a for loop and return the values individually per ticker