I am fairly new to python and coding in general.
I have a big data file with daily data for the period 2011-2018 for a number of stock tickers (~300).
The data is a .csv file with circa 150k rows and looks as follows (short example):
Date,Symbol,ShortExemptVolume,ShortVolume,TotalVolume
20110103,AAWW,0.0,28369,78113.0
20110103,AMD,0.0,3183556,8095093.0
20110103,AMRS,0.0,14196,18811.0
20110103,ARAY,0.0,31685,77976.0
20110103,ARCC,0.0,177208,423768.0
20110103,ASCMA,0.0,3930,26527.0
20110103,ATI,0.0,193772,301287.0
20110103,ATSG,0.0,23659,72965.0
20110103,AVID,0.0,7211,18896.0
20110103,BMRN,0.0,21740,213974.0
20110103,CAMP,0.0,2000,11401.0
20110103,CIEN,0.0,625165,1309490.0
20110103,COWN,0.0,3195,24293.0
20110103,CSV,0.0,6133,25394.0
I have a function that allows me to filter for a specific symbol and get 10 observations before and after a specified date (could be any date between 2011 and 2018).
import pandas as pd

def get_data(issue_date, stock_ticker):
    df = pd.read_csv(r'D:\Project\Data\Short_Interest\exampledata.csv')
    df['Date'] = pd.to_datetime(df['Date'], format="%Y%m%d")
    short = df.loc[df.Symbol.eq(stock_ticker)]
    # get the index label of the row of interest
    ix = short[short.Date.eq(issue_date)].index[0]
    # get the positional index for that label
    iloc_ix = short.index.get_loc(ix)
    # take the +/-10 iloc rows (+11 because slice ends are exclusive),
    # i.e. ten trading days either side
    short_data = short.iloc[iloc_ix-10: iloc_ix+11]
    return short_data
I want to create a script that iterates over a list of 'issue_dates' and 'stock_tickers'. The list (a .csv) looks as follows:
ARAY,07/08/2017
ARAY,24/04/2014
ACETQ,16/11/2015
ACETQ,16/11/2015
NVLNA,15/08/2014
ATSG,29/09/2017
ATI,24/05/2016
MDRX,18/06/2013
MDRX,18/06/2013
AMAGX,10/05/2017
AMAGX,14/02/2014
AMD,14/09/2016
To break down my problem and question, I would like to know how to do the following:
First, how do I load the inputs?
Second, how do I call the function on each input?
And last, how do I accumulate all the function returns in one dataframe?
To load the inputs and call the function for each row, iterate over the csv file, pass each row's values to the function, and accumulate the resulting Series in a list.
I modified your function a bit: I removed the DataFrame creation so it is only done once, and I added a try/except block to account for missing dates or tickers (your example data didn't match up too well). The dates in the second csv look like day/month/year, so I parsed them with that format.
import pandas as pd
import datetime, csv

def get_data(df, issue_date, stock_ticker):
    '''Return a Series for the ticker centered on the issue date.
    '''
    short = df.loc[df.Symbol.eq(stock_ticker)]
    # get the index label of the row of interest
    try:
        ix = short[short.Date.eq(issue_date)].index[0]
        # get the positional index for that label
        iloc_ix = short.index.get_loc(ix)
        # take the +/-10 iloc rows (+11 because slice ends are exclusive),
        # i.e. ten trading days either side
        short_data = short.iloc[iloc_ix-10: iloc_ix+11]
    except IndexError:
        msg = f'no data for {stock_ticker} on {issue_date}'
        #log.info(msg)
        print(msg)
        short_data = None
    return short_data

df = pd.read_csv(datafile)  # datafile: path to the big short-volume csv
df['Date'] = pd.to_datetime(df['Date'], format="%Y%m%d")

results = []
with open('issues.csv') as issues:
    for ticker, date in csv.reader(issues):
        day, month, year = map(int, date.split('/'))
        # dt = datetime.datetime.strptime(date, r'%d/%m/%Y')
        date = datetime.date(year, month, day)
        s = get_data(df, date, ticker)
        results.append(s)
        # print(s)
Creating a single DataFrame or table for all that info may be problematic, especially since the date ranges are all different. You should probably ask a separate question about that; its MCVE should include a few minimal Pandas Series with a couple of different date ranges and tickers.
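If you do want a single frame anyway, here is a minimal sketch, assuming each element of results is either a DataFrame slice or None as returned above: pd.concat with keys keeps the slices distinguishable despite the differing date ranges.
frames = [s for s in results if s is not None]
# keys gives each slice its own outer index level, so rows from
# different (ticker, date) queries do not collide
combined = pd.concat(frames, keys=range(len(frames)), names=['query', None])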
My goal:
I have two time-series data frames, one with a time interval of 1m and the other with a time interval of 5m. The 5m data frame is a resampled version of the 1m data. I compute a set of RSI values that correspond to the 5m df using the vectorbt library, then align and broadcast these values onto the 1m df using df.align.
The Problem:
When I run this logic line by line, it works perfectly. However, when it is applied inside the function, it raises the following error even though the indexes do overlap:
ValueError: cannot join with no overlapping index names
Here's the complete code:
import vectorbt as vbt
import numpy as np
import pandas as pd
import datetime

end_date = datetime.datetime.now()
start_date = end_date - datetime.timedelta(days=3)
btc_price = vbt.YFData.download('BTC-USD',
                                interval='1m',
                                start=start_date,
                                end=end_date,
                                missing_index='drop').get('Close')

def custom_indicator(close, rsi_window=14, ma_window=50):
    close_5m = close.resample('5T').last()
    rsi = vbt.RSI.run(close_5m, window=rsi_window).rsi
    rsi, _ = rsi.align(close, broadcast_axis=0, method='ffill')
    print(rsi)    # to check
    print(close)  # to check
    return

# setting up the indicator factory
ind = vbt.IndicatorFactory(
    class_name='Combination',
    short_name='comb',
    input_names=['close'],
    param_names=['rsi_window', 'ma_window'],
    output_names=['value']).from_apply_func(custom_indicator,
                                            rsi_window=14,
                                            ma_window=50,
                                            keep_pd=True)

res = ind.run(btc_price, rsi_window=21, ma_window=50)
print(res)
Thank you for taking the time to read this. Any help would be appreciated!
If you check the columns of both rsi and close:
print('close is', close.columns)
print('rsi is', rsi.columns)
you will find:
rsi is MultiIndex([(14, 'Close')],
       names=['rsi_window', None])
close is Index(['Close'], dtype='object')
rsi has two column levels, so one of them should be dropped. That can be done with the code below:
rsi.columns = rsi.columns.droplevel()
Dropping one level of the column index lets the two be aligned.
The problem is that close must be a time series (a Series), not a pandas DataFrame, for joins with align to work.
You need to fix the data type first:
# Time Series
close = close['Close']
close_5m = close.resample('15min').last()
rsi = vbt.RSI.run(close_5m, window=rsi_window).rsi
rsi, _ = rsi.align(close, broadcast_axis=0, method='ffill', join='right')
When you are aligning the data, make sure to include join='right':
rsi, _ = rsi.align(close, broadcast_axis=0, method='ffill', join='right')
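Putting both fixes together, a minimal sketch of the indicator function might look like this; it assumes the vectorbt API exactly as used in the question and keeps the question's 5-minute resample:
def custom_indicator(close, rsi_window=14, ma_window=50):
    close = close['Close']  # work with a Series, not a DataFrame
    close_5m = close.resample('5T').last()
    rsi = vbt.RSI.run(close_5m, window=rsi_window).rsi
    # align back onto the 1m data; join='right' keeps the 1m index
    rsi, _ = rsi.align(close, broadcast_axis=0, method='ffill', join='right')
    return rsi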
I am working on a personal project collecting the data on Covid-19 cases. The data set only shows the total number of Covid-19 cases per state cumulatively. I would like to add a column that contains the new cases added that day. This is what I have so far:
import pandas as pd
from datetime import date
from datetime import timedelta
import numpy as np
#read the CSV from github
hist_US_State = pd.read_csv("https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv")
#some code to get yesterday's date and the day before which is needed later.
today = date.today()
yesterday = today - timedelta(days = 1)
yesterday = str(yesterday)
day_before_yesterday = today - timedelta(days = 2)
day_before_yesterday = str(day_before_yesterday)
#Extracting yesterday's and the day before cases and combine them in one dataframe
yesterday_cases = hist_US_State[hist_US_State["date"] == yesterday]
day_before_yesterday_cases = hist_US_State[hist_US_State["date"] == day_before_yesterday]
total_cases = pd.DataFrame()
total_cases = day_before_yesterday_cases.append(yesterday_cases)
#Adding a new column called "new_cases" and this is where I get into trouble.
total_cases["new_cases"] = yesterday_cases["cases"] - day_before_yesterday_cases["cases"]
Can you please point out what I am doing wrong?
Because you defined total_cases as a concatenation (via append) of yesterday_cases and day_before_yesterday_cases, its number of rows is the sum of the other two dataframes'. It looks like yesterday_cases and day_before_yesterday_cases both have 55 rows, so total_cases has 110 rows; your last line is therefore trying to fill 110 slots from a 55-value subtraction. On top of that, pandas aligns on the index when subtracting and assigning, and the two day-slices have disjoint indexes, so the subtraction itself produces NaN.
You may either want to reshape your data so that each date is its own column, or align the two days' rows (for example by state) before subtracting.
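As a hedged sketch of an alternative that avoids the reshape entirely, you can compute daily new cases per state straight from the cumulative column with groupby + diff (column names are from the NYT dataset the question loads):
import pandas as pd

hist_US_State = pd.read_csv(
    "https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv"
)
# sort so that diff() runs in date order within each state
# (the dates are ISO-formatted strings, so string order is date order)
hist_US_State = hist_US_State.sort_values(["state", "date"])
# diff within each state turns the cumulative count into daily new cases
hist_US_State["new_cases"] = hist_US_State.groupby("state")["cases"].diff()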
I am trying to compare two different values in a dataframe. I wasn't able to apply the questions/answers I've found.
import pandas as pd
# from datetime import timedelta
"""
read csv file
clean date column
convert date str to datetime
sort for equity options
replace date str column with datetime column
"""
trade_reader = pd.read_csv('TastyTrades.csv')
trade_reader['Date'] = trade_reader['Date'].replace({'T': ' ', '-0500': ''}, regex=True)
date_converter = pd.to_datetime(trade_reader['Date'], format="%Y-%m-%d %H:%M:%S")
options_frame = trade_reader.loc[(trade_reader['Instrument Type'] == 'Equity Option')]
clean_frame = options_frame.replace(to_replace=['Date'], value='date_converter')
# Separate opening transaction from closing transactions, combine frames
opens = clean_frame[clean_frame['Action'].isin(['BUY_TO_OPEN', 'SELL_TO_OPEN'])]
closes = clean_frame[clean_frame['Action'].isin(['BUY_TO_CLOSE', 'SELL_TO_CLOSE'])]
open_close_set = set(opens['Symbol']) & set(closes['Symbol'])
open_close_frame = clean_frame[clean_frame['Symbol'].isin(open_close_set)]
'''
convert Value to float
sort for trade readability
write
'''
ocf_float = open_close_frame['Value'].astype(float)
ocf_sorted = open_close_frame.sort_values(by=['Date', 'Call or Put'], ascending=True)
# for readability, revert back to ocf_sorted below
ocf_list = ocf_sorted.drop(
    ['Type', 'Instrument Type', 'Description', 'Quantity', 'Average Price', 'Commissions', 'Fees', 'Multiplier'], axis=1
)
ocf_list.reset_index(drop=True, inplace=True)
ocf_list['Strategy'] = ''
# ocf_list.to_csv('Sorted.csv')
# create strategy list
debit_single = []
debit_vertical = []
debit_calendar = []
credit_vertical = []
iron_condor = []
# shift columns
ocf_list['Symbol Shift'] = ocf_list['Underlying Symbol'].shift(1)
ocf_list['Symbol Check'] = ocf_list['Underlying Symbol'] == ocf_list['Symbol Shift']
# compare symbols, append depending on criteria met
for row in ocf_list:
    if row['Symbol Shift'] is row['Underlying Symbol']:
        debit_vertical.append(row)
print(type(ocf_list['Underlying Symbol']))
ocf_list.to_csv('Sorted.csv')
print(debit_vertical)
# delta = timedelta(seconds=10)
The error I get is:
line 51, in <module>
if row['Symbol Check'][-1] is row['Underlying Symbol'][-1]:
TypeError: string indices must be integers
I am trying to compare the newly created shifted column to the original, and if they are the same, append the row to a list. Is there a way to compare two string values at all in Python? I've tried checking whether Symbol Check is true and it still raises the error that string indices must be integers. .iterrows() didn't work either.
Here, you will actually iterate through the columns of your DataFrame, not the rows:
for row in ocf_list:
    if row['Symbol Shift'] is row['Underlying Symbol']:
        debit_vertical.append(row)
You can use one of the methods iterrows or itertuples to iterate through the rows instead, but mind what they yield: iterrows gives (index, Series) pairs, so indexing by column name still works on the Series, while itertuples gives namedtuples whose fields are renamed when a column name is not a valid identifier (as with 'Symbol Shift'), so attribute access by that name won't work.
Second, you should use == instead of is since you are probably comparing values, not identities.
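If you do iterate, a minimal sketch with iterrows (column names as in the question):
debit_vertical = []
# each `row` here is a Series, so column-name indexing works
for idx, row in ocf_list.iterrows():
    if row['Symbol Shift'] == row['Underlying Symbol']:
        debit_vertical.append(row)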
Lastly, I would skip iterating over the rows entirely, as pandas is made for selecting rows based on a condition. You should be able to replace the aforementioned code with this:
debit_vertical = ocf_list[ocf_list['Symbol Shift'] == ocf_list['Underlying Symbol']].values.tolist()
If this question is unclear, I am very open to constructive criticism.
I have an excel table with about 50 rows of data, with the first column in each row being a date. I need to access all the data for only one date, and that date appears only about 1-5 times. It is the most recent date so I've already organized the table by date with the most recent being at the top.
So my goal is to store that date in a variable and then have Python look only for that variable (that date) and take only the columns corresponding to that variable. I need to use this code on 100's of other excel files as well, so it would need to arbitrarily take the most recent date (always at the top though).
My current code below simply takes the first 5 rows because I know that's how many times this date occurs.
import os
import pandas as pd

path = 'Z:\\folderwithcsvfile'
for filename in os.listdir(path):
    file_path = os.path.join(path, filename)
    if os.path.isfile(file_path):
        broken_df = pd.read_csv(file_path)
        df3 = broken_df['DATE']
        df4 = broken_df['TRADE ID']
        df5 = broken_df['AVAILABLE STOCK']
        df6 = broken_df['AMOUNT']
        df7 = broken_df['SALE PRICE']
        print(df3)
        # print(df3.head(6))
        print(df4.head(6))
        print(df5.head(6))
        print(df6.head(6))
        print(df7.head(6))
This is a relatively simple filtering operation. You state that you want to "take only the columns" for the latest date, so I assume an acceptable result is a filtered DataFrame containing just the rows for that date.
Here's a simple CSV that is similar to your structure:
DATE,TRADE ID,AVAILABLE STOCK
10/11/2016,123,123
10/11/2016,123,123
10/10/2016,123,123
10/9/2016,123,123
10/11/2016,123,123
Note that I mixed up the dates a little bit, because it's hacky and error-prone to just assume that the latest dates will be on the top. The following script will filter it appropriately:
import pandas as pd

df = pd.read_csv('data.csv')
# convert the DATE column to datetimes
df['DATE'] = pd.to_datetime(df['DATE'])
# find the latest datetime
latest_date = df['DATE'].max()
# use boolean filtering to keep only the rows that equal the latest date
latest_rows = df[df['DATE'] == latest_date]
print(latest_rows)
# now you can perform your operations on latest_rows
In my example, this will print:
DATE TRADE ID AVAILABLE STOCK
0 2016-10-11 123 123
1 2016-10-11 123 123
4 2016-10-11 123 123
I am creating a pandas dataframe from historical weather data downloaded from weather underground.
import json
import requests
import pandas as pd
import numpy as np
import datetime
from dateutil.parser import parse
address = "http://api.wunderground.com/api/7036740167876b59/history_20060405/q/CA/San_Francisco.json"
r = requests.get(address)
wu_data = r.json()
Because I do not need all the data, I only use the list of observations. Each observation contains two fields, date and utcdate, that are actually dictionaries.
df = pd.DataFrame.from_dict(wu_data["history"]["observations"])
I would like to index the dataframe I have created with the parsed date from the 'pretty' key within the dictionary. I can access this value by using the array index, but I can't figure out how to do this directly without a loop. For example, for the 23rd element I can write
pretty_date = df["date"].values[23]["pretty"]
print pretty_date
time = parse(pretty_date)
print time
And I get
11:56 PM PDT on April 05, 2006
2006-04-05 23:56:00
This is what I am doing at the moment:
g = lambda x: parse(x["pretty"])
df_dates = pd.DataFrame.from_dict(df["date"])
df.index = df_dates["date"].apply(g)
df is now reindexed. At this point I can remove the columns I do not need.
Is there a more direct way to do this?
Please notice that sometimes there are multiple observations for the same date, but I deal with data cleaning, duplicates, etc. in a different part of the code.
Since the date column just holds plain objects (dicts), you can simply pull the "pretty" values out with a list comprehension and use them as the index. Not sure if this is what you want:
# note: in the old requests versions of this era, `json` was a property;
# in current versions it is a method, so today you would call r.json()
wu_data = r.json
df = pd.DataFrame.from_dict(wu_data["history"]["observations"])
# index using a list comprehension, reading "pretty" from each df["date"] dict
df.index = [parse(df["date"][n]["pretty"]) for n in range(len(df))]
df.index
<class 'pandas.tseries.index.DatetimeIndex'>
[2006-04-05 00:56:00, ..., 2006-04-05 23:56:00]
Length: 24, Freq: None, Timezone: None
Hope this helps.
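As a hedged alternative sketch that skips the explicit comprehension, Series.map does the same thing elementwise:
# same idea via map: parse each dict's "pretty" value
df.index = df["date"].map(lambda d: parse(d["pretty"]))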