I am fairly new to python and coding in general.
I have a big data file that provides daily data for the period 2011-2018 for a number of stock tickers (300~).
The data is a .csv file with circa 150k rows and looks as follows (short example):
Date,Symbol,ShortExemptVolume,ShortVolume,TotalVolume
20110103,AAWW,0.0,28369,78113.0
20110103,AMD,0.0,3183556,8095093.0
20110103,AMRS,0.0,14196,18811.0
20110103,ARAY,0.0,31685,77976.0
20110103,ARCC,0.0,177208,423768.0
20110103,ASCMA,0.0,3930,26527.0
20110103,ATI,0.0,193772,301287.0
20110103,ATSG,0.0,23659,72965.0
20110103,AVID,0.0,7211,18896.0
20110103,BMRN,0.0,21740,213974.0
20110103,CAMP,0.0,2000,11401.0
20110103,CIEN,0.0,625165,1309490.0
20110103,COWN,0.0,3195,24293.0
20110103,CSV,0.0,6133,25394.0
I have a function that allows me to filter for a specific symbol and get 10 observations before and after a specified date (could be any date between 2011 and 2018).
import pandas as pd
from datetime import datetime
import urllib
import datetime
def get_data(issue_date, stock_ticker):
df = pd.read_csv (r'D:\Project\Data\Short_Interest\exampledata.csv')
df['Date'] = pd.to_datetime(df['Date'], format="%Y%m%d")
d = df
df = pd.DataFrame(d)
short = df.loc[df.Symbol.eq(stock_ticker)]
# get the index of the row of interest
ix = short[short.Date.eq(issue_date)].index[0]
# get the item row for that row's index
iloc_ix = short.index.get_loc(ix)
# get the +/-1 iloc rows (+2 because that is how slices work), basically +1 and -1 trading days
short_data = short.iloc[iloc_ix-10: iloc_ix+11]
return [short_data]
I want to create a script that iterates a list of 'issue_dates' and 'stock_tickers'. The list (a .csv) looks as following:
ARAY,07/08/2017
ARAY,24/04/2014
ACETQ,16/11/2015
ACETQ,16/11/2015
NVLNA,15/08/2014
ATSG,29/09/2017
ATI,24/05/2016
MDRX,18/06/2013
MDRX,18/06/2013
AMAGX,10/05/2017
AMAGX,14/02/2014
AMD,14/09/2016
To break down my problem and question I would like to know how to do the following:
First, how do I load the inputs?
Second, how do I call the function on each input?
And last, how do I accumulate all the function returns in one dataframe?
To load the inputs and call the function for each row; iterate over the csv file and pass each row's values to the function and accumulate the resulting Seriesin a list.
I modified your function a bit: removed the DataFrame creation so it is only done once and added a try/except block to account for missing dates or tickers (your example data didn't match up too well). The dates in the second csv look like they are day/month/year so I converted them for that format.
import pandas as pd
import datetime, csv
def get_data(df, issue_date, stock_ticker):
'''Return a Series for the ticker centered on the issue date.
'''
short = df.loc[df.Symbol.eq(stock_ticker)]
# get the index of the row of interest
try:
ix = short[short.Date.eq(issue_date)].index[0]
# get the item row for that row's index
iloc_ix = short.index.get_loc(ix)
# get the +/-1 iloc rows (+2 because that is how slices work), basically +1 and -1 trading days
short_data = short.iloc[iloc_ix-10: iloc_ix+11]
except IndexError:
msg = f'no data for {stock_ticker} on {issue_date}'
#log.info(msg)
print(msg)
short_data = None
return short_data
df = pd.read_csv (datafile)
df['Date'] = pd.to_datetime(df['Date'], format="%Y%m%d")
results = []
with open('issues.csv') as issues:
for ticker,date in csv.reader(issues):
day,month,year = map(int,date.split('/'))
# dt = datetime.datetime.strptime(date, r'%d/%m/%Y')
date = datetime.date(year,month,day)
s = get_data(df,date,ticker)
results.append(s)
# print(s)
Creating a single DataFrame or table for all that info may be problematic especially since the date ranges are all different. Probably should ask a separate question regarding that. Its mcve should probably just include a few minimal Pandas Series with a couple of different date ranges and tickers.
I have data in CSV like this:
1940-10-01,somevalue
1940-11-02,somevalue
1940-11-03,somevalue
1940-11-04,somevalue
1940-12-05,somevalue
1940-12-06,somevalue
1941-01-07,somevalue
1941-02-08,somevalue
1941-03-09,somevalue
1941-05-01,somevalue
1941-06-02,somevalue
1941-07-03,somevalue
1941-10-04,somevalue
1941-12-05,somevalue
1941-12-06,somevalue
1942-01-07,somevalue
1942-02-08,somevalue
1942-03-09,somevalue
I want to separate the dates from 1-oct-year to 31-march-next-year for all data. So for data above output will be:
1940/1941:
1940-11-02,somevalue
1940-11-03,somevalue
1940-11-04,somevalue
1940-12-05,somevalue
1940-12-06,somevalue
1941-01-07,somevalue
1941-02-08,somevalue
1941-03-09,somevalue
1941/1942:
1941-10-04,somevalue
1941-12-05,somevalue
1941-12-06,somevalue
1942-01-07,somevalue
1942-02-08,somevalue
1942-03-09,somevalue
1942-10-01,somevalue
My code trails are:
import csv
from datetime import datetime
with open('data.csv','r') as f:
data = list(csv.reader(f))
quaters = []
year = datetime.strptime(data[0][0], '%Y-%m-%d').year
for each in data:
date = datetime.strptime(each[0], '%Y-%m-%d')
print(each)
if (date>=datetime(year=date.year,month=10,day=1) and date<=datetime(year=date.year+1,month=3,day=31)):
middle_quaters[-1].append(each)
if year != date.year:
quaters.append([])
But I am not getting expected output. I want to store each range of dates in separate list.
I would use pandas dataframe to do this..
it would be easier..
follow this
Pandas: Selecting DataFrame rows between two dates (Datetime Index)
so for your case
data = pd.read_csv("data.csv")
df.loc[startDate : endDate]
# you can walk through a bunch of ranges like so..
listOfDateRanges = [(), (), ()]
for date_range in listOfDateRanges:
df.loc[date_range[0] : date_range[1]]
Without external packages... create a lookup based on the field of choice, and then make an int of it and do a less that vs greater than to establish the range.
import re
data = '''1940-10-01,somevalue
1940-11-02,somevalue
1940-11-03,somevalue
1940-11-04,somevalue
1940-12-05,somevalue
1940-12-06,somevalue
1941-01-07,somevalue
1941-02-08,somevalue
1941-03-09,somevalue
1941-05-01,somevalue
1941-06-02,somevalue
1941-07-03,somevalue
1941-10-04,somevalue
1941-12-05,somevalue
1941-12-06,somevalue
1942-01-07,somevalue
1942-02-08,somevalue
1942-03-09,somevalue'''
lookup={}
lines = data.split('\n')
for line in lines:
d = re.sub(r'-','',line.split(',')[0])
lookup[d]=line
dates=sorted(lookup.keys())
_in=19401201
out=19411004
outfile=[]
for date in dates:
if int(date) > _in and int(date) < out:
outfile.append(lookup[date])
for l in outfile:
print outfile
For this purpose you can use pandas library. Here is the sample code for the same:
import pandas as pd
df = pd.read_csv('so.csv', parse_dates=['timestamp']) #timestamp is your time column
current_year, next_year = 1940, 1941
df = df.query(f'(timestamp >= "{current_year}-10-01") & (timestamp <= "{next_year}-03-31")')
print (df)
This gives following result on your data:
timestamp value
0 1940-10-01 somevalue
1 1940-11-02 somevalue
2 1940-11-03 somevalue
3 1940-11-04 somevalue
4 1940-12-05 somevalue
5 1940-12-06 somevalue
6 1941-01-07 somevalue
7 1941-02-08 somevalue
8 1941-03-09 somevalue
Hope this helps!
SOLVED Found the solution by myself. Turns out that when you want to retrieve specific columns by their names you should pass the names in the order they appear inside the csv (which is really stupid for a library that is intended to save some parsing time for a developer IMO). Correct me if I am wrong but i dont see a on option to get a specific columns values by its name if the columns are in a different order...
I am trying to read a comma separated value file with python and then
parse it using Pandas library. Since the file has many values (columns) that are not needed I make a list of the column names i do need.
Here's a look at the csv file format.
Div,Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR,HTHG,HTAG,HTR,Attendance,Referee,HS,AS,HST,AST,HHW,AHW,HC,AC,HF,AF,HO,AO,HY,AY,HR,AR,HBP,ABP,GBH,GBD,GBA,IWH,IWD,IWA,LBH,LBD,LBA,SBH,SBD,SBA,WHH,WHD,WHA
E0,19/08/00,Charlton,Man City,4,0,H,2,0,H,20043,Rob
Harris,17,8,14,4,2,1,6,6,13,12,8,6,1,2,0,0,10,20,2,3,3.2,2.2,2.9,2.7,2.2,3.25,2.75,2.2,3.25,2.88,2.1,3.2,3.1
E0,19/08/00,Chelsea,West Ham,4,2,H,1,0,H,34914,Graham
Barber,17,12,10,5,1,0,7,7,19,14,2,3,1,2,0,0,10,20,1.47,3.4,5.2,1.6,3.2,4.2,1.5,3.4,6,1.5,3.6,6,1.44,3.6,6.5
E0,19/08/00,Coventry,Middlesbrough,1,3,A,1,1,D,20624,Barry
Knight,6,16,3,9,0,1,8,4,15,21,1,3,5,3,1,0,75,30,2.15,3,3,2.2,2.9,2.7,2.25,3.2,2.75,2.3,3.2,2.75,2.3,3.2,2.62
E0,19/08/00,Derby,Southampton,2,2,D,1,2,A,27223,Andy
D'Urso,6,13,4,6,0,0,5,8,11,13,0,2,1,1,0,0,10,10,2,3.1,3.2,1.8,3,3.5,2.2,3.25,2.75,2.05,3.2,3.2,2,3.2,3.2
E0,19/08/00,Leeds,Everton,2,0,H,2,0,H,40010,Dermot
Gallagher,17,12,8,6,0,0,6,4,21,20,6,1,1,3,0,0,10,30,1.65,3.3,4.3,1.55,3.3,4.5,1.55,3.5,5,1.57,3.6,5,1.61,3.5,4.5
E0,19/08/00,Leicester,Aston Villa,0,0,D,0,0,D,21455,Mike
Riley,5,5,4,3,0,0,5,4,12,12,1,4,2,3,0,0,20,30,2.15,3.1,2.9,2.3,2.9,2.5,2.35,3.2,2.6,2.25,3.25,2.75,2.4,3.25,2.5
E0,19/08/00,Liverpool,Bradford,1,0,H,0,0,D,44183,Paul
Durkin,16,3,10,2,0,0,6,1,8,8,5,0,1,1,0,0,10,10,1.25,4.1,7.2,1.25,4.3,8,1.35,4,8,1.36,4,8,1.33,4,8
This list is passed to pandas.read_csv()'s names parameter.
See code.
# Returns an array of the column names needed for our raw data table
def cols_to_extract():
cols_to_use = [None] * RawDataCols.COUNT
cols_to_use[RawDataCols.DATE] = 'Date'
cols_to_use[RawDataCols.HOME_TEAM] = 'HomeTeam'
cols_to_use[RawDataCols.AWAY_TEAM] = 'AwayTeam'
cols_to_use[RawDataCols.FTHG] = 'FTHG'
cols_to_use[RawDataCols.HG] = 'HG'
cols_to_use[RawDataCols.FTAG] = 'FTAG'
cols_to_use[RawDataCols.AG] = 'AG'
cols_to_use[RawDataCols.FTR] = 'FTR'
cols_to_use[RawDataCols.RES] = 'Res'
cols_to_use[RawDataCols.HTHG] = 'HTHG'
cols_to_use[RawDataCols.HTAG] = 'HTAG'
cols_to_use[RawDataCols.HTR] = 'HTR'
cols_to_use[RawDataCols.ATTENDANCE] = 'Attendance'
cols_to_use[RawDataCols.HS] = 'HS'
cols_to_use[RawDataCols.AS] = 'AS'
cols_to_use[RawDataCols.HST] = 'HST'
cols_to_use[RawDataCols.AST] = 'AST'
cols_to_use[RawDataCols.HHW] = 'HHW'
cols_to_use[RawDataCols.AHW] = 'AHW'
cols_to_use[RawDataCols.HC] = 'HC'
cols_to_use[RawDataCols.AC] = 'AC'
cols_to_use[RawDataCols.HF] = 'HF'
cols_to_use[RawDataCols.AF] = 'AF'
cols_to_use[RawDataCols.HFKC] = 'HFKC'
cols_to_use[RawDataCols.AFKC] = 'AFKC'
cols_to_use[RawDataCols.HO] = 'HO'
cols_to_use[RawDataCols.AO] = 'AO'
cols_to_use[RawDataCols.HY] = 'HY'
cols_to_use[RawDataCols.AY] = 'AY'
cols_to_use[RawDataCols.HR] = 'HR'
cols_to_use[RawDataCols.AR] = 'AR'
return cols_to_use
# Extracts raw data from the raw data csv and populates the raw match data table in the database
def extract_raw_data(csv):
# Clear the database table if it has any logs
# if MatchRawData.objects.count != 0:
# MatchRawData.objects.delete()
cols_to_use = cols_to_extract()
# Read and parse the csv file
parsed_csv = pd.read_csv(csv, delimiter=',', names=cols_to_use, header=0)
for col in cols_to_use:
values = parsed_csv[col].values
for val in values:
print(str(col) + ' --------> ' + str(val))
Where RawDataCols is an IntEnum.
class RawDataCols(IntEnum):
DATE = 0
HOME_TEAM = 1
AWAY_TEAM = 2
FTHG = 3
HG = 4
FTAG = 5
AG = 6
FTR = 7
RES = 8
...
The column names are obtained using it. That part of code works ok. The correct column name is obtained but after trying to get its values using
values = parsed_csv[col].values
pandas return the values of a wrong column. The wrong column index is around 13 indexes away from the one i am trying to get. What am i missing?
You can select column by name wise.Just use following line
values = parsed_csv[["Column Name","Column Name2"]]
Or you select Index wise by
cols = [1,2,3,4]
values = parsed_csv[parsed_csv.columns[cols]]
I am having trouble reformatting a dataframe.
My input is a day value rows by symbols columns (each symbol has different dates with it's values):
Input
code to generate input
data = [("01-01-2010", 15, 10), ("02-01-2010", 16, 11), ("03-01-2010", 16.5, 10.5)]
labels = ["date", "AAPL", "AMZN"]
df_input = pd.DataFrame.from_records(data, columns=labels)
The needed output is (month row with new row for each month):
Needed output
code to generate output
data = [("01-01-2010","29-01-2010", "AAPL", 15, 20), ("01-01-2010","29-01-2010", "AMZN", 10, 15),("02-02-2010","30-02-2010", "AAPL", 20, 32)]
labels = ['bd start month', 'bd end month','stock', 'start_month_value', "end_month_value"]
df = pd.DataFrame.from_records(data, columns=labels)
Meaning (Pseudo code)
1. for each row take only non nan values to create a new "row" (maybe dictionary with the date as the index and the [stock, value] as the value.
2. take only rows that are business start of month or business end of month.
3. write those rows to a new datatframe.
I have read several posts like this and this and several more.
All treat with dataframe of the same "type" and just resampling while I need to change to structure...
My code so far
# creating the new index with business days
df1 =pd.DataFrame(range(10000), index = pd.date_range(df.iloc[0].name, periods=10000, freq='D'))
from pandas.tseries.offsets import CustomBusinessMonthBegin
from pandas.tseries.holiday import USFederalHolidayCalendar
bmth_us = CustomBusinessMonthBegin(calendar=USFederalHolidayCalendar())
df2 = df1.resample(bmth_us).mean()
# creating the new index interseting my old one (daily) with the monthly index
new_index = df.index.intersection(df2.index)
# selecting only the rows I want
df = df.loc[new_index]
# creating a dict that will be my new dataset
new_dict = collections.OrderedDict()
# iterating over the rows and adding to dictionary
for index, row in df.iterrows():
# print index
date = df.loc[index].name
# values are the not none values
values = df.loc[index][~df.loc[index].isnull().values]
new_dict[date]=values
# from dict to list
data=[]
for key, values in new_dict.iteritems():
for i in range(0, len(values)):
date = key
stock_name = str(values.index[i])
stock_value = values.iloc[i]
row = (key, stock_name, stock_value)
data.append(row)
# from the list to df
labels = ['date','stock', 'value']
df = pd.DataFrame.from_records(data, columns=labels)
df.to_excel("migdal_format.xls")
Current output I get
One big problem:
I only get value of the stock on the start of month day.. I need start and end so I can calculate the stock gain on this month..
One smaller problem:
I am sure this is not the cleanest and fastest code :)
Thanks a lot!
So I have found a way.
looping through each column
groupby month
taking the first and last value I have in that month
calculate return
df_migdal = pd.DataFrame()
for col in df_input.columns[0:]:
stock_position = df_input.loc[:,col]
name = stock_position.name
name = re.sub('[^a-zA-Z]+', '', name)
name = name[0:-4]
stock_position=stock_position.groupby([pd.TimeGrouper('M')]).agg(['first', 'last'])
stock_position["name"] = name
stock_position["return"] = ((stock_position["last"] / stock_position["first"]) - 1) * 100
stock_position.dropna(inplace=True)
df_migdal=df_migdal.append(stock_position)
df_migdal=df_migdal.round(decimals=2)
I tried I way cooler way, but did not know how to handle the ,multi index I got... I needed that for each column, to take the two sub columns and create a third one from some lambda function.
df_input.groupby([pd.TimeGrouper('M')]).agg(['first', 'last'])