Pandas: fastest way to filter the DF by date - python

I have an efficiency question. I wrote some code to analyze a report that holds over 70k records and 400+ unique organizations, so that my supervisor can enter the year/month/date they are interested in and have it pop out the information.
The beginning of my code is:
import pandas as pd
import numpy as np
import datetime
main_data = pd.read_excel("UpdatedData.xlsx")
#column names from DF
epi_expose = "EpitheliumExposureSeverity"
sloughing = "EpitheliumSloughingPercentageSurface"
organization = "OrgName"
region = "Region"
date = "DeathOn"
#list storage of definitions
sl_list = ["",'None','Mild','Mild to Moderate']
epi_list= ['Moderate','Moderate to Severe','Severe']
#Create DF with five columns
df = main_data[[region, organization, epi_expose, sloughing, date]]
#filter it down to months
starting_date = datetime.date(2017,2,1)
ending_date = datetime.date(2017,2,28)
df = df[(df[date] > starting_date) & (df[date] < ending_date)]
I am then performing conditional filtering below to get counts by region and organization. It works, but is slow. Is there a more efficient way to query my DF and set up a DF that ONLY has the dates that it is supposed to sit between? Or is this the most efficient way without altering how the Database I am using is set up?
I can provide more of my code but if I filter it out by month before exporting to excel, the code runs in a matter of seconds so I am not concerned about the speed of it besides getting the correct date fields.
Thank you!
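One common approach, shown here as a minimal sketch that has not been benchmarked against the report in the question, is to compare the date column against pandas Timestamps over a half-open interval, so that the first and last days of the month are both included. The helper name filter_month and the sample rows are made up for illustration; the column names come from the question.

```python
import pandas as pd

def filter_month(df, date_col, year, month):
    """Return the rows of df whose date_col falls in the given calendar month."""
    start = pd.Timestamp(year=year, month=month, day=1)
    end = start + pd.DateOffset(months=1)  # first day of the next month
    # Half-open interval [start, end) keeps every timestamp inside the month
    mask = (df[date_col] >= start) & (df[date_col] < end)
    return df.loc[mask]

# Small made-up frame standing in for the real report
sample = pd.DataFrame({
    "OrgName": ["A", "B", "C", "D"],
    "DeathOn": pd.to_datetime([
        "2017-01-31 00:00", "2017-02-01 00:00",
        "2017-02-28 13:45", "2017-03-01 00:00",
    ]),
})
feb = filter_month(sample, "DeathOn", 2017, 2)
print(feb)
```

Filtering once up front like this means the later conditional counts only touch the rows for the requested month.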

Related

How to View Several Excel Rows in Python

So, I am trying to create a Python program that reads a password protected excel file. The program is intended to report any names expiring between 90 and 105 days. The problem I am running into right now is getting the program to read multiple rows. I've been using import xlrd. I was hoping that 'counter' would change the row being read, but only the first row is being read.
Edit: Solved. I was able to use the code below to get my program to display entries that are expiring within my time field.
import pandas as pd
from datetime import date, timedelta
today = date.today()
ninety_Days = (date.today()+timedelta(days=90))
hundred_Days = (date.today()+timedelta(days=105))
hundred_Days = '%s-%s-%s' % (hundred_Days.month, hundred_Days.day, hundred_Days.year)
ninety_Days = '%s-%s-%s' % (ninety_Days.month, ninety_Days.day, ninety_Days.year)
wkbk = pd.read_excel('Practice Inventory.xlsx', 'Sheet1')
mask = (wkbk['Expiration'] >= ninety_Days) & (wkbk['Expiration'] <= hundred_Days)
wkbk = wkbk.loc[mask]
print(wkbk)
Use Pandas!
import pandas as pd
df = pd.read_excel('Practice Inventory.xlsx')
# keep rows expiring in 90 to 120 days (both bounds inclusive)
days = df['days to expiration']
final_df = df[(days >= 90) & (days <= 120)]
The final_df will hold all the rows with days to expiration of at least 90 and at most 120.
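The answer above assumes the sheet already has a 'days to expiration' column. If it only has an expiration date, that column can be derived first. The following is a sketch under that assumption: the inventory rows are made up, 'Expiration' is the column name from the question, and today is pinned to a fixed date so the example is reproducible.

```python
import pandas as pd

# Made-up inventory; "Expiration" is the column name used in the question
inventory = pd.DataFrame({
    "Name": ["alpha", "beta", "gamma", "delta"],
    "Expiration": pd.to_datetime(
        ["2024-03-01", "2024-04-10", "2024-04-20", "2024-06-30"]
    ),
})

today = pd.Timestamp("2024-01-01")  # fixed date for a reproducible example

# Whole days from today until each item expires
inventory["days to expiration"] = (inventory["Expiration"] - today).dt.days

# Series.between() includes both endpoints by default
expiring = inventory[inventory["days to expiration"].between(90, 120)]
print(expiring)
```

In real use, today would come from pd.Timestamp.today().normalize() rather than a hard-coded date.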

How do I make this function iterable (getting indexerror)

I am fairly new to python and coding in general.
I have a big data file that provides daily data for the period 2011-2018 for a number of stock tickers (~300).
The data is a .csv file with circa 150k rows and looks as follows (short example):
Date,Symbol,ShortExemptVolume,ShortVolume,TotalVolume
20110103,AAWW,0.0,28369,78113.0
20110103,AMD,0.0,3183556,8095093.0
20110103,AMRS,0.0,14196,18811.0
20110103,ARAY,0.0,31685,77976.0
20110103,ARCC,0.0,177208,423768.0
20110103,ASCMA,0.0,3930,26527.0
20110103,ATI,0.0,193772,301287.0
20110103,ATSG,0.0,23659,72965.0
20110103,AVID,0.0,7211,18896.0
20110103,BMRN,0.0,21740,213974.0
20110103,CAMP,0.0,2000,11401.0
20110103,CIEN,0.0,625165,1309490.0
20110103,COWN,0.0,3195,24293.0
20110103,CSV,0.0,6133,25394.0
I have a function that allows me to filter for a specific symbol and get 10 observations before and after a specified date (could be any date between 2011 and 2018).
import pandas as pd
import datetime
def get_data(issue_date, stock_ticker):
    df = pd.read_csv(r'D:\Project\Data\Short_Interest\exampledata.csv')
    df['Date'] = pd.to_datetime(df['Date'], format="%Y%m%d")
    short = df.loc[df.Symbol.eq(stock_ticker)]
    # get the index label of the row of interest
    ix = short[short.Date.eq(issue_date)].index[0]
    # get the integer position of that row within the filtered frame
    iloc_ix = short.index.get_loc(ix)
    # get the 10 rows before and after (+11 because slices exclude the end), i.e. +/-10 trading days
    short_data = short.iloc[iloc_ix-10: iloc_ix+11]
    return [short_data]
I want to create a script that iterates over a list of 'issue_dates' and 'stock_tickers'. The list (a .csv) looks as follows:
ARAY,07/08/2017
ARAY,24/04/2014
ACETQ,16/11/2015
ACETQ,16/11/2015
NVLNA,15/08/2014
ATSG,29/09/2017
ATI,24/05/2016
MDRX,18/06/2013
MDRX,18/06/2013
AMAGX,10/05/2017
AMAGX,14/02/2014
AMD,14/09/2016
To break down my problem and question I would like to know how to do the following:
First, how do I load the inputs?
Second, how do I call the function on each input?
And last, how do I accumulate all the function returns in one dataframe?
To load the inputs and call the function for each row, iterate over the csv file, pass each row's values to the function, and accumulate the resulting DataFrame slices in a list.
I modified your function a bit: removed the DataFrame creation so it is only done once and added a try/except block to account for missing dates or tickers (your example data didn't match up too well). The dates in the second csv look like they are day/month/year so I converted them for that format.
import pandas as pd
import datetime, csv
def get_data(df, issue_date, stock_ticker):
    '''Return the rows for the ticker centered on the issue date.
    '''
    short = df.loc[df.Symbol.eq(stock_ticker)]
    # get the index label of the row of interest
    try:
        ix = short[short.Date.eq(issue_date)].index[0]
        # get the integer position of that row within the filtered frame
        iloc_ix = short.index.get_loc(ix)
        # get the 10 rows before and after (+11 because slices exclude the end);
        # max() keeps a negative start from wrapping around near the top
        short_data = short.iloc[max(iloc_ix-10, 0): iloc_ix+11]
    except IndexError:
        msg = f'no data for {stock_ticker} on {issue_date}'
        #log.info(msg)
        print(msg)
        short_data = None
    return short_data
datafile = r'D:\Project\Data\Short_Interest\exampledata.csv'  # path from the question
df = pd.read_csv(datafile)
df['Date'] = pd.to_datetime(df['Date'], format="%Y%m%d")
results = []
with open('issues.csv') as issues:
    for ticker,date in csv.reader(issues):
        # dates in issues.csv are day/month/year; a datetime compares cleanly with the datetime64 column
        date = datetime.datetime.strptime(date, '%d/%m/%Y')
        s = get_data(df,date,ticker)
        results.append(s)
        # print(s)
Creating a single DataFrame or table for all that info may be problematic, especially since the date ranges are all different. You should probably ask a separate question about that; its MCVE should include just a few minimal pandas DataFrames with a couple of different date ranges and tickers.
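If a single table is still wanted, one possible approach is to stack the per-request windows with pd.concat and tag each row with the request it came from. This is only a sketch: the two windows and the request labels below are made up, and None stands for a lookup that found no data, as in the function above.

```python
import pandas as pd

# Two made-up windows standing in for get_data() results
win_a = pd.DataFrame({
    "Date": pd.to_datetime(["2017-08-04", "2017-08-07"]),
    "Symbol": "ARAY",
    "ShortVolume": [100, 120],
})
win_b = pd.DataFrame({
    "Date": pd.to_datetime(["2016-05-23", "2016-05-24"]),
    "Symbol": "ATI",
    "ShortVolume": [300, 310],
})
results = [win_a, None, win_b]  # None marks a request with no data
labels = ["ARAY 2017-08-07", "ACETQ 2015-11-16", "ATI 2016-05-24"]

# Drop the misses, then stack; the dict keys become an outer index level
found = {label: frame for label, frame in zip(labels, results)
         if frame is not None}
combined = pd.concat(found, names=["request", "row"])
print(combined)
```

Because each window keeps its own label in the outer index level, differing date ranges do not collide, and combined.loc["ATI 2016-05-24"] recovers a single window.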

Python Pandas - convert unicode data into dataframe so I can append

I am pulling data using pytreasurydirect and I would like to query each unique cusip, then append the results to create a pandas dataframe table. I am having difficulties generating the pandas dataframe. I believe it is because of the unicode structure of the data.
import pandas as pd
from pytreasurydirect import TreasuryDirect
td = TreasuryDirect()
cusip_list = [['912796PY9','08/09/2018'],['912796PY9','06/07/2018']]
for i in cusip_list:
    cusip =''.join(i[0])
    issuedate =''.join(i[1])
    cusip_value=(td.security_info(cusip, issuedate))
    #pd.DataFrame(cusip_value.items())
    df = pd.DataFrame(cusip_value, index=['a'])
    td = td.append(df, ignore_index=False)
Example of data from pytreasurydirect :
Index([u'accruedInterestPer100', u'accruedInterestPer1000',
u'adjustedAccruedInterestPer1000', u'adjustedPrice',
u'allocationPercentage', u'allocationPercentageDecimals',
u'announcedCusip', u'announcementDate', u'auctionDate',
u'auctionDateYear',
...
u'totalTendered', u'treasuryDirectAccepted',
u'treasuryDirectTendersAccepted', u'type',
u'unadjustedAccruedInterestPer1000', u'unadjustedPrice',
u'updatedTimestamp', u'xmlFilenameAnnouncement',
u'xmlFilenameCompetitiveResults', u'xmlFilenameSpecialAnnouncement'],
dtype='object', length=116)
I think you want to define a function like this:
def securities(type):
    secs = td.security_type(type)
    keys = secs[0].keys() if secs else []
    seri = [pd.Series([sec[key] for sec in secs]) for key in keys]
    return pd.DataFrame(dict(zip(keys, seri)))
Then, use it:
df = securities('Bond')
df[['cusip', 'issueDate', 'maturityDate']].head()
to get results like these, for example (TreasuryDirect returns a lot of additional columns):
cusip issueDate maturityDate
0 912810SD1 2018-08-15T00:00:00 2048-08-15T00:00:00
1 912810SC3 2018-07-16T00:00:00 2048-05-15T00:00:00
2 912810SC3 2018-06-15T00:00:00 2048-05-15T00:00:00
3 912810SC3 2018-05-15T00:00:00 2048-05-15T00:00:00
4 912810SA7 2018-04-16T00:00:00 2048-02-15T00:00:00
At least, those are the results today. The results will change over time as bonds are issued and, alas, mature. Note the multiple issueDates per cusip.
Finally, per the TreasuryDirect website (https://www.treasurydirect.gov/webapis/webapisecurities.htm), the possible security types are: Bill, Note, Bond, CMB, TIPS, FRN.
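For the loop in the question (one lookup per cusip and issue date), a simple pattern is to collect each returned dict in a list and build the DataFrame once at the end. The sketch below assumes each lookup returns a flat dict, as the question's output suggests; fetch_security is a hypothetical stand-in for td.security_info so the example runs without network access, and the values it returns are made up.

```python
import pandas as pd

def fetch_security(cusip, issue_date):
    """Hypothetical stand-in for td.security_info(cusip, issue_date)."""
    # A real call would return a dict with many more keys
    return {"cusip": cusip, "issueDate": issue_date}

cusip_list = [["912796PY9", "08/09/2018"], ["912796PY9", "06/07/2018"]]

# One dict per lookup; each dict becomes one row
records = [fetch_security(cusip, issue_date)
           for cusip, issue_date in cusip_list]
df = pd.DataFrame(records)
print(df)
```

Building the frame from a list of dicts avoids appending inside the loop and keeps the TreasuryDirect client object separate from the results.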

Python: store a value in a variable so that you can recognize each reoccurence

If this question is unclear, I am very open to constructive criticism.
I have an excel table with about 50 rows of data, with the first column in each row being a date. I need to access all the data for only one date, and that date appears only about 1-5 times. It is the most recent date so I've already organized the table by date with the most recent being at the top.
So my goal is to store that date in a variable and then have Python look only for that variable (that date) and take only the rows corresponding to that variable. I need to use this code on hundreds of other excel files as well, so it would need to arbitrarily take the most recent date (always at the top though).
My current code below simply takes the first 5 rows because I know that's how many times this date occurs.
import os
from numpy import genfromtxt
import pandas as pd
path = 'Z:\\folderwithcsvfile'
for filename in os.listdir(path):
    file_path = os.path.join(path, filename)
    if os.path.isfile(file_path):
        broken_df = pd.read_csv(file_path)
        df3 = broken_df['DATE']
        df4 = broken_df['TRADE ID']
        df5 = broken_df['AVAILABLE STOCK']
        df6 = broken_df['AMOUNT']
        df7 = broken_df['SALE PRICE']
        print (df3)
        #print (df3.head(6))
        print (df4.head(6))
        print (df5.head(6))
        print (df6.head(6))
        print (df7.head(6))
This is a relatively simple filtering operation. You state that you want to "take only the columns" that are the latest date, so I assume that an acceptable result will be a filtered DataFrame containing just the rows for that date.
Here's a simple CSV that is similar to your structure:
DATE,TRADE ID,AVAILABLE STOCK
10/11/2016,123,123
10/11/2016,123,123
10/10/2016,123,123
10/9/2016,123,123
10/11/2016,123,123
Note that I mixed up the dates a little bit, because it's hacky and error-prone to just assume that the latest dates will be on the top. The following script will filter it appropriately:
import pandas as pd
import numpy as np
df = pd.read_csv('data.csv')
# convert the DATE column to datetimes
df['DATE'] = pd.to_datetime(df['DATE'])
# find the latest datetime
latest_date = df['DATE'].max()
# use boolean indexing to keep only the rows that equal the latest date
latest_rows = df[df['DATE'] == latest_date]
print (latest_rows)
# now you can perform your operations on latest_rows
In my example, this will print:
        DATE  TRADE ID  AVAILABLE STOCK
0 2016-10-11       123              123
1 2016-10-11       123              123
4 2016-10-11       123              123
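Since the question mentions running this over many files, the same filter can be wrapped in a small function and called once per file. This is a sketch: the name latest_rows is made up, and an in-memory CSV stands in for a real file path (pd.read_csv accepts either).

```python
import io
import pandas as pd

def latest_rows(csv_source, date_col="DATE"):
    """Return only the rows whose date equals the most recent date in the file."""
    df = pd.read_csv(csv_source)
    df[date_col] = pd.to_datetime(df[date_col])
    return df[df[date_col] == df[date_col].max()]

# In-memory CSV standing in for one of the real files
csv_text = (
    "DATE,TRADE ID,AVAILABLE STOCK\n"
    "10/11/2016,1,50\n"
    "10/10/2016,2,40\n"
    "10/11/2016,3,30\n"
)
result = latest_rows(io.StringIO(csv_text))
print(result)
```

In the directory loop from the question, latest_rows(file_path) would replace the pd.read_csv call and the fixed head(6) slices.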

Filter Excel Dataframe with Pandas

So I am doing some merges using Pandas with a name map, because the two files I want don't have exact names to merge on easily. My pdata sheet has lists of dates from 2014 to 2016, but I want to filter the sheet down to only contain dates from 1/1/2015 - 31/12/2016.
Below is the code that I currently have and I am not sure how to/if I can filter on date before the merge.
import pandas as pd
path= 'C:/Users/Rukgo/Desktop/Match thing/'
name_map = pd.read_excel(path+'name_map.xls', sheet_name=0)
Tdata = pd.read_excel(path+'2015_TXNs.xls', sheet_name=0)
pdata = pd.read_excel(path+'Pipeline.xls', sheet_name=0)
#pdata = pdata[(1/1/2015 <=pdata.date)&(pdata.date <=31/12/2015)]
merged = pd.merge(Tdata, name_map, how="left", on="Local Customer")
merged.to_excel(path+"results.xls")
mdata = pd.read_excel(path+'results.xls', sheet_name=0)
final_merge = pd.merge(mdata, pdata, how='right', on='Client')
final_merge = final_merge[final_merge.Amount_USD !=0]
final_merge.to_excel(path+"Final Results.xls")
So I had a commented out section that ended up being quite close to the actual code that I needed.
pdata = pdata[(pdata['date']>='20150101')&(pdata['date']<='20151231')]
That ended up working perfectly, though it hard-codes the dates.
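To avoid hard-coding, the bounds can be passed in as parameters. The sketch below assumes the 'date' column holds dates or date-like strings; the helper name filter_dates and the sample rows are made up for illustration.

```python
import pandas as pd

def filter_dates(frame, start, end, date_col="date"):
    """Keep rows whose date_col lies between start and end (both inclusive)."""
    dates = pd.to_datetime(frame[date_col])
    return frame[dates.between(pd.Timestamp(start), pd.Timestamp(end))]

# Made-up rows standing in for the Pipeline sheet
pdata = pd.DataFrame({
    "Client": ["A", "B", "C", "D"],
    "date": pd.to_datetime(
        ["2014-06-30", "2015-01-01", "2015-12-31", "2016-02-15"]
    ),
})
pdata_2015 = filter_dates(pdata, "2015-01-01", "2015-12-31")
print(pdata_2015)
```

Applying the filter to pdata before the final merge keeps out-of-range rows from reaching the merged result at all.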
