sample table here
I am trying to look up the corresponding commodity prices in columns (CU00.SHF, AU00.SHF, SC00.SHF, I8888.DCE, C00.DCE) for a new set of timestamps, whose dates are 32 days later than the dates in column 'history_date'.
I tried .loc and .at in a loop to extract the matching values with the functions below:
latest_day = data.iloc[data.shape[0] - 1, 0].date()
def next_trade_day(x):
    x = pd.to_datetime(x).date()  # the imported is_workday function requires a date object
    while True:
        if is_workday(x + timedelta(32)) != False:
            break
            return (pd.Timestamp((x + timedelta(32))))
        if is_workday(x + timedelta(32)) == False:
            x = x + timedelta(1)
    return pd.Timestamp(x + timedelta(32))
def end_price(x):
    x = pd.Timestamp(x)
    if x <= latest_day:
        return data.at[x,'CU00.SHF']
    if x > latest_day:
        return'None'
    return data.at[x,'CU00.SHF']
but it always gives
KeyError: Timestamp('2023-02-03 00:00:00')
Any idea how I should achieve this?
thanks in advance!
If you want to work with datetimes:
convert the column to datetime
check that the dates were converted, then filter
df['your column'] = pd.to_datetime(df['your column'], errors='ignore')
df.loc[df['your column'] > 'your-date']
If both work, then check your full code.
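One way to avoid the KeyError is to look up the shifted dates with reindex, which returns NaN for dates that are missing from the index instead of raising. This is only a minimal sketch, assuming the prices are indexed by a sorted, unique DatetimeIndex; the names `prices` and `history` and all values are made up for illustration:

```python
import pandas as pd

# Hypothetical stand-in for the price table (values are made up).
prices = pd.DataFrame(
    {"CU00.SHF": [100.0, 101.0, 102.0, 103.0]},
    index=pd.to_datetime(["2023-01-02", "2023-01-03", "2023-02-03", "2023-02-06"]),
)
history = pd.DataFrame(
    {"history_date": pd.to_datetime(["2023-01-02", "2023-01-03", "2023-01-10"])}
)

# Shift every history date forward by 32 calendar days.
target = history["history_date"] + pd.Timedelta(days=32)

# Exact match: NaN where the shifted date is not present in the index.
history["end_price"] = prices["CU00.SHF"].reindex(target).to_numpy()

# Next available row on or after the shifted date (requires a sorted index).
history["next_price"] = (
    prices["CU00.SHF"].reindex(target, method="bfill").to_numpy()
)
print(history)
```

The bfill variant stands in for the "next trade day" loop only if every trading day has a row in the price table, which is an assumption here.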
I am working on a personal project collecting the data on Covid-19 cases. The data set only shows the total number of Covid-19 cases per state cumulatively. I would like to add a column that contains the new cases added that day. This is what I have so far:
import pandas as pd
from datetime import date
from datetime import timedelta
import numpy as np
#read the CSV from github
hist_US_State = pd.read_csv("https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv")
#some code to get yesterday's date and the day before which is needed later.
today = date.today()
yesterday = today - timedelta(days = 1)
yesterday = str(yesterday)
day_before_yesterday = today - timedelta(days = 2)
day_before_yesterday = str(day_before_yesterday)
#Extracting yesterday's and the day before cases and combine them in one dataframe
yesterday_cases = hist_US_State[hist_US_State["date"] == yesterday]
day_before_yesterday_cases = hist_US_State[hist_US_State["date"] == day_before_yesterday]
total_cases = pd.DataFrame()
total_cases = day_before_yesterday_cases.append(yesterday_cases)
#Adding a new column called "new_cases" and this is where I get into trouble.
total_cases["new_cases"] = yesterday_cases["cases"] - day_before_yesterday_cases["cases"]
Can you please point out what I am doing wrong?
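Because you defined total_cases as a concatenation (via append) of yesterday_cases and day_before_yesterday_cases, its number of rows is equal to the sum of the other two dataframes. It looks like yesterday_cases and day_before_yesterday_cases both have 55 rows, and so total_cases has 110 rows. Thus your last line is trying to fill a 110-row column from two 55-row series. Note also that pandas subtracts by index label, and the two slices keep their original, non-overlapping row labels, so the subtraction produces NaN rather than the per-state difference.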
You may either want to reshape your data so that each date is its own column, or work in arrays of dataframes.
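As an alternative to reshaping, the daily increase can be computed for the whole history at once with a grouped difference. A minimal sketch, assuming the same `date`, `state` and `cases` columns as the CSV; the small sample frame and its numbers are made up:

```python
import pandas as pd

# Hypothetical sample with the same column names as us-states.csv.
hist = pd.DataFrame({
    "date": ["2020-03-01", "2020-03-01", "2020-03-02",
             "2020-03-02", "2020-03-03", "2020-03-03"],
    "state": ["Washington", "New York"] * 3,
    "cases": [10, 1, 15, 2, 25, 11],
})

# Sort so that each state's rows are in date order, then take the
# difference between consecutive cumulative totals within each state.
hist = hist.sort_values(["state", "date"])
hist["new_cases"] = hist.groupby("state")["cases"].diff()
print(hist)
```

The first row of each state has no previous day, so its new_cases value is NaN.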
I am fairly new to Python and coding in general.
I have a big data file that provides daily data for the period 2011-2018 for a number of stock tickers (~300).
The data is a .csv file with circa 150k rows and looks as follows (short example):
Date,Symbol,ShortExemptVolume,ShortVolume,TotalVolume
20110103,AAWW,0.0,28369,78113.0
20110103,AMD,0.0,3183556,8095093.0
20110103,AMRS,0.0,14196,18811.0
20110103,ARAY,0.0,31685,77976.0
20110103,ARCC,0.0,177208,423768.0
20110103,ASCMA,0.0,3930,26527.0
20110103,ATI,0.0,193772,301287.0
20110103,ATSG,0.0,23659,72965.0
20110103,AVID,0.0,7211,18896.0
20110103,BMRN,0.0,21740,213974.0
20110103,CAMP,0.0,2000,11401.0
20110103,CIEN,0.0,625165,1309490.0
20110103,COWN,0.0,3195,24293.0
20110103,CSV,0.0,6133,25394.0
I have a function that allows me to filter for a specific symbol and get 10 observations before and after a specified date (could be any date between 2011 and 2018).
import pandas as pd
from datetime import datetime
import urllib
import datetime
def get_data(issue_date, stock_ticker):
    df = pd.read_csv (r'D:\Project\Data\Short_Interest\exampledata.csv')
    df['Date'] = pd.to_datetime(df['Date'], format="%Y%m%d")
    d = df
    df = pd.DataFrame(d)
    short = df.loc[df.Symbol.eq(stock_ticker)]
    # get the index of the row of interest
    ix = short[short.Date.eq(issue_date)].index[0]
    # get the integer position of that row within the filtered frame
    iloc_ix = short.index.get_loc(ix)
    # take 10 rows before and 10 rows after (+11 because the slice end is exclusive)
    short_data = short.iloc[iloc_ix-10: iloc_ix+11]
    return [short_data]
I want to create a script that iterates over a list of 'issue_dates' and 'stock_tickers'. The list (a .csv) looks as follows:
ARAY,07/08/2017
ARAY,24/04/2014
ACETQ,16/11/2015
ACETQ,16/11/2015
NVLNA,15/08/2014
ATSG,29/09/2017
ATI,24/05/2016
MDRX,18/06/2013
MDRX,18/06/2013
AMAGX,10/05/2017
AMAGX,14/02/2014
AMD,14/09/2016
To break down my problem and question I would like to know how to do the following:
First, how do I load the inputs?
Second, how do I call the function on each input?
And last, how do I accumulate all the function returns in one dataframe?
To load the inputs and call the function for each row, iterate over the csv file, pass each row's values to the function, and accumulate the resulting DataFrame slices in a list.
I modified your function a bit: removed the DataFrame creation so it is only done once and added a try/except block to account for missing dates or tickers (your example data didn't match up too well). The dates in the second csv look like they are day/month/year so I converted them for that format.
import csv
import pandas as pd
def get_data(df, issue_date, stock_ticker):
    '''Return the rows for the ticker centered on the issue date (10 rows either side).
    '''
    short = df.loc[df.Symbol.eq(stock_ticker)]
    try:
        # get the index label of the row of interest
        ix = short[short.Date.eq(issue_date)].index[0]
        # get the integer position of that row within the filtered frame
        iloc_ix = short.index.get_loc(ix)
        # take 10 rows before and 10 after (+11 because the slice end is exclusive);
        # max() stops a negative start from wrapping around to the end of the frame
        short_data = short.iloc[max(iloc_ix-10, 0): iloc_ix+11]
    except IndexError:
        msg = f'no data for {stock_ticker} on {issue_date}'
        #log.info(msg)
        print(msg)
        short_data = None
    return short_data
datafile = r'D:\Project\Data\Short_Interest\exampledata.csv'  # path from the question
df = pd.read_csv(datafile)
df['Date'] = pd.to_datetime(df['Date'], format="%Y%m%d")
results = []
with open('issues.csv') as issues:
    for ticker, date in csv.reader(issues):
        # dates in issues.csv are day/month/year; a Timestamp compares
        # cleanly with the datetime64 'Date' column
        date = pd.to_datetime(date, format='%d/%m/%Y')
        s = get_data(df, date, ticker)
        results.append(s)
        # print(s)
Creating a single DataFrame or table for all that info may be problematic especially since the date ranges are all different. Probably should ask a separate question regarding that. Its mcve should probably just include a few minimal Pandas Series with a couple of different date ranges and tickers.
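If a single long table is acceptable, one possible approach is to tag each returned slice with the issue it belongs to and stack them with pd.concat, skipping the None entries. This is only a sketch under that assumption; the frames, the `issue_ticker` and `issue_date` column names, and all numbers below are made up:

```python
import pandas as pd

# Hypothetical stand-ins for what get_data returns: two slices and one miss.
aray = pd.DataFrame({
    "Date": pd.to_datetime(["2017-08-04", "2017-08-07"]),
    "Symbol": "ARAY",
    "ShortVolume": [100, 120],
})
amd = pd.DataFrame({
    "Date": pd.to_datetime(["2016-09-13", "2016-09-14"]),
    "Symbol": "AMD",
    "ShortVolume": [900, 950],
})
results = [aray, None, amd]
issues = [("ARAY", "07/08/2017"), ("ACETQ", "16/11/2015"), ("AMD", "14/09/2016")]

# Label every slice with its issue, drop the misses, and stack the rest.
frames = [
    frame.assign(issue_ticker=ticker, issue_date=issue_date)
    for (ticker, issue_date), frame in zip(issues, results)
    if frame is not None
]
combined = pd.concat(frames, ignore_index=True)
print(combined)
```

Because every row carries its own issue label, overlapping or differing date ranges do not collide; the cost is that the label is repeated on every row.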
I want to add a new column to a pandas df which will be calculated based on another column.
Here's a short excerpt of the df:
If the date is between start_date1 and end_date1 it should output in the period column "0". If the date is between start_date2 and end_date2 output a "1" and so on.
Is there any way to do this without a loop?
Thanks for your help :)
Larry
First of all, you need to check whether your date column has a datetime format.
You can check this with df.dtypes. If it is not a datetime type (datetime64), you have to convert it to datetime with:
df['date'] = pd.to_datetime(df.date, format='%Y%m%d', errors='ignore')
Please note that the argument errors='ignore' has its risks, so it's optional.
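The risk with errors='ignore' is that a column that fails to parse is returned unchanged, so it silently stays non-datetime. As a hedged alternative, errors='coerce' turns unparseable entries into NaT, which makes the failures countable. The sample values below are made up:

```python
import pandas as pd

# Hypothetical column with one impossible date (30 February).
df = pd.DataFrame({"date": ["20200115", "20200230", "20200315"]})

# Invalid entries become NaT instead of blocking the conversion.
df["date"] = pd.to_datetime(df["date"], format="%Y%m%d", errors="coerce")

print(df.dtypes)                # the column is now a datetime64 type
print(df["date"].isna().sum())  # number of entries that failed to parse
```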
Now to make your calculated field, you can do this:
# define function to calculate periods based on date
def calculate_period(row):
    # chained comparisons are used because `a > b & c < d` would
    # evaluate `b & c` first and fail for dates
    if start_date1 < row['date'] < end_date1:
        return "0"
    elif start_date2 < row['date'] < end_date2:
        return "1"
    elif start_date3 < row['date'] < end_date3:
        return "2"
    else:
        return "unknown"
# apply function to create the new column
df['period'] = df.apply(calculate_period, axis=1)
If you need more period values, you can extend the elif statements as you like.
Since there is little information about your data, I assumed that start_date1 and end_date1 are variables you defined.
If these are columns as well, the function would look like this:
# define function to calculate periods based on date
def calculate_period(row):
    # chained comparisons are used because `a > b & c < d` would
    # evaluate `b & c` first and fail for dates
    if row['start_date1'] < row['date'] < row['end_date1']:
        return "0"
    elif row['start_date2'] < row['date'] < row['end_date2']:
        return "1"
    elif row['start_date3'] < row['date'] < row['end_date3']:
        return "2"
    else:
        return "unknown"
# apply function to create the new column
df['period'] = df.apply(calculate_period, axis=1)
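Since the question asks for a way without a loop, and df.apply with axis=1 still calls the function once per row, a vectorized variant may be worth considering. A minimal sketch using numpy.select, assuming the start and end dates are plain variables; the dates and the sample frame are made up:

```python
import numpy as np
import pandas as pd

# Hypothetical data and period boundaries.
df = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-15", "2020-02-15", "2020-03-15", "2021-01-01"])
})
start_date1, end_date1 = pd.Timestamp("2020-01-01"), pd.Timestamp("2020-01-31")
start_date2, end_date2 = pd.Timestamp("2020-02-01"), pd.Timestamp("2020-02-29")
start_date3, end_date3 = pd.Timestamp("2020-03-01"), pd.Timestamp("2020-03-31")

# One boolean mask per period; the parentheses are required around each
# comparison because & binds tighter than > and <.
conditions = [
    (df["date"] > start_date1) & (df["date"] < end_date1),
    (df["date"] > start_date2) & (df["date"] < end_date2),
    (df["date"] > start_date3) & (df["date"] < end_date3),
]
df["period"] = np.select(conditions, ["0", "1", "2"], default="unknown")
print(df)
```

The first matching condition wins, which mirrors the if/elif order of the function above.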
Good luck.
I just wrote this function to calculate a person's age based on two columns in a pandas DataFrame. Unfortunately, if I use return, the function returns the same value for all rows, but if I use the print statement, the function gives me the right values.
Here is the code:
def calc_age(dataset):
    index = dataset.index
    for element in index:
        year_nasc = train['DT_NASCIMENTO_BENEFICIARIO'][element][6:]
        year_insc = train['ANO_CONCESSAO_BOLSA'][element]
        age = int(year_insc) - int(year_nasc)
        print ('Age: ', age)
        #return age
Sample values:
train['DT_NASCIMENTO_BENEFICIARIO'] = '03-02-1987'
train['ANO_CONCESSAO_BOLSA'] = 2009
What am I doing wrong?!
If what you want is to subtract the year of DT_NASCIMENTO_BENEFICIARIO from ANO_CONCESSAO_BOLSA, and df is your DataFrame:
# cast to datetime
df["DT_NASCIMENTO_BENEFICIARIO"] = pd.to_datetime(df["DT_NASCIMENTO_BENEFICIARIO"])
df["age"] = df["ANO_CONCESSAO_BOLSA"] - df["DT_NASCIMENTO_BENEFICIARIO"].dt.year
# print the result, or do something else with it:
print(df["age"])
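As for why return gave a single value: a return inside the for loop exits the function on the first iteration, so only the first row's age is ever computed. If the birth dates are stored as strings whose last four characters are the year, as the [6:] slice in the question suggests, the year can also be taken directly from the string without parsing the date. A minimal sketch under that assumption, with made-up sample rows:

```python
import pandas as pd

# Hypothetical sample in the layout described in the question.
train = pd.DataFrame({
    "DT_NASCIMENTO_BENEFICIARIO": ["03-02-1987", "15-10-1985"],
    "ANO_CONCESSAO_BOLSA": [2009, 2012],
})

# Take the characters from position 6 onward (the year) for every row at once.
birth_year = train["DT_NASCIMENTO_BENEFICIARIO"].str[6:].astype(int)
train["age"] = train["ANO_CONCESSAO_BOLSA"] - birth_year
print(train["age"])
```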