First occurrence with Datetime index - python

I have a DateTime-indexed data frame called Roll_Max. It gives the past 2520-day max (price) for each day from 1/1/11 to 1/1/21. How do I add a column to it which has the date each maximum first appeared?
dfsp = pdr.get_data_yahoo('^GSPC', start='2011-01-01', end='2021-01-01')['Adj Close']
window = 2520
Roll_Max = dfsp.rolling(window, min_periods=1).max()
For example, the second row should have another column saying "2011-01-03".

You can try this, assuming price is a pd.Series
price_argmax = price.reset_index(drop=True).rolling(2520, min_periods=130).apply(lambda x: x.index[x.argmax()])
price_argmax = pd.Series(price.index[price_argmax.dropna().astype("int")], index = price.index[price_argmax.dropna().index]).reindex(price.index)
If you need this to be fast then you might want to write a loop with numba/numpy.
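A plain numpy loop along those lines might look like the sketch below. The function name `rolling_first_max_date` and the toy data are made up for illustration. It assumes the series has no NaN values and treats short windows as `min_periods=1`, as in the question. `np.argmax` returns the first position of the maximum, so ties resolve to the earliest date in the window.

```python
import numpy as np
import pandas as pd


def rolling_first_max_date(price: pd.Series, window: int) -> pd.Series:
    """For each row, return the date on which the trailing-window max first occurred."""
    values = price.to_numpy()
    positions = np.empty(len(values), dtype=np.int64)
    for i in range(len(values)):
        start = max(0, i - window + 1)  # shorter window at the start (min_periods=1)
        # np.argmax gives the first position of the maximum inside the window
        positions[i] = start + np.argmax(values[start:i + 1])
    return pd.Series(price.index[positions], index=price.index, name="max_date")


# toy data, not real prices
price = pd.Series(
    [1.0, 3.0, 2.0, 3.0, 1.0, 0.5],
    index=pd.date_range("2011-01-03", periods=6),
)
max_dates = rolling_first_max_date(price, window=3)
print(max_dates)
```

The same loop body could be decorated with numba if the pure Python loop turns out to be too slow.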

Related

How to get mean values per specific time range pandas

I want to calculate the average temperature per day within a specific time frame, for example the average temperature on date = 2021-03-11 between time = 14:23:00 and 16:53:00. I tried the following with a pivot_table:
in_range_df = pd.pivot_table(df, values=['Air_temp'],
index=['Date','Time'],
aggfunc={'Air_temp':np.mean})
The dataframe looks like this:
But how can I specify the time range now? Any help is appreciated.
Usually you can use between :
df[df["Datum"].between('2021-03-11 14:23:00',
'2021-03-11 16:53:00',)]["Air_temp"].mean()
First you should convert Datum column to datetime type:
df['Datum'] = pd.to_datetime(df['Datum'])
Then you can use df.loc like this:
t1 = '2021-03-11 14:23:00'
t2 = '2021-03-11 16:53:00'
df.loc[(df['Datum'] > t1) & (df['Datum'] < t2), 'Air_temp'].mean()
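If the same time-of-day window is needed for every day rather than one fixed date, a sketch with `between_time` could look like this. The sample values are invented, and the column names `Datum` and `Air_temp` are taken from the answers above. `between_time` requires a DatetimeIndex, which is why the column is set as the index first.

```python
import pandas as pd

# invented sample data
df = pd.DataFrame({
    "Datum": pd.to_datetime([
        "2021-03-11 14:00:00", "2021-03-11 14:30:00", "2021-03-11 16:00:00",
        "2021-03-12 15:00:00", "2021-03-12 17:00:00",
    ]),
    "Air_temp": [10.0, 12.0, 14.0, 20.0, 30.0],
})

# keep only rows whose time of day falls inside the window, on any date
in_window = df.set_index("Datum").between_time("14:23", "16:53")

# one mean per calendar day
daily_mean = in_window["Air_temp"].groupby(in_window.index.normalize()).mean()
print(daily_mean)
```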

Column name disappears after resampling dataframe

I have a dataframe in which I need to calculate the mean of the ozone values for every 8 hours. The problem is that the column I resample on ('readable time') disappears and cannot be referenced after the resampling.
import pandas as pd
data = pd.read_csv("o3_new.csv")
del data['latitude']
del data['longitude']
del data['altitude']
sensor_name = "o3"
data['readable time'] = pd.to_datetime(data['readable time'], dayfirst=True)
data = data.resample('480min', on='readable time').mean() # 8h mean
data[str(sensor_name) + "_aqi"] = ""
for i in range(len(data)):
    # calculate_aqi is defined elsewhere; .loc avoids chained assignment
    data.loc[data.index[i], str(sensor_name) + "_aqi"] = calculate_aqi(sensor_name, data[sensor_name].iloc[i])
print(data['readable time'])  # throws KeyError
Where o3_new.csv is like this:
,time,latitude,longitude,altitude,o3,readable time,day
0,1591037392,45.645893,25.599471,576.38,39.4,1/6/2020 21:49,1/6/2020
1,1591037452,45.645893,25.599471,576.64,48.4,1/6/2020 21:50,1/6/2020
2,1591037512,45.645893,25.599471,576.56,53.4,1/6/2020 21:51,1/6/2020
3,1591037572,45.645893,25.599471,576.64,36.4,1/6/2020 21:52,1/6/2020
4,1591037632,45.645893,25.599471,576.73,50.4,1/6/2020 21:53,1/6/2020
5,1591037692,45.645893,25.599471,577.09,37.4,1/6/2020 21:54,1/6/2020
What to do to keep referencing the 'readable time' column after resampling?
What would you like the column to contain? A mean does not make much sense for a time column. Also, the resampler makes your on column the index, so data.reset_index(inplace=True) should give you the column back.
Or you can still access the values through data.index directly after the resample.
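A small sketch of both suggestions, using a few invented rows in the shape of the question's file. Only the `o3` column is averaged here, which is an assumption made to keep the example short.

```python
import pandas as pd

# invented rows in the shape of o3_new.csv
data = pd.DataFrame({
    "readable time": pd.to_datetime(
        ["1/6/2020 21:49", "1/6/2020 21:50", "2/6/2020 06:10"], dayfirst=True
    ),
    "o3": [39.4, 48.4, 53.4],
})

# resampling with on= moves 'readable time' into the index
resampled = data.resample("480min", on="readable time")["o3"].mean()
print(resampled.index)       # the bin start times live here

# reset_index turns the index back into an ordinary column
restored = resampled.reset_index()
print(restored["readable time"])
```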

How do I make this function iterable (getting indexerror)

I am fairly new to python and coding in general.
I have a big data file that provides daily data for the period 2011-2018 for a number of stock tickers (~300).
The data is a .csv file with circa 150k rows and looks as follows (short example):
Date,Symbol,ShortExemptVolume,ShortVolume,TotalVolume
20110103,AAWW,0.0,28369,78113.0
20110103,AMD,0.0,3183556,8095093.0
20110103,AMRS,0.0,14196,18811.0
20110103,ARAY,0.0,31685,77976.0
20110103,ARCC,0.0,177208,423768.0
20110103,ASCMA,0.0,3930,26527.0
20110103,ATI,0.0,193772,301287.0
20110103,ATSG,0.0,23659,72965.0
20110103,AVID,0.0,7211,18896.0
20110103,BMRN,0.0,21740,213974.0
20110103,CAMP,0.0,2000,11401.0
20110103,CIEN,0.0,625165,1309490.0
20110103,COWN,0.0,3195,24293.0
20110103,CSV,0.0,6133,25394.0
I have a function that allows me to filter for a specific symbol and get 10 observations before and after a specified date (could be any date between 2011 and 2018).
import pandas as pd
import datetime
def get_data(issue_date, stock_ticker):
    df = pd.read_csv(r'D:\Project\Data\Short_Interest\exampledata.csv')
    df['Date'] = pd.to_datetime(df['Date'], format="%Y%m%d")
    d = df
    df = pd.DataFrame(d)
    short = df.loc[df.Symbol.eq(stock_ticker)]
    # get the index label of the row of interest
    ix = short[short.Date.eq(issue_date)].index[0]
    # get the integer position of that row
    iloc_ix = short.index.get_loc(ix)
    # get the 10 rows before and after it (+11 because the slice end is exclusive), i.e. +/-10 trading days
    short_data = short.iloc[iloc_ix-10: iloc_ix+11]
    return [short_data]
I want to create a script that iterates over a list of 'issue_dates' and 'stock_tickers'. The list (a .csv) looks as follows:
ARAY,07/08/2017
ARAY,24/04/2014
ACETQ,16/11/2015
ACETQ,16/11/2015
NVLNA,15/08/2014
ATSG,29/09/2017
ATI,24/05/2016
MDRX,18/06/2013
MDRX,18/06/2013
AMAGX,10/05/2017
AMAGX,14/02/2014
AMD,14/09/2016
To break down my problem and question I would like to know how to do the following:
First, how do I load the inputs?
Second, how do I call the function on each input?
And last, how do I accumulate all the function returns in one dataframe?
To load the inputs and call the function for each row, iterate over the csv file, pass each row's values to the function, and accumulate the results in a list.
I modified your function a bit: removed the DataFrame creation so it is only done once and added a try/except block to account for missing dates or tickers (your example data didn't match up too well). The dates in the second csv look like they are day/month/year so I converted them for that format.
import csv
import datetime
import pandas as pd
def get_data(df, issue_date, stock_ticker):
    '''Return the rows for the ticker centered on the issue date.
    '''
    short = df.loc[df.Symbol.eq(stock_ticker)]
    try:
        # get the index label of the row of interest
        ix = short[short.Date.eq(issue_date)].index[0]
        # get the integer position of that row
        iloc_ix = short.index.get_loc(ix)
        # take 10 rows before and 10 after (+11 because the slice end is exclusive);
        # clamp the start at 0 so a negative start does not wrap around
        short_data = short.iloc[max(iloc_ix-10, 0): iloc_ix+11]
    except IndexError:
        msg = f'no data for {stock_ticker} on {issue_date}'
        #log.info(msg)
        print(msg)
        short_data = None
    return short_data
df = pd.read_csv(datafile)  # datafile: path to the daily data csv
df['Date'] = pd.to_datetime(df['Date'], format="%Y%m%d")
results = []
with open('issues.csv') as issues:
    for ticker, date in csv.reader(issues):
        # dates in the issues file are day/month/year; a full datetime
        # compares cleanly against the datetime64 'Date' column
        issue_date = datetime.datetime.strptime(date, r'%d/%m/%Y')
        s = get_data(df, issue_date, ticker)
        results.append(s)
        # print(s)
Creating a single DataFrame or table for all that info may be problematic especially since the date ranges are all different. Probably should ask a separate question regarding that. Its mcve should probably just include a few minimal Pandas Series with a couple of different date ranges and tickers.
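If one table is still wanted, one possible approach is to label each result and stack them with `pd.concat`, skipping the lookups that returned None. This is only a sketch: the `results` list and the `labels` below are invented stand-ins for what the loop above would collect, and the `event` column name is made up for illustration.

```python
import pandas as pd

# invented stand-ins for the slices returned by get_data
results = [
    pd.DataFrame({"Date": pd.to_datetime(["2017-08-04", "2017-08-07"]),
                  "Symbol": "ARAY", "ShortVolume": [100, 120]}),
    None,  # a lookup that found no data
    pd.DataFrame({"Date": pd.to_datetime(["2016-09-14"]),
                  "Symbol": "AMD", "ShortVolume": [900]}),
]
# one label per input row (ticker plus issue date), hypothetical naming
labels = ["ARAY 2017-08-07", "ACETQ 2015-11-16", "AMD 2016-09-14"]

# drop the failed lookups, keeping labels and frames aligned
found_labels = [lab for lab, frame in zip(labels, results) if frame is not None]
found_frames = [frame for frame in results if frame is not None]

# keys= tags every row with the event it belongs to
combined = (
    pd.concat(found_frames, keys=found_labels, names=["event"])
    .reset_index(level="event")
)
print(combined)
```

Because every row carries its event label, the differing date ranges can sit in the same table without being confused with each other.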

How to change value into date format and How to change value into time format

I have two columns one with values that represents time and another with values that represent a date (both values are in floating type), I have the following data in each column:
df['Time']
540.0
630.0
915.0
1730.0
2245.0
df['Date']
14202.0
14202.0
14203.0
14203.0
I need to create new columns with the correct data format for these two columns, to be able to analyze data with date and time in distinct columns.
For ['Time'] I need to convert the format to:
540.0 = 5h40 OR TO 5.40 am
2245.0 = 22h45 OR TO 10.45 pm
For ['Date'], I need to convert the format to:
Each number we can say that represent "days":
where 0 ("days") = 01-01-1980
So if I add 01-01-1980 to 14202.0 = 18-11-1938
and if I add: 01-01-1980 + 14203.0 = 19-11-1938,
This is possible in Excel, but I need a way to do it in Python.
I tried different types of code but nothing works, for example, one of the codes that I tried was the one below:
# creating a variable with the data in column ['Date'] adding the days into the date:
Time1 = pd.to_datetime(df["Date"])
# When I print it is possible to see that 14203 in row n.55384 is added at the end of the date created but including time, and is not what I want:
print(Time1.loc[[55384]])
55384 1970-01-01 00:00:00.000014203
Name: Date, dtype: datetime64[ns]
# printing the same row (55384) to check the value 14203.0, that was added above:
print(df["Date"].loc[[55384]])
55384 14203.0
Name: Date, dtype: float64
For ['Time'] I have the same problem: I can't have a time without a date. I also tried to insert ':', but it does not work even after converting the data type to string.
I hope someone can help me with this; if anything is unclear, please let me know, as it is not always easy to explain.
Regarding the time conversion:
# change to integer
tt= [int(i) for i in df['Time']]
# convert to time
time_ = pd.to_datetime(tt,format='%H%M').time
# convert from 24 hour, to 12 hour time format
[t.strftime("%I:%M %p") for t in time_]
Solving problems with Date
from datetime import datetime
from datetime import timedelta
startdate_string = "1980/01/01"  # start date in string format
startdate_object = datetime.strptime(startdate_string, "%Y/%m/%d").date()  # convert the string to a date object with strptime
startdate_object  # display startdate_object to check the date
Creating a list to add to the dataframe as a new column in date format
import math
datenew = []
dates = df['UTS_Date']  # data from the original column 'UTS_Date'
for values in dates:
    # accept null values and append them to the new list
    if math.isnan(values):
        datenew.append('NaN')
        continue
    # add the delta (the value in each row of the column) to the reference date (startdate_object)
    currentdate1 = startdate_object + timedelta(days=float(values))
    # convert the date to a string (dropping the datetime.date wrapper) and append it to the list
    datenew.append(str(currentdate1))
print(len(datenew))  # check the length of the new list, to ensure that all rows of the data are in it
df.insert(3, 'Date', datenew)  # create a new column in the data frame for the date format
Solving problems with Time
timenew = []  # creating a new list
times = df['Time']  # the column df['Time'] of the dataframe
def Normalize_time(val):
    # integer part of val / 100 gives the hours
    hours = int(val / 100)
    # remove the hours and keep just the minutes
    minutes = int(val) - hours * 100
    # wrap values of 24h and above back into the 0-23 range
    hours = hours % 24
    # zfill pads with zeros on the left until the string has
    # two characters, for both hours and minutes
    return str(hours).zfill(2) + ':' + str(minutes).zfill(2)
Creating a for statement to add all the values to the new list, using the function Normalize_time()
for values in times:
    # accept null values and append them to the new list
    if math.isnan(values):
        timenew.append('NaN')
        continue
    # pass the value to the function Normalize_time()
    timestr = Normalize_time(values)
    # append each value to the new list
    timenew.append(timestr)
print(len(timenew))  # check the length of the new list, to ensure that all rows of the data are in it
df.insert(4, 'ODTime', timenew)  # create a new column in the data frame
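A shorter variant of the same idea is sketched below. It lets `pd.to_datetime` do the day arithmetic through its `unit` and `origin` parameters. The sample frame, the new column names and the helper `hhmm_to_text` are invented for illustration. Note that with 0 = 1980-01-01 the value 14202 lands on 2018-11-19; the 1938 dates quoted in the question match Excel's default 1900-based serial dates rather than a 1980 origin.

```python
import pandas as pd

# invented sample in the shape of the question's columns
df = pd.DataFrame({"Date": [14202.0, 14203.0, None],
                   "Time": [540.0, 2245.0, None]})

# day counts measured from 1980-01-01; missing values become NaT
df["Date_fmt"] = pd.to_datetime(df["Date"], unit="D", origin="1980-01-01")


def hhmm_to_text(value):
    """Turn a float such as 540.0 into '05:40'; keep missing values missing."""
    if pd.isna(value):
        return None
    hours, minutes = divmod(int(value), 100)
    return f"{hours % 24:02d}:{minutes:02d}"  # % 24 wraps values of 2400 and above


df["Time_fmt"] = df["Time"].map(hhmm_to_text)
print(df)
```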

Pandas and stocks: From daily values (in columns) to monthly values (in rows)

I am having trouble reformatting a dataframe.
My input has one row per day and one column per symbol (each symbol has different dates with its values):
Input
code to generate input
data = [("01-01-2010", 15, 10), ("02-01-2010", 16, 11), ("03-01-2010", 16.5, 10.5)]
labels = ["date", "AAPL", "AMZN"]
df_input = pd.DataFrame.from_records(data, columns=labels)
The needed output is (month row with new row for each month):
Needed output
code to generate output
data = [("01-01-2010","29-01-2010", "AAPL", 15, 20), ("01-01-2010","29-01-2010", "AMZN", 10, 15),("02-02-2010","30-02-2010", "AAPL", 20, 32)]
labels = ['bd start month', 'bd end month','stock', 'start_month_value', "end_month_value"]
df = pd.DataFrame.from_records(data, columns=labels)
Meaning (Pseudo code)
1. For each row, take only the non-NaN values to create a new "row" (maybe a dictionary with the date as the index and [stock, value] as the value).
2. Take only the rows that fall on the business start of month or the business end of month.
3. Write those rows to a new dataframe.
I have read several posts like this and this and several more.
They all deal with dataframes of the same "type" and just resample, while I need to change the structure...
My code so far
import collections
# creating the new index with business days
df1 = pd.DataFrame(range(10000), index=pd.date_range(df.iloc[0].name, periods=10000, freq='D'))
from pandas.tseries.offsets import CustomBusinessMonthBegin
from pandas.tseries.holiday import USFederalHolidayCalendar
bmth_us = CustomBusinessMonthBegin(calendar=USFederalHolidayCalendar())
df2 = df1.resample(bmth_us).mean()
# creating the new index by intersecting my old one (daily) with the monthly index
new_index = df.index.intersection(df2.index)
# selecting only the rows I want
df = df.loc[new_index]
# creating a dict that will be my new dataset
new_dict = collections.OrderedDict()
# iterating over the rows and adding to dictionary
for index, row in df.iterrows():
    # print index
    date = df.loc[index].name
    # values are the non-null values
    values = df.loc[index][~df.loc[index].isnull().values]
    new_dict[date] = values
# from dict to list
data = []
for key, values in new_dict.items():  # .iteritems() exists only in Python 2
    for i in range(0, len(values)):
        date = key
        stock_name = str(values.index[i])
        stock_value = values.iloc[i]
        row = (key, stock_name, stock_value)
        data.append(row)
# from the list to df
labels = ['date', 'stock', 'value']
df = pd.DataFrame.from_records(data, columns=labels)
df.to_excel("migdal_format.xls")
Current output I get
One big problem:
I only get value of the stock on the start of month day.. I need start and end so I can calculate the stock gain on this month..
One smaller problem:
I am sure this is not the cleanest and fastest code :)
Thanks a lot!
So I have found a way.
looping through each column
groupby month
taking the first and last value I have in that month
calculate return
import re
frames = []
for col in df_input.columns[0:]:
    stock_position = df_input.loc[:, col]
    name = stock_position.name
    name = re.sub('[^a-zA-Z]+', '', name)
    name = name[0:-4]
    # pd.TimeGrouper was removed from pandas; pd.Grouper(freq='M') replaces it
    # (newer pandas versions spell the month-end alias 'ME')
    stock_position = stock_position.groupby([pd.Grouper(freq='M')]).agg(['first', 'last'])
    stock_position["name"] = name
    stock_position["return"] = ((stock_position["last"] / stock_position["first"]) - 1) * 100
    stock_position.dropna(inplace=True)
    frames.append(stock_position)
# DataFrame.append was removed in pandas 2.0, so concatenate once at the end
df_migdal = pd.concat(frames)
df_migdal = df_migdal.round(decimals=2)
I tried a way cooler approach, but did not know how to handle the MultiIndex I got... For each column, I needed to take the two sub-columns and create a third one from some lambda function.
df_input.groupby([pd.Grouper(freq='M')]).agg(['first', 'last'])
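One way to handle the MultiIndex columns that `agg(['first', 'last'])` produces is to pull each sub-column out with `xs` and do the arithmetic on the two resulting frames. The sketch below uses invented prices and assumes the dates are already the index. It groups by `index.to_period('M')` instead of a Grouper, which is a substitution made here to keep the example independent of frequency alias changes between pandas versions.

```python
import pandas as pd

# invented prices, dates as the index
idx = pd.to_datetime(["2010-01-04", "2010-01-29", "2010-02-01", "2010-02-26"])
df_input = pd.DataFrame(
    {"AAPL": [15.0, 20.0, 20.0, 32.0], "AMZN": [10.0, 15.0, 15.0, 12.0]},
    index=idx,
)

# columns become a MultiIndex: (stock, 'first') and (stock, 'last')
monthly = df_input.groupby(df_input.index.to_period("M")).agg(["first", "last"])

# xs selects one sub-column for every stock at once
first = monthly.xs("first", axis=1, level=1)
last = monthly.xs("last", axis=1, level=1)
returns = (last / first - 1) * 100  # percent gain per stock and month

# long format: one row per (month, stock)
tidy = (
    returns.rename_axis("month")
    .reset_index()
    .melt(id_vars="month", var_name="stock", value_name="return")
)
print(tidy)
```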
