I'm trying to get real (inflation-adjusted) prices for my data in pandas. Right now I am just playing with one year's worth of data (3,962,050 rows), and it took me 443 seconds to inflate the values using the code below. Is there a quicker way to find the real value? Is it possible to use pooling? I have many more years, and it would take too long to wait every time.
Portion of df:
   year  quarter    fare
0  1994        1  213.98
1  1994        1  214.00
2  1994        1  214.00
3  1994        1  214.50
4  1994        1  214.50
import time
import cpi
import pandas as pd

def inflate_column(data, column):
    """
    Adjust the series of values in `column` of the dataframe `data`
    for inflation, using the cpi library.
    """
    print('Beginning to inflate ' + column)
    start_time = time.time()
    df = data.apply(lambda x: cpi.inflate(x[column], x.year), axis=1)
    print("Inflating process took", time.time() - start_time, "seconds to run")
    return df
df['real_fare'] = inflate_column(df, 'fare')
You have multiple values for each year, so you can call cpi.inflate once per year, store the factors in a dict, and then look the value up instead of calling cpi.inflate every time.
all_years = df["year"].unique()
dict_years = {}
for year in all_years:
    dict_years[year] = cpi.inflate(1.0, year)
df['real_fare'] = df.apply(lambda row: dict_years[row['year']] * row['fare'], axis=1)
You can fill that last line using apply as above, or do it in some other, faster way like df['real_fare'] = df['fare'] * ... (see the vectorized sketch below).
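For example, a minimal vectorized sketch (assuming the dict_years built above); this avoids per-row Python calls entirely, which is where most of the 443 seconds went:
# map each row's year to its precomputed inflation factor, then multiply elementwise
df['real_fare'] = df['fare'] * df['year'].map(dict_years)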
I have a three-column dataframe as follows. I want to calculate the three-month return per day for every fund, so I need to get the date with recorded NAV data from three months earlier. Should I use the max() function with the filter() function to deal with this problem? If so, how? If not, could you please help me figure out a better way to do this?
fund code   date         NAV
fund 1      2021-01-04   1.0000
fund 1      2021-01-05   1.0001
fund 1      2021-01-06   1.0023
...         ...          ...
fund 2      2020-02-08   1.0000
fund 2      2020-02-09   0.9998
fund 2      2020-02-10   1.0001
...         ...          ...
fund 3      2022-05-04   2.0021
fund 3      2022-05-05   2.0044
fund 3      2022-05-06   2.0305
I tried to combine the max() function with filter() as follows:
max(filter(lambda x: x<=df['date']-timedelta(days=91)))
But it didn't work.
Were this in Excel, I know I could use the following array formulas to solve this problem:
{max(if(B:B<=B2-91,B:B))}
{max(if(B:B<=B3-91,B:B))}
{max(if(B:B<=B4-91,B:B))}
....
But with Python, I don't know what I could do. I just learnt it three days ago. Please help me.
This picture shows what I want if it were in Excel. The yellow area is the original data, the white part is the intermediate calculation I need, and the red part is the result I want. To get this result, I need to divide the 3rd column by the 5th column.
I know that I could use the pct_change(periods=7) function to get the same results as in this picture. But here is the tricky part: the row 7 rows before is not necessarily the data from 7 days before, and not all the funds are recorded daily. Some funds are recorded weekly, some monthly. So I need to check first whether the data used for the division exists.
What you need is an implementation of the maximum in a sliding window (for your example 1 week, i.e. 7 days).
I recreated your example as follows (to create the data frame you have):
import pandas as pd
import datetime
from random import randint

# Build the rows first, then construct the frame in one go
# (DataFrame.append was removed in pandas 2.0).
rows = []
date = datetime.datetime.strptime("2021-01-04", '%Y-%m-%d')
for i in range(10):
    rows.append({"fund code": 'fund 1', "date": date + datetime.timedelta(i), "NAV": randint(0, 10)})
for i in range(20, 25):
    rows.append({"fund code": 'fund 1', "date": date + datetime.timedelta(i), "NAV": randint(0, 10)})
for i in range(20, 25):
    rows.append({"fund code": 'fund 2', "date": date + datetime.timedelta(i), "NAV": randint(0, 10)})
df = pd.DataFrame(rows, columns=["fund code", "date", "NAV"])
This will look like your example, with non-continuous dates and two different funds.
A maximum sliding window (with a variable window length in days) looks like this:
from collections import deque

class max_queue:
    def __init__(self, win=7):
        self.win = win
        self.queue = deque()
        self.date = None

    def append(self, date, value):
        # drop smaller values from the back: they can never be the max again
        while self.queue and value > self.queue[-1][1]:
            self.queue.pop()
        # drop entries from the front that have fallen out of the window
        while self.queue and date - self.queue[0][0] >= datetime.timedelta(self.win):
            self.queue.popleft()
        self.queue.append((date, value))
        self.date = date

    def get_max(self):
        return self.queue[0][1]
Now you can simply iterate over the rows and get the max value in the timeframe you are interested in:
mq = max_queue(7)
pre_code = ''
for idx, row in df.iterrows():
    code, date, nav, *_ = row
    if code != pre_code:
        # new fund: start a fresh window
        mq = max_queue(7)
        pre_code = code
    mq.append(date, nav)
    df.at[idx, 'max'] = mq.get_max()
The results will look like this, with an added max column. This assumes that each fund's rows are contiguous, but you could also modify it to keep a separate max_queue per fund.
Using a max queue to keep track of only the maximum in the window gives the correct O(n) complexity for a solution, which is important if you are dealing with huge datasets, and especially with bigger date ranges (instead of a week).
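As an aside, pandas' built-in time-aware rolling can compute the same windowed maximum; a minimal sketch, assuming the example frame built above:
# sort so each fund's rows are in date order, then take a rolling
# 7-day max per fund; the result aligns with the sorted frame
df = df.sort_values(["fund code", "date"])
df["max"] = (df.set_index("date")
               .groupby("fund code")["NAV"]
               .rolling("7D").max()
               .to_numpy())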
Running this code produces the error message:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
(That warning usually means df was itself created as a slice of another DataFrame; taking an explicit copy with df = df.copy() before the loop is the usual fix.)
I have 6 years' worth of competitors' results from a half marathon in one csv file.
The function year_runners aims to create a new column for each year with the difference in finishing time between each runner.
Is there a more efficient way of producing the same result?
Thanks in advance.
Pos  Gun_Time             Chip_Time            Name          Number  Category
1    1900-01-01 01:19:15  1900-01-01 01:19:14  Steve Hodges  324     Senior Male
2    1900-01-01 01:19:35  1900-01-01 01:19:35  Theo Bately   92      Supervet Male
# Calculate the time difference between each finisher in a year
# and add the result into a new column called time_diff.
def year_runners(year, x, y):
    print('Event held in', year)
    # x is the first number (position) for the runner of that year,
    # y is the last number (position) for that year, e.g. the 2016 event spans df[246:534]
    time_diff = 0
    for index, row in df.iterrows():
        # Using gun time as the start time for all, and chip time as each runner's
        # finishing time, work out the time difference between the x-placed
        # runner and the runner behind (x + 1).
        time_diff = df2015.loc[(x + 1), 'Gun_Time'] - df2015.loc[(x), 'Chip_Time']
        # set the time_diff column to the value of time_diff for each row x
        df2015.loc[x, 'time_diff'] = time_diff
        print("Runner", (x + 1), "time, minus runner", x, "=", time_diff)
        x += 1
        if x > y:
            break
Hi everyone, this was solved using the shift technique.
youtube.com/watch?v=nZzBj6n_abQ
df2015['shifted_Chip_Time'] = df2015['Chip_Time'].shift(1)
df2015['time_diff'] = df2015['Gun_Time'] - df2015['shifted_Chip_Time']
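A sketch generalizing this to all six years at once, assuming the combined frame has a year column (the name 'Year' here is hypothetical) and is ordered by finishing position:
# shift within each year so the first finisher of a year has no predecessor
df['shifted_Chip_Time'] = df.groupby('Year')['Chip_Time'].shift(1)
df['time_diff'] = df['Gun_Time'] - df['shifted_Chip_Time']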
I have a .txt file with three columns: time, ticker, price. The times are spaced at 15-second intervals. This is what it looks like uploaded to a Jupyter notebook and put into a pandas DataFrame.
            time ticker    price
0       09:30:35     EV   33.860
1       00:00:00    AMG   60.430
2       09:30:35    AMG   60.750
3       00:00:00    BLK  455.350
4       09:30:35    BLK  451.514
...          ...    ...      ...
502596  13:00:55    TLT  166.450
502597  13:00:55    VXX   47.150
502598  13:00:55   TSLA  529.800
502599  13:00:55   BIDU  103.500
502600  13:00:55     ON   12.700
# NOTE: the 00:00:00 rows hold each ticker's value at market open;
# they only accompany the first (09:30:35) batch of data.
I need to create a function that takes an input (a ticker) and then creates a bar chart that displays the data with 5-minute ticks (the data is every 20 seconds, so one tick for every 15 points in time).
So far I've thought about separating the "mm" part of hh:mm:ss to get just the minutes in another column, and then writing a for loop that looks something like this:
for num in df['mm']:
    if num % 5 == 0:
        print('tick')
then somehow appending the "tick" to the "time" column for every 5 minutes of data (I'm not sure how I would do this), then using the time column as the index and only using data with the "tick" index in it (some kind of if statement). I'm not sure if this makes sense, but I'm drawing a blank on this.
You should have a look at the built-in functions in pandas. In the following example I'm using a date + time format, but it shouldn't be hard to convert one to the other.
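For instance, a sketch of that conversion, assuming the time column from the question holds hh:mm:ss strings (pandas attaches a default date of 1900-01-01):
# parse the time-only strings into proper datetimes so resampling works
df["date"] = pd.to_datetime(df["time"], format="%H:%M:%S")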
Generate data
%matplotlib inline
import pandas as pd
import numpy as np
dates = pd.date_range(start="2020-04-01", periods=150, freq="20S")
df1 = pd.DataFrame({"date":dates,
"price":np.random.rand(len(dates))})
df2 = df1.copy()
df1["ticker"] = "a"
df2["ticker"] = "b"
df = pd.concat([df1,df2], ignore_index=True)
df = df.sample(frac=1).reset_index(drop=True)
Resample Timeseries every 5 minutes
Here you can try to see the output of
df1.set_index("date")\
.resample("5T")\
.first()\
.reset_index()
where we keep just the first element in each 5-minute bin, at 05:00, 10:00 and so on. In general, to do the same for every ticker, we need a groupby:
out = df.groupby("ticker")\
.apply(lambda x: x.set_index("date")\
.resample("5T")\
.first()\
.reset_index())\
.reset_index(drop=True)
Plot function
def plot_tick(data, ticker):
ts = data[data["ticker"]==ticker].reset_index(drop=True)
ts.plot(x="date", y="price", kind="bar", title=ticker);
plot_tick(out, "a")
Then you can improve the plot or, alternatively, try plotly.
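If you do try plotly, a minimal sketch with plotly.express, assuming the same out frame (plot_tick_plotly is just an illustrative name):
import plotly.express as px

def plot_tick_plotly(data, ticker):
    ts = data[data["ticker"] == ticker]
    # px.bar builds an interactive bar chart from the frame's columns
    px.bar(ts, x="date", y="price", title=ticker).show()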
I fully understand there are a few versions of this questions out there, but none seem to get at the core of my problem. I have a pandas Dataframe with roughly 72,000 rows from 2015 to now. I am using a calculation that finds the most impactful words for a given set of text (tf_idf). This calculation does not account for time, so I need to break my main Dataframe down into time-based segments, ideally every 15 and 30 days (or n days really, not week/month), then run the calculation on each time-segmented Dataframe in order to see and plot what words come up more and less over time.
I have been able to build part of this out semi-manually with the following:
def dateRange():
    start = input("Enter a start date (MM-DD-YYYY) or '30' for last 30 days: ")
    if start != '30':
        datetime.strptime(start, '%m-%d-%Y')  # validates the format; raises ValueError if malformed
        end = input("Enter an end date (MM-DD-YYYY): ")
        datetime.strptime(end, '%m-%d-%Y')
        dataTime = data[(data['STATUSDATE'] > start) & (data['STATUSDATE'] <= end)]
    else:
        dataTime = data[data.STATUSDATE > datetime.now() - pd.to_timedelta('30day')]
    return dataTime
dataTime = dateRange()
dataTime2 = dateRange()
def calcForDateRange(dateRangeFrame):
    ##### LONG FUNCTION ####
    return word and number
calcForDateRange(dataTime)
calcForDateRange(dataTime2)
This works - however, I have to manually create the 2 dates, which is expected as I created this as a test. How can I split the Dataframe into increments and run the calculation for each dataframe?
dicts are allegedly the way to do this. I tried:
dict_of_dfs = {}
for n, g in data.groupby(data['STATUSDATE']):
    dict_of_dfs[n] = g

for frame in dict_of_dfs:
    calcForDateRange(frame)
The dict result was 2015-01-02: Dataframe with no frame. How can I break this down into 100 or so Dataframes to run my function on?
Also, I do not fully understand how to break down ['STATUSDATE'] by a specific number of days.
I would like to avoid iterating as much as possible, but I know I probably will have to somewhere.
Thank you.
Let us assume you have a data frame like this:
date = pd.date_range(start='1/1/2018', end='31/12/2018', normalize=True)
x = np.random.randint(0, 1000, size=365)
df = pd.DataFrame(x, columns = ["X"])
df['Date'] = date
df.head()
Output:
     X       Date
0  328 2018-01-01
1  188 2018-01-02
2  709 2018-01-03
3  259 2018-01-04
4  131 2018-01-05
So this data frame has 365 rows, one for each day of the year.
Now, if you want to group this data into intervals of 20 days and assign each group to a dict, you can do the following:
df_dict = {}
for k, v in df.groupby(pd.Grouper(key="Date", freq='20D')):
    df_dict[k.strftime("%Y-%m-%d")] = pd.DataFrame(v)

print(df_dict)
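To then run the calculation on each segment, iterate over the dict's values rather than its keys; iterating a dict directly yields only the keys, which is why the attempt above passed date strings to calcForDateRange instead of frames:
for start, frame in df_dict.items():
    calcForDateRange(frame)  # frame is the 20-day DataFrame; start is its label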
How about something like this? It creates a dictionary of non-empty dataframes keyed on the starting date of the period.
import datetime as dt

start = '12-31-2017'
interval_days = 30
start_date = pd.Timestamp(start)
end_date = pd.Timestamp(dt.date.today() + dt.timedelta(days=1))
dates = pd.date_range(start=start_date, end=end_date, freq=f'{interval_days}d')

# assumes df has a datetime column named `dates`; substitute your own (e.g. 'STATUSDATE')
sub_dfs = {d1.strftime('%Y%m%d'): df.loc[df.dates.ge(d1) & df.dates.lt(d2)]
           for d1, d2 in zip(dates, dates[1:])}

# Remove empty dataframes.
sub_dfs = {k: v for k, v in sub_dfs.items() if not v.empty}
I am trying to take yearly max rainfall data for multiple years of data within one array. I understand how you would need to use a for loop to take the max of a single range, and I saw there was a similar question to the problem I'm having. However, I need to take leap years into account!
So for the first set I have 14616 data points from 1960-1965, not including 1965, which contains 2 leap years: 1960 and 1964. A leap year contains 2928 data points and every other year contains 2920 data points.
My first thought was to modify the solution from the similar question, which involved using a for loop as follows (just a straight copy-paste from theirs):
for i, d in enumerate(data_you_want):
    if (i % 600) == 0:
        avg_for_day = np.mean(data_you_want[i - 600:i])
        daily_averages.append(avg_for_day)
Theirs involved taking the average of every 600 lines in their data. I thought there might be a way to just modify this, but I couldn't figure out how to make it work. If modifying this won't work, is there another way to loop over it with the leap years taken into account, without completely cutting up the file manually?
Fake data:
import numpy as np
fake = np.random.randint(2, 30, size = 14616)
Use pandas to handle the leap year functionality.
Create timestamps for your data with pandas.date_range().
import pandas as pd
index = pd.date_range(start = '1960-1-1 00:00:00', end = '1964-12-31 23:59:59' , freq='3H')
Then create a DataFrame using the timestamps for the index.
df = pd.DataFrame(data = fake, index = index)
Aggregate by year, taking advantage of the DatetimeIndex flexibility.
>>> df.loc['1960'].max()
0    29
dtype: int32
>>> df.loc['1960'].mean()
0    15.501366
dtype: float64
>>>
>>> len(df.loc['1960'])
2928
>>> len(df.loc['1961'])
2920
>>> len(df.loc['1964'])
2928
>>>
I just cobbled this together from the Time Series / Date functionality section of the pandas docs. Given pandas' capabilities this looks a bit naive and can probably be improved upon.
For example, resampling (using the same DataFrame):
>>> df.resample('A').mean()
0
1960-12-31 15.501366
1961-12-31 15.170890
1962-12-31 15.412329
1963-12-31 15.538699
1964-12-31 15.382514
>>> df.resample('A').max()
0
1960-12-31 29
1961-12-31 29
1962-12-31 29
1963-12-31 29
1964-12-31 29
>>>
>>> r = df.resample('A')
>>> r.agg([np.sum, np.mean, np.std])
0
sum mean std
1960-12-31 45388 15.501366 8.211835
1961-12-31 44299 15.170890 8.117072
1962-12-31 45004 15.412329 8.257992
1963-12-31 45373 15.538699 7.986877
1964-12-31 45040 15.382514 8.178057
>>>
Food for thought:
Time-aware Rolling vs. Resampling
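For instance, a minimal sketch of time-aware rolling on the same frame, in contrast to the fixed calendar bins that resample uses:
# a 365-day window ending at each timestamp, rather than calendar-year bins
df.rolling('365D').mean()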