Using the ff_monthly.csv data set (https://github.com/alexpetralia/fama_french):
(a) Use the first column as an index (this contains the year and month of the data as a string). Create a new column ‘Mkt’ as ‘Mkt-RF’ + ‘RF’.
(b) Create two new columns in the loaded DataFrame, ‘Month’ and ‘Year’, to contain the year and month of the data set extracted from the index column.
(c) Create a new DataFrame with columns ‘Mean’ and ‘Standard Deviation’ and the full set of years from (b) above.
(d) Write a function which accepts (r_m, s_m), the monthly mean and standard deviation of a return series, and returns a tuple (r_a, s_a), the annualised mean and standard deviation. Use the formulae r_a = (1 + r_m)^12 - 1 and s_a = s_m * 12^0.5. (A sketch of such a function follows this list.)
(e) Loop through each year in the data and calculate the annualised mean and standard deviation of the new ‘Mkt’ column, storing each in the newly created DataFrame. Note that the values in the input file are % returns and need to be divided by 100 to give decimals (i.e. the value for August 2022 represents a return of -3.78%).
(f) Print the DataFrame and output it to a csv file.
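A minimal sketch of the function in (d), using the stated formulae (the name annualise is my own choice):

def annualise(r_m, s_m):
    """Annualise a monthly mean and standard deviation."""
    r_a = (1 + r_m) ** 12 - 1   # compound the monthly mean over 12 months
    s_a = s_m * 12 ** 0.5       # scale volatility by the square root of time
    return (r_a, s_a)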
Workings so far:
import pandas as pd

# read the file, using the first (date) column as the index
ff_monthly = pd.read_csv(r"file path", index_col=0)

# create the new 'Mkt' column as 'Mkt-RF' + 'RF'
Mkt = ff_monthly['Mkt-RF'] + ff_monthly['RF']
ff_monthly = ff_monthly.assign(Mkt=Mkt)
df = ff_monthly  # read_csv already returns a DataFrame; no pd.DataFrame() wrapper needed
There are a few things to pay attention to.
The Date is the index of your DataFrame, which is treated in a special way compared to the normal columns. This is why df.Date raises an AttributeError: Date is not an attribute but the index. Try df.index instead.
df.Date.str.split("_", expand=True) would work if your Date looked like 22_10. However, according to your picture it doesn't contain an underscore and also contains the day, so this cannot work.
In fact, the format you have doesn't follow any standard. The best way to deal with it is to parse it into a proper datetime64[ns] type that pandas will understand, with df.index = pd.to_datetime(df.index, format='%y%m%d'). See the Python documentation for supported format strings.
If all this works, it should be rather straightforward to create the columns (note that a DatetimeIndex exposes .year directly; the .dt accessor is only for Series):
df['year'] = df.index.year
df['month'] = df.index.month
In fact, this part has been asked before
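Putting the pieces together, a sketch of the whole exercise might look like the following. It reuses the annualise function sketched above; the parse format and the output file name annualised_mkt.csv are assumptions:

import pandas as pd

ff_monthly = pd.read_csv(r"file path", index_col=0)
ff_monthly = ff_monthly.assign(Mkt=ff_monthly['Mkt-RF'] + ff_monthly['RF'])

# parse the string index into datetimes, then extract year and month
# (if the strings hold only year and month, format='%Y%m' may be needed instead)
ff_monthly.index = pd.to_datetime(ff_monthly.index, format='%y%m%d')
ff_monthly['Year'] = ff_monthly.index.year
ff_monthly['Month'] = ff_monthly.index.month

# one row per year, to be filled with annualised statistics
years = sorted(ff_monthly['Year'].unique())
results = pd.DataFrame(index=years, columns=['Mean', 'Standard Deviation'])

for year in years:
    # values are % returns, so divide by 100 to get decimals
    mkt = ff_monthly.loc[ff_monthly['Year'] == year, 'Mkt'] / 100
    results.loc[year] = annualise(mkt.mean(), mkt.std())

print(results)
results.to_csv('annualised_mkt.csv')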
So I have this dataset of temperatures. Each line describes the temperature in Celsius measured by hour in a day.
I need to compute a new variable called avg_temp_ar_mensal, which represents the average temperature of a city in a month. City in this dataset is represented as estacao and month as mes.
I'm trying to do this using pandas. The following line of code is the one I'm trying to use to solve this problem:
df2['avg_temp_ar_mensal'] = df2['temp_ar'].groupby(df2['mes', 'estacao']).mean()
The goal of this code is to store in a new column the average temperature for each city and month. But it doesn't work. If I try the following line of code:
df2['avg_temp_ar_mensal'] = df2['temp_ar'].groupby(df2['mes']).mean()
It works, but it is wrong: it averages across every city in the dataset, and I don't want that because it will add noise to my data. I need to separate the temperatures by month and city and then calculate the mean.
The DataFrame after a groupby is smaller than the initial DataFrame; that is why your code runs into an error.
There are two ways to solve this problem. The first is to use transform, which returns a result aligned with the original index:
df['avg_temp_ar_mensal'] = df.groupby(['mes', 'estacao'])['temp_ar'].transform('mean')
The second is to create a new dfn from the groupby, then merge it back into df:
dfn = df.groupby(['mes', 'estacao'])['temp_ar'].mean().reset_index(name='average')
df = pd.merge(df, dfn, on=['mes', 'estacao'], how='left')
You are calling groupby on a single column when you do df2['temp_ar'].groupby(...). This doesn't make much sense, since within a single column there is nothing to group by.
Instead, you have to perform the groupby on all the columns you need. Also, make sure the final output aligns with the original index, which transform takes care of:
df['new_column'] = df[['city_column', 'month_column', 'temp_column']].groupby(['city_column', 'month_column'])['temp_column'].transform('mean')
This should do the trick if I understand your dataset correctly. If not, please provide a reproducible version of your df.
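For reference, a tiny reproducible sketch (the column names follow the question; the values are invented):

import pandas as pd

df2 = pd.DataFrame({
    'estacao': ['A', 'A', 'B', 'B'],
    'mes':     [1,   1,   1,   2],
    'temp_ar': [20.0, 22.0, 18.0, 25.0],
})

# per-(month, city) mean, aligned with the original rows
df2['avg_temp_ar_mensal'] = df2.groupby(['mes', 'estacao'])['temp_ar'].transform('mean')
print(df2)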
I'm trying to pull some data from yfinance in Python for different funds from different exchanges. In pulling my data I just set up the start and end dates through:
import yfinance as yf

start = '2002-01-01'
end = '2022-06-30'
and pulling it through:
assets = ['GOVT', 'IDNA.L', 'IMEU.L', 'EMMUSA.SW', 'EEM', 'IJPD.L', 'VCIT',
'LQD', 'JNK', 'JNKE.L', 'IEF', 'IEI', 'SHY', 'TLH', 'IGIB',
'IHYG.L', 'TIP', 'TLT']
assets.sort()
data = yf.download(assets, start = start, end = end)
I guess you've noticed that the "assets" or the ETFs come from different exchanges such as ".L" or ".SW".
Now the result looks like this:
It seems to me that there is no overlap for a single instrument (i.e. two prices for the same day), so I don't think the data will be disturbed if any scrubbing or clean-up is done.
My goal is to harmonize or consolidate the prices onto a date index rather than a date-and-time index, so that the prices of the instruments sit side by side for each particular date.
Thanks!
If you want the daily last closing price from the Yahoo Finance API, you could use the interval argument:
yf.download(assets, start=start, end=end, interval="1d")
Solution with Pandas:
Transforming the Index
You have an index where each row is a string representing the datetime. First, transform those strings into an actual DatetimeIndex, where each row is of type datetime64; this lets you work with dates easily, applying functionality from the datetime library. Then pick just the date from each datetime64:
data.index = pd.to_datetime(data.index).date
Groupby
Now that you have an index of dates, you can group by the index. First, deal with the NaN values. If you want a closing price to fill values only within its own date, apply:
data = data.groupby(data.index).ffill()
Otherwise, if you think the closing price of (e.g.) the 1st of October can be used to fill NaN values not only on the 1st of October but also on the 2nd and 3rd, simply apply ffill() without the groupby:
data = data.ffill()
Lastly, take the last observed record for each date (the index). Note that you can apply any function you want here, even a custom lambda:
data = data.groupby(data.index).last()
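A minimal end-to-end sketch on made-up data (the tickers, index strings, and values are invented for illustration):

import pandas as pd

data = pd.DataFrame(
    {'GOVT': [25.1, None, 25.3], 'IDNA.L': [None, 7.2, None]},
    index=['2022-06-28 00:00:00', '2022-06-28 01:00:00', '2022-06-29 00:00:00'],
)

data.index = pd.to_datetime(data.index).date   # keep only the date part
data = data.groupby(data.index).ffill()        # fill NaNs within each date
data = data.groupby(data.index).last()         # one consolidated row per date
print(data)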
I have a csv-file (called "cameradata") with the columns MeetingRoomID and Time (there are more columns, but they should not be needed).
I would like to get the number of occurrences of a certain MeetingRoomID ("14094020", from the column "MeetingRoomID") during one day. The csv-file luckily only consists of timestamps from one day in the "Time" column. One problem is that the timestamps are in the datetime format %H:%M:%S, and I want to categorize the occurrences by the hour they occurred (between 07:00-18:00).
The goal is to have the occurrences linked to the hours of the timestamps, so that I can plot a bar plot with x = the hourly timestamps and y = a DataFrame/Series that maps the certain MeetingRoomID to the hour it was used.
How can I get a function for my y-axis that understands that the value_count for ID 14094020 and the timestamps are connected?
So far I've come up with something like this:
y = cameradata.set_index('Time').resample('H')
cameradata['MeetingRoomID'].value_counts()[14094020]
My code seems to work if I run the two parts separately, but I do not know how to connect them in a syntax-friendly way.
Clarification:
The code: cameradata['MeetingRoomID'].value_counts().idxmax() revealed the ID with the most occurrences, so I think I'm onto something there.
Grateful for your help!
This is what the printed DataFrame looks like: 'Tid' is time and 'MätplatsID' is what I called MeetingRoomID.
For some reason the "Time" column has a made-up year and month added next to it after I converted it to datetime. I converted it to datetime by: kameradata['Tid'] = pd.to_datetime(kameradata['Tid'], format=('%H:%M:%S'))
This is an example of what the output should look like in the end.
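One way to tie the counts to the hours is to group by the hour of the parsed timestamp. A sketch, assuming the column names from the clarification above (the file name is hypothetical; note that parsing with only %H:%M:%S is what adds the default year/month 1900-01, which is harmless here since only the hour is used):

import pandas as pd

kameradata = pd.read_csv('cameradata.csv')  # hypothetical file name
kameradata['Tid'] = pd.to_datetime(kameradata['Tid'], format='%H:%M:%S')
kameradata['Hour'] = kameradata['Tid'].dt.hour

# occurrences of one specific room per hour, restricted to 07:00-18:00
room = kameradata[kameradata['MätplatsID'] == 14094020]
counts = room[room['Hour'].between(7, 18)].groupby('Hour').size()
counts.plot.bar()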
Using the following code I can build a simple table with the current COVID-19 cases worldwide, per country:
import pandas as pd

url = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv"
raw_data = pd.read_csv(url, sep=",")
raw_data.drop(['Province/State','Lat','Long'], axis = 1, inplace = True)
plot_data = raw_data.groupby('Country/Region').sum()
The plot_data is a simple DataFrame:
What I would like to do now is to subtract the values in each column from the values in the column for the prior day, i.e. I want to get the new cases per day.
If I do something like plot_data['3/30/20'].add(-plot_data['3/29/20']), it works well. But if I do something like plot_data.iloc[:,68:69].add(-plot_data.iloc[:,67:68]), I get two columns with NaN values; i.e. Python tries to "preserve" the column headers and doesn't perform the operation the way I would like it to.
My goal was to perform this operation in an "elegant way". I was thinking of something along the lines of plot_data.iloc[:,1:69].add(-plot_data.iloc[:,0:68]). But of course, if it doesn't work in the single-column example, it doesn't work with multiple columns either (as Python will match the column headers and return a bunch of zeros/NaN values).
Maybe there is a way to tell Python to ignore the headers during an operation with a DataFrame? I know that I can transform my DataFrame into a NumPy array and do a bunch of operations. However, since this is a simple/small table, I thought I would try to keep using a DataFrame data type.
The good old shift can be used on the horizontal axis:
plot_data - plot_data.shift(-1, axis=1)
should be what you want.
Thank you very much @Serge Ballesta! Your answer is exactly the type of "elegant solution" I was looking for. The only comment is that the shift sign should be positive:
plot_data - plot_data.shift(1, axis=1)
This way we bring the historical figures forward one day, and now I can subtract them from the actual numbers on each day.
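For what it's worth, pandas also has a built-in shorthand for this subtract-the-previous-column pattern, DataFrame.diff along axis=1:

# equivalent to plot_data - plot_data.shift(1, axis=1)
new_cases = plot_data.diff(axis=1)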
My goal is to convert period to datetime.
If Life Was Easy:
master_df = master_df['Month'].to_datetime()
Back Story:
I built a new DataFrame that originally summed the monthly totals and made a 'Month' column by converting a timestamp to a period. Now I want to convert that period back to a timestamp so that I can create plots using matplotlib.
I have tried the following:
Reading the docs for Period.to_timestamp.
Converting to a string and then back to datetime. This still keeps the period issue and won't convert.
Following a couple of similar questions on Stack Overflow, but I could not get them to work.
A simple goal would be to plot the following:
plot.bar(m_totals['Month'], m_totals['Showroom Visits']);
This is the error I get if I try to use a period dtype in my charts
ValueError: view limit minimum 0.0 is less than 1 and is an invalid Matplotlib date value.
This often happens if you pass a non-datetime value to an axis that has datetime units.
Additional Material:
Code I used to create the Month column (where the period issue was created):
master_df['Month'] = master_df['Entry Date'].dt.to_period('M')
Codes I used to group to monthly totals:
m_sums = master_df.groupby(['DealerName','Month']).sum().drop(columns={'Avg. Response Time','Closing Percent'})
m_means = master_df.groupby(['DealerName','Month']).mean()
m_means = m_means[['Avg. Response Time','Closing Percent']]
m_totals = m_sums.join(m_means)
m_totals.reset_index(inplace=True)
m_totals
Resulting DataFrame:
I was able to cast the period type to string then to datetime. Just could not go straight from period to datetime.
m_totals['Month'] = m_totals['Month'].astype(str)
m_totals['Month'] = pd.to_datetime(m_totals['Month'])
m_totals.dtypes
I wish I did not get downvoted for not providing the entire DataFrame.
First change it to str, then to datetime:

import pandas as pd

# build a monthly PeriodIndex
index = pd.period_range(start='1949-01', periods=144, freq='M')
type(index)  # pandas.core.indexes.period.PeriodIndex

# changing period to datetime via str
index = index.astype(str)
index = pd.to_datetime(index)

df.set_index(index, inplace=True)  # assumes df has 144 rows
type(df.index)  # pandas.core.indexes.datetimes.DatetimeIndex
df.info()
Another potential solution is to use to_timestamp. For example:
m_totals['Month'] = m_totals['Month'].dt.to_timestamp()
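A quick self-contained check of both routes, on made-up data:

import pandas as pd

s = pd.Series(pd.period_range('2020-01', periods=3, freq='M'))

via_str = pd.to_datetime(s.astype(str))  # period -> str -> datetime
direct = s.dt.to_timestamp()             # period -> datetime directly
print(via_str.equals(direct))            # True: both yield datetime64[ns]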