I have a panda table with a column called "date_of_work". This column contains date objects in the following format MM/DD/YYYY. For example 9/19/2016, or 12/5/2016
I'm trying to create a new column that assigns a day from a value between 1 and 365 so I can create a scatter plot with dates on the x axis. I created this function:
def converttoday(datex):
datex=str(datex)
newdate=datex.split('/')
day1=int(newdate[0])*30
day2=int(newdate[1])
finaldate=day1+day2
return finaldate
It ignores the year because I don't care about that (more focused on seasonality). Any idea how to convert this? I'm getting an error when attempting this.
Any help appreciated!
Try this:
df['dayofyear'] = pd.to_datetime(df['date_of_work']).dayofyear
Related
Using the ff_monthly.csv data set https://github.com/alexpetralia/fama_french,
use the first column as an index
(this contains the year and month of the data as a string
Create a new column ‘Mkt’ as ‘Mkt-RF’ + ‘RF’
Create two new columns in the loaded DataFrame, ‘Month’ and ‘Year’ to
contain the year and month of the dataset extracted from the index column.
Create a new DataFrame with columns ‘Mean’ and ‘Standard
Deviation’ and the full set of years from (b) above.
Write a function which accepts (r_m,s_m) the monthy mean and standard
deviation of a return series and returns a tuple (r_a,s_a), the annualised
mean and standard deviation. Use the formulae: r_a = (1+r_m)^12 -1, and
s_a = s_m * 12^0.5.
Loop through each year in the data, and calculate the annualised mean and
standard deviation of the new ‘Mkt’ column, storing each in the newly
created DataFrame. Note that the values in the input file are % returns, and
need to be divided by 100 to return decimals (i.e the value for August 2022
represents a return of -3.78%).
. Print the DataFrame and output it to a csv file.
Workings so far:
import pandas as pd
ff_monthly=pd.read_csv(r"file path")
ff_monthly=pd.read_csv(r"file path",index_col=0)
Mkt=ff_monthly['Mkt-RF']+ff_monthly['RF']
ff_monthly= ff_monthly.assign(Mkt=Mkt)
df=pd.DataFrame(ff_monthly)
enter image description here
There are a few things to pay attention to.
The Date is the index of your DataFrame. This is treated in a special way compared to the normal columns. This is the reason df.Date gives an Attribute error. Date is not an Attribute, but the index. Instead try df.index
df.Date.str.split("_", expand=True) would work if your Date would look like 22_10. However according to your picture it doesn't contain an underscore and also contains the day, so this cannot work
In fact the format you have is not even following any standard. In order to properly deal with that the best way would be parsing this to a proper datetime64[ns] type that pandas will understand with df.index = pd.to_datetime(df.index, format='%y%m%d'). See the python docu for supported format strings.
If all this works, it should be rather straightforward to create the columns
df['year'] = df.index.dt.year
In fact, this part has been asked before
I have a column named as month in which month is given in the format 1,2,3,4.....9 etc. I want to replace them with 01,02,03,....09. I am currently using the code.
vals_to_replace = {'1':'01','2':'02','3':'03','4':'04','5':'05','6':'06','7':'07','8':'08','9':'09'}
data['month'] = data['month'].map(vals_to_replace)
But after I am applying the map function, it is showing me nan values. Why it is happening .Please help
The issue here is that the month column in your dataframe comprises of integer values and you are mapping them to string values in the dictionary. Change it to integer values and it should work..
I have a pandas dataframe and the index column is time with hourly precision. I want to create a new column that compares the value of the column "Sales number" at each hour with the same exact time one week ago.
I know that it can be written in using shift function:
df['compare'] = df['Sales'] - df['Sales'].shift(7*24)
But I wonder how can I take advantage of the date_time format of the index. I mean, is there any alternatives to using shift(7*24) when the index is in date_time format?
Try something with
df['Sales'].shift(7,freq='D')
I am messing around in the NYT covid dataset which has total covid cases for each county, per day.
I would like to find out the difference of cases between each day, so theoretically I could get the number of new cases per day instead of total cases. Taking a rolling mean, or resampling every 2 days using a mean/sum/etc all work just fine. It's just subtracting that is giving me such a headache.
Tried methods:
df.resample('2d').diff()
'DatetimeIndexResampler' object has no attribute 'diff'
df.resample('1d').agg(np.subtract)
ufunc() missing 1 of 2required positional argument(s)
df.rolling(2).diff()
'Rolling' object has no attribute 'diff'
df.rolling('2').agg(np.subtract)
ufunc() missing 1 of 2required positional argument(s)
Sample data:
pd.DataFrame(data={'state':['Alabama','Alabama','Alabama','Alabama','Alabama'],
'date':[dt.date(2020,3,13),dt.date(2020,3,14),dt.date(2020,3,15),dt.date(2020,3,16),dt.date(2020,3,17)],
'covid_cases':[1.2,2.0,2.9,3.6,3.9]
})
Desired sample output:
pd.DataFrame(data={'state':['Alabama','Alabama','Alabama','Alabama','Alabama'],
'date':[dt.date(2020,3,13),dt.date(2020,3,14),dt.date(2020,3,15),dt.date(2020,3,16),dt.date(2020,3,17)],
'new_covid_cases':[np.nan,0.8,0.9,0.7,0.3]
})
Recreate sample data from original NYT dataset:
df = pd.read_csv('https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv',parse_dates=['date'])
df.groupby(['state','date'])[['cases']].mean().reset_index()
Any help would be greatly appreciated! Would like to learn how to do this manually/via function rather than finding a "new cases" dataset as I will be working with timeseries a lot in the very near future.
Let's try this bit of complete code:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv')
df['date'] = pd.to_datetime(df['date'])
df_daily_state = df.groupby(['date','state'])['cases'].sum().unstack()
daily_new_cases_AL = df_daily_state.diff()['Alabama']
ax = daily_new_cases_AL.iloc[-30:].plot.bar(title='Last 30 days Alabama New Cases')
Output:
Details:
Download the historical case records from NYTimes github using the
raw URL
Convert the dtype of the 'date' column to datetime dtype
Groupby 'date' and 'state' columns sum 'cases' and unstack the state
level of the index to get dates of rows and states for columns.
Take the difference by columns and select only the Alabama column
Plot the last 30 days
The diff function is correct, but if you look at your error message:
'DatetimeIndexResampler' object has no attribute 'diff'
in your first tried methods, it's because diff is a function available for DataFrames, not for Resamplers, so turn it back into a DataFrame by specifying how you want to resample it.
If you have the total number of COVID cases for each day and want to resample it to 2 days, you probably only want to keep the latest update out of the two days, in which case something like df.resample('2d').last().diff() should work.
I have a dataframe that has Date as its index. The dataframe has stock market related data so the dates are not continuous. If I want to move lets say 120 rows up in the dataframe, how do I do that. For example:
If I want to get the data starting from 120 trading days before the start of yr 2018, how do I do that below:
df['2018-01-01':'2019-12-31']
Thanks
Try this:
df[df.columns[df.columns.get_loc('2018-01-01'):df.columns.get_loc('2019-12-31')]]
Get location of both Columns in the column array and index them to get the desired.
UPDATE :
Based on your requirement, make some little modifications of above.
Yearly Indexing
>>> df[df.columns[(df.columns.get_loc('2018')).start:(df.columns.get_loc('2019')).stop]]
Above df.columns.get_loc('2018') will give out numpy slice object and will give us index of first element of 2018 (which we index using .start attribute of slice) and similarly we do for index of last element of 2019.
Monthly Indexing
Now consider you want data for First 6 months of 2018 (without knowing what is the first day), the same can be done using:
>>> df[df.columns[(df.columns.get_loc('2018-01')).start:(df.columns.get_loc('2018-06')).stop]]
As you can see above we have indexed the first 6 months of 2018 using the same logic.
Assuming you are using pandas and the dataframe is sorted by dates, a very simple way would be:
initial_date = '2018-01-01'
initial_date_index = df.loc[df['dates']==initial_date].index[0]
offset=120
start_index = initial_date_index-offset
new_df = df.loc[start_index:]