Formatting date data in NumPy array - python

I would be really grateful for some advice. I had the exercise written below:
The first column (index 0) contains year values as four digit numbers
in the format YYYY (2016, since all trips in our data set are from
2016). Use assignment to change these values to the YY format (16) in
the test_array ndarray.
I used this code to solve it:
test_array[:,0] = test_array[:,0]%100
But I'm sure there must be a more universal and smarter way to get the same result with datetime or something else, and I can't find it. I tried different variations of this code, but I don't understand what's wrong:
dt.datetime.strptime(str(test_array[:,0]), "%Y")
test_array[:,0] = dt.datetime.strftime("%y")
Could you help me with this, please?
Thank you

Converting a year from YYYY format to YY format with strftime requires an intermediate datetime value. With the data in a pandas DataFrame df, this can be done in the following manner:
df.iloc[:, 0] = df.iloc[:, 0].apply(lambda x: pd.Timestamp(year=int(x), month=1, day=1).strftime('%y'))
Building the datetime value needs three arguments: year, month and day. Only the year is known, so the month and day are both set to 1 as placeholders.
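For the NumPy array from the question, the same idea can be sketched without pandas. This is only an illustration: the contents of test_array below are made up, and the result is converted back to int because strftime returns strings while the array holds integers.

```python
import datetime as dt

import numpy as np

# Made-up sample data: column 0 holds four-digit years.
test_array = np.array([[2016, 1, 5],
                       [2016, 2, 7],
                       [2016, 3, 9]])

# Build a datetime for 1 January of each year, format it as YY,
# and convert the resulting string back to an integer.
yy = [int(dt.datetime(int(year), 1, 1).strftime("%y"))
      for year in test_array[:, 0]]

# Assign the two-digit years back into the first column.
test_array[:, 0] = yy
print(test_array[:, 0])
```

For plain integer years, test_array[:, 0] % 100 gives the same values with far less work, so the datetime route mainly pays off when the column already holds dates.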

Related

Pycharm problem set (Stuck from step 3 onwards)

Using the ff_monthly.csv data set https://github.com/alexpetralia/fama_french,
use the first column as an index
(this contains the year and month of the data as a string)
Create a new column ‘Mkt’ as ‘Mkt-RF’ + ‘RF’
Create two new columns in the loaded DataFrame, ‘Month’ and ‘Year’ to
contain the year and month of the dataset extracted from the index column.
Create a new DataFrame with columns ‘Mean’ and ‘Standard
Deviation’ and the full set of years from (b) above.
Write a function which accepts (r_m,s_m) the monthly mean and standard
deviation of a return series and returns a tuple (r_a,s_a), the annualised
mean and standard deviation. Use the formulae: r_a = (1+r_m)^12 -1, and
s_a = s_m * 12^0.5.
Loop through each year in the data, and calculate the annualised mean and
standard deviation of the new ‘Mkt’ column, storing each in the newly
created DataFrame. Note that the values in the input file are % returns, and
need to be divided by 100 to return decimals (i.e the value for August 2022
represents a return of -3.78%).
Print the DataFrame and output it to a csv file.
Workings so far:
import pandas as pd
# Load the data once, using the first column as the index
ff_monthly = pd.read_csv(r"file path", index_col=0)
# Market return = excess return + risk-free rate
Mkt = ff_monthly['Mkt-RF'] + ff_monthly['RF']
ff_monthly = ff_monthly.assign(Mkt=Mkt)
df = pd.DataFrame(ff_monthly)
There are a few things to pay attention to.
Date is the index of your DataFrame. The index is treated differently from the normal columns, which is why df.Date raises an AttributeError: Date is not an attribute, it is the index. Try df.index instead.
df.Date.str.split("_", expand=True) would work if your Date looked like 22_10. However, according to your picture it doesn't contain an underscore and it also contains the day, so this cannot work.
In fact, the format you have does not follow any standard. The best way to deal with it is to parse it into a proper datetime64[ns] type that pandas understands, with df.index = pd.to_datetime(df.index, format='%y%m%d'). See the Python documentation for the supported format strings.
If all this works, it should be rather straightforward to create the columns
df['year'] = df.index.year
In fact, this part has been asked before
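Putting the steps together, a minimal sketch could look like the following. The three-row DataFrame is a hypothetical stand-in for ff_monthly.csv, the '%y%m%d' format is taken from the answer above and may not match the real file, and annualise is an illustrative name for the function the exercise asks for.

```python
import pandas as pd

# Hypothetical stand-in for ff_monthly.csv; the real index format may differ.
df = pd.DataFrame(
    {"Mkt-RF": [1.0, -2.0, 3.0], "RF": [0.1, 0.1, 0.1]},
    index=["211130", "211231", "220131"],
)

# Parse the string index into a DatetimeIndex.
df.index = pd.to_datetime(df.index, format="%y%m%d")

# A DatetimeIndex exposes .year and .month directly (no .dt accessor).
df["Mkt"] = df["Mkt-RF"] + df["RF"]
df["Year"] = df.index.year
df["Month"] = df.index.month


def annualise(r_m, s_m):
    """Return (r_a, s_a) from a monthly mean and standard deviation."""
    r_a = (1 + r_m) ** 12 - 1
    s_a = s_m * 12 ** 0.5
    return r_a, s_a


print(df)
print(annualise(0.01, 0.02))
```

The .dt accessor only exists on a Series, so it would be needed if the dates lived in a regular column rather than in the index.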

How to convert an int64 into a datetime?

I'm trying to convert the column Year (type: int64) into a date type so that I can use the Groupby function to group by decade.
I'm using the following code to convert the datatype:
import datetime as dt
crime["Date"]=pd.TimedeltaIndex(crime["Year"], unit='d')+dt.datetime(1960,1,1)
crime[["Year","Date"]].head(10)
The date it is returning to me is not correct - it isn't starting at the correct year and the day is increasing by the rows.
I want the year to start at 1960, and for each row the year to increase by 1.
I tried substituting unit='d' in the code above with unit='y' and I get the following result:
ValueError: Units 'M' and 'Y' are no longer supported, as they do not represent unambiguous timedelta value durations.
I think @kate's answer is what you want. I wrote my answer before that one came along. I thought my answer might still be worth something to explain why unit='y' isn't supported, and why unit='d' isn't working for you either...
I wouldn't think this would be right:
TimedeltaIndex(crime["Year"], unit='d')
as I expect this to interpret your year count as a count of days. If you can't use unit='y', there is probably a good reason for that: years don't always have the same number of days, so a number of years is ambiguous in terms of the number of days it equates to. You have to add a count of years to an actual year for it to make exact sense.
The same holds true, even more so, for months: months have a variety of day counts, so you can't know what a timedelta in months really means.
I would add the column in the following way:
crime['Date'] = crime['Year'].map(lambda x: dt.datetime(1960 + x,1,1))
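The line above treats Year as an offset from 1960. If the column instead holds calendar years such as 1960, 1961 and so on (an assumption, since the data isn't shown), a vectorised sketch might parse the year directly. The crime DataFrame, the Total column and the Decade column below are made up for illustration.

```python
import pandas as pd

# Made-up sample; assumes "Year" holds calendar years as int64.
crime = pd.DataFrame({"Year": [1960, 1961, 1975], "Total": [10, 20, 30]})

# Parse each year as 1 January of that year.
crime["Date"] = pd.to_datetime(crime["Year"].astype(str), format="%Y")

# Grouping by decade only needs integer arithmetic on the year.
crime["Decade"] = crime["Year"] // 10 * 10
by_decade = crime.groupby("Decade")["Total"].sum()

print(crime)
print(by_decade)
```

Note that the decade grouping works on the integer column alone, so the datetime conversion is optional for that particular goal.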

How can I work better with dates in Python to remove NaNs and identify workdays and holidays between two intervals?

I have a dataframe with two date fields as shown below. I want to be able to use this data to calculate 'adjusted pay' for an employee - if the employee joined after the 15th of a month, they are paid for 15 days of March + April on the 10th of the month (payday), and equally if they leave in April, the calculation should only consider the days worked in April.
Hire_Date | Leaving_Date
_________________________
01/02/2007 | NaN
02/03/2007 | NaN
23/03/2020 | NaN
01/01/1999 | 04/04/2020
Oh and the above data didn't pull through in datetime format, and there are of course plenty of NaNs in the leaving_date field :)
Therefore, I did the following:
Converted the data to datetime format, retained the date, and filled N/As with a random date (not too happy about this, but this is only missing in a few records so not worried about the impact).
df['Hire_Date'] = pd.to_datetime(df['Hire_Date'])
df['Hire_Date'] = [a.date() for a in df['Hire_Date']]
df['Hire_Date'] = df['Hire_Date'].fillna('1800-01-01')
Repeated for Leaving date. Only difference here is that I've filled the NaNs with 0, given that we don't have that many leavers.
df['Leaving_Date'] = pd.to_datetime(df['Leaving_Date'])
df['Leaving_Date'] = [a.date() for a in df['Leaving_Date']]
df['Leaving_Date'] = df['Leaving_Date'].fillna('0')
I then ended up creating a fresh column to capture workdays, and here's where I run into the issue. My code is given below.
I identified the first day of the hire month, and attempted to work out the number of days worked in March, using a np.where() function.
df['z_First_Day_H_Month'] = df['Hire_Date'].values.astype('datetime64[M]')
df['March_Workdays'] = np.where((df['z_First_Day_H_Month'] >= '2020-03-01'),
(np.busday_count(df['z_First_Day_H_Month'], '2020-03-31')), 'N/A')
Similar process repeated, albeit a simpler calculation to work out the number of days worked in the termination month.
df['z_First_Day_T_Month'] = df.apply(lambda x: '2020-04-01').astype('datetime64[M]')
df['T_Mth_Workdays'] = df.apply(lambda x: np.busday_count(x['z_First_Day_T_Month'],
x['Leaving_Date']), axis=1)
However, the above process returns the following error:
TypeError: Iterator operand 0 dtype could not be cast from dtype('<M8[ns]') to dtype('<M8[D]') according to the rule 'safe'
Please can I get some help to fix this issue? Thanks!
I did a bit of research and it seems that the datetime unit is the problem. Your values have nanosecond precision ([ns]), while np.busday_count expects day precision ([D]), which causes the error. Take a look at the Datetime Units section of the NumPy documentation.
This post describes exactly the same problem as yours:
Numpy, TypeError: Could not be cast from dtype('<M8[us]') to dtype('<M8[D]')
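As a rough sketch of the fix, the dates can be cast to day precision before they reach np.busday_count. The two hire dates below are invented, and the end date is set to 1 April because np.busday_count excludes the end date from the count.

```python
import numpy as np
import pandas as pd

# Invented sample hire dates, parsed by pandas into a datetime Series.
hire = pd.to_datetime(pd.Series(["2020-03-02", "2020-03-23"]))

# Cast explicitly to day precision, which np.busday_count requires.
start = hire.values.astype("datetime64[D]")

# The end date is exclusive, so 1 April counts workdays up to 31 March.
end = np.datetime64("2020-04-01")

march_workdays = np.busday_count(start, end)
print(march_workdays)
```

This counts Monday to Friday only. Public holidays would need to be passed through the holidays argument of np.busday_count.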

Convert column to datetime format, in a leap year

I'm new to Python and programming in general, so I wasn't able to figure out the following: I have a dataframe named ozon, for which column 1 is the time stamp in mm-dd format. Now I want to change that column to a datetime format using the following code:
ozon[1] = pd.to_datetime(ozon[1], format='%m-%d')
Now this is giving me the following error: ValueError: day is out of range for month.
I think it has to do with the fact that it's a leap year, so it doesn't recognize February 29 as a valid date. How can I overcome this error? And could I also add a year to the timestamp (2020)?
Thanks so much in advance!
Add a year to the column values and to the format string:
ozon[1] = pd.to_datetime(ozon[1] + '-2000', format='%m-%d-%Y')
If it still fails because some values are not valid dates, add the errors='coerce' parameter:
ozon[1] = pd.to_datetime(ozon[1] + '-2000', format='%m-%d-%Y', errors='coerce')
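The original error comes from the default year: when the format has no year, parsing falls back to 1900, which is not a leap year, so 02-29 is rejected. The sketch below uses 2020, the year asked for in the question, with invented sample values that include one deliberately invalid entry.

```python
import pandas as pd

# Invented sample: column 1 holds mm-dd strings; "13-01" is invalid on purpose.
ozon = pd.DataFrame({1: ["02-28", "02-29", "13-01"]})

# Append the year so that February 29 is parsed against a leap year.
# errors="coerce" turns values that cannot be parsed into NaT.
ozon[1] = pd.to_datetime(ozon[1] + "-2020", format="%m-%d-%Y", errors="coerce")

print(ozon)
```

Rows that end up as NaT can then be inspected with ozon[ozon[1].isna()] to see which inputs were malformed.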

How can I calculate the number of days between two dates with different format in Python?

I have a pandas dataframe with a column of orderdates formatted like this: 2019-12-26.
However, when I take the max of this column it gives 2019-12-12, while it should be 2019-12-26. I think this is because my date format is Dutch and the max() function uses the 'American' format (correct me if I'm wrong).
This means that my calculations aren't correct.
How can I change the way the function calculates? Or, if that's not possible, how can I change the format of my date column so the calculations are correct?
[In] df['orderdate'] = df['orderdate'].astype('datetime64[ns]')
print(df["orderdate"].max())
[Out] 2019-12-12 00:00:00
Thank you!
