Pandas reads date from CSV incorrectly - python

I am very new to Python, and finding it very frustrating.
I have a CSV that I am importing, but it's reading the date column incorrectly.
In the Month column, I have the 1st of each month - so it should read (yyyy-mm-dd):
2020-01-01
2020-02-01
2020-03-01
etc
however, it's reading it as (yyyy-dd-mm):
2020-01-01
2020-01-02
2020-01-03
etc
I've tried several conversion functions from stackoverflow as well as other websites, but they either just don't work, or do nothing.
My import is as follows:
import pandas as pd

try:
    collections_data = pd.read_csv('./monthly_collections.csv')
    print("Collections Data imported successfully.")
except Exception as e:
    print("Error importing Collections Data!")
I have tried the parse_dates parameter on the import, but it doesn't help.
If I then try this:
temp = pd.to_datetime(collections_data['Collections Month'], format='%m/%d/%Y')
temp
then I get
which you can see, it is reading the months as the days - in other words, it is showing individual days of the month, instead of the 1st day of each month.
I'd greatly appreciate some help to get these dates corrected, as I need to do some date calculations on them, and also join two tables based on this date - which is going to be my next problem.
Kind Regards

Inferring date format
Some dates are ambiguous, while others aren't. Consider these dates:
2020-27-01
2020-12-14
2020-01-02
11-10-12
In examples #1 and #2 we can easily infer the date format. In example #1, the first four digits have to be the year (there's no 2020th month or 2020th day of a month), the following two digits have to be the day of the month (there's no 27th month and we already have the year), and the last two digits are the month (we already have the year and the day of the month). We can use a similar approach for example #2.
Example #3 is ambiguous: is that the first day of the second month, or the second day of the first month? It's impossible to tell without more information. If, for instance, we had the following sequence of dates: '2020-22-01', '2020-25-01', '2020-01-02', it would be reasonable to infer that '2020-01-02' refers to the first day of the second month, because otherwise we would not be able to parse the previous two dates.
In example #4, it's impossible to infer the date format. Any pair of digits would make sense as a year, month or day. pandas.read_csv() accepts the dayfirst kwarg, pandas.to_datetime() accepts both dayfirst and yearfirst, or you can declare your date format explicitly with pandas.to_datetime(some_series, format=...).
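To make the ambiguity concrete, here is a small sketch that uses only the standard library's datetime.strptime rather than pandas. The sample strings come from the examples above; the format strings are my own choice for illustration.

```python
from datetime import datetime

# '2020-01-02' parses under both layouts, so the string alone is ambiguous.
month_first = datetime.strptime("2020-01-02", "%Y-%m-%d")  # 2 January 2020
day_first = datetime.strptime("2020-01-02", "%Y-%d-%m")    # 1 February 2020

# '2020-27-01' only fits year-day-month, because there is no 27th month.
unambiguous = datetime.strptime("2020-27-01", "%Y-%d-%m")  # 27 January 2020

try:
    datetime.strptime("2020-27-01", "%Y-%m-%d")
    fits_month_first = True
except ValueError:
    # The month-first layout is rejected, which is what lets us infer the format.
    fits_month_first = False

print(month_first.date(), day_first.date(), unambiguous.date(), fits_month_first)
```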
Your problem
Your dates are ambiguous: from what you've included in your question it is not possible to infer whether they are in a day-first format (dd-mm) or a month-first format (mm-dd). pandas defaults to dayfirst=False, so an ambiguous date such as 01/02/2020 is read month first, as the second day of the first month, unless you specify otherwise. See pandas.read_csv().
dayfirst : bool, default False
DD/MM format dates, international and European format.
The above means that in order to parse 01/02 (DD/MM), 2020/02/01 (iso/international format) or 01/02/2020 (European format) as the first day of the second month you will need to specify pandas.read_csv('somefile.csv', ..., dayfirst=True).
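As a rough sketch of how dayfirst changes the result, the snippet below reads the same in-memory CSV twice. The sample data is made up for illustration (only the 'Collections Month' column name comes from the question), and note that dayfirst only matters for columns that are actually parsed as dates, hence parse_dates.

```python
import io

import pandas as pd

# Hypothetical sample data: day-first dates, the 1st of February and March.
csv_text = "Collections Month,Amount\n01/02/2020,100\n01/03/2020,200\n"

# Default (dayfirst=False): ambiguous dates are read month first.
month_first = pd.read_csv(io.StringIO(csv_text), parse_dates=["Collections Month"])

# dayfirst=True: the first number is treated as the day.
day_first = pd.read_csv(
    io.StringIO(csv_text), parse_dates=["Collections Month"], dayfirst=True
)

print(month_first["Collections Month"].dt.strftime("%Y-%m-%d").tolist())
print(day_first["Collections Month"].dt.strftime("%Y-%m-%d").tolist())
```

If the first list shows consecutive days in January and the second shows the 1st of consecutive months, the file is day first.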
I've tried several conversion functions from stackoverflow as well as other websites, but they either just don't work, or do nothing.
You haven't provided the code that you've used that didn't work, nor the code which you used which parsed your dates as month first. If you include an example of what you actually tried I can make a specific comment.
In your question you say that your date format is in (yyyy-mm-dd) but you passed format='%m/%d/%Y' and in your screenshots you have '/' and '-' as your separator in different places. So I'm not sure what your original dates look like.
What you passed to the format kwarg means the first two digits are zero-padded months (e.g. 04) followed by a '/', then zero-padded days, followed by a '/' and then the year as yyyy. If what you wrote at the beginning of your question is correct, you should have passed format='%Y-%m-%d' (see the strftime format codes).
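Here is a minimal sketch of matching the format string to the data with pandas.to_datetime. Since I don't know what your raw strings look like, both sample series below are assumptions, one per separator style mentioned in the question.

```python
import pandas as pd

# Assumed raw values: either ISO-style strings or day-first strings with slashes.
iso_strings = pd.Series(["2020-01-01", "2020-02-01", "2020-03-01"])
slash_strings = pd.Series(["01/01/2020", "01/02/2020", "01/03/2020"])

# The format string must describe the text exactly, separators included.
iso_dates = pd.to_datetime(iso_strings, format="%Y-%m-%d")
day_first_dates = pd.to_datetime(slash_strings, format="%d/%m/%Y")

# A format that does not match the text raises instead of guessing.
try:
    pd.to_datetime(iso_strings, format="%m/%d/%Y")
    mismatch_raises = False
except ValueError:
    mismatch_raises = True

print(iso_dates.dt.month.tolist(), day_first_dates.dt.month.tolist(), mismatch_raises)
```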

Try https://towardsdatascience.com/4-tricks-you-should-know-to-parse-date-columns-with-pandas-read-csv-27355bb2ad0e
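Essentially, try the dayfirst optional argument of the read_csv function. It only takes effect on columns that are parsed as dates, so combine it with parse_dates.
You would set it to True and have
collections_data = pd.read_csv('./monthly_collections.csv', parse_dates=['Collections Month'], dayfirst=True)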

Related

How to convert an int64 into a datetime?

I'm trying to convert the column Year (type: int64) into a date type so that I can use the Groupby function to group by decade.
I'm using the following code to convert the datatype:
import datetime as dt
crime["Date"]=pd.TimedeltaIndex(crime["Year"], unit='d')+dt.datetime(1960,1,1)
crime[["Year","Date"]].head(10)
Screenshot of output
The date it is returning to me is not correct - it isn't starting at the correct year and the day is increasing by the rows.
I want the year to start at 1960, and for each row the year to increase by 1.
I tried substituting unit='d' in the code above with unit='y' and I get the following result:
ValueError: Units 'M' and 'Y' are no longer supported, as they do not represent unambiguous timedelta value durations.
I think @kate's answer is what you want. I wrote my answer before that one came along. I thought my answer might still be worth something to explain why unit='y' isn't supported, and why unit='d' isn't working for you either...
I wouldn't think this would be right:
TimedeltaIndex(crime["Year"], unit='d')
as I expect this to be interpreting your year count as a count of days. If you can't use unit='y', then maybe there's a good reason for that. Maybe that is because years don't always have the same number of days in them, and so specifying a number of years is ambiguous in terms of the number of days that equates to. You have to add any count of years to an actual year for it to make exact sense.
The same holds true, even more so, for months, since months have a variety of day counts, so you can have no idea what a timedelta in months really means.
I would add the column in the following way:
crime['Date'] = crime['Year'].map(lambda x: dt.datetime(1960 + x,1,1))
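A small runnable sketch of that approach follows. It assumes, as the lambda above does, that Year holds offsets from 1960 (0, 1, 2, ...); the sample values and the Decade column are my own additions for illustration. If your column already holds calendar years such as 1960, drop the `1960 +` part.

```python
import datetime as dt

import pandas as pd

# Hypothetical data: Year is assumed to be an offset from 1960.
crime = pd.DataFrame({"Year": [0, 1, 10, 11]})

# Build a real date for 1 January of each year.
crime["Date"] = crime["Year"].map(lambda x: dt.datetime(1960 + int(x), 1, 1))

# Round the calendar year down to the start of its decade for grouping.
crime["Decade"] = crime["Date"].dt.year // 10 * 10

print(crime.groupby("Decade").size())
```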

How can I work better with dates in Python to remove NaNs and identify workdays and holidays between two intervals?

I have a dataframe with two date fields as shown below. I want to be able to use this data to calculate 'adjusted pay' for an employee - if the employee joined after the 15th of a month, they are paid for 15 days of March + April on the 10th of the month (payday), and equally if they leave in April, the calculation should only consider the days worked in April.
Hire_Date | Leaving_Date
_________________________
01/02/2007 | NaN
02/03/2007 | NaN
23/03/2020 | NaN
01/01/1999 | 04/04/2020
Oh and the above data didn't pull through in datetime format, and there are of course plenty of NaNs in the leaving_date field :)
Therefore, I did the following:
Converted the data to datetime format, retained the date, and filled N/As with a random date (not too happy about this, but this is only missing in a few records so not worried about the impact).
df['Hire_Date'] = pd.to_datetime(df['Hire_Date'])
df['Hire_Date'] = [a.date() for a in df['Hire_Date']]
df['Hire_Date'] = df['Hire_Date'].fillna('1800-01-01')
Repeated for Leaving date. Only difference here is that I've filled the NaNs with 0, given that we don't have that many leavers.
df['Leaving_Date'] = pd.to_datetime(df['Leaving_Date'])
df['Leaving_Date'] = [a.date() for a in df['Leaving_Date']]
df['Leaving_Date'] = df['Leaving_Date'].fillna('0')
I then ended up creating a fresh column to capture workdays, and here's where I run into the issue. My code is given below.
I identified the first day of the hire month, and attempted to work out the number of days worked in March, using a np.where() function.
df['z_First_Day_H_Month'] = df['Hire_Date'].values.astype('datetime64[M]')
df['March_Workdays'] = np.where((df['z_First_Day_H_Month'] >= '2020-03-01'),
(np.busday_count(df['z_First_Day_H_Month'], '2020-03-31')), 'N/A')
Similar process repeated, albeit a simpler calculation to work out the number of days worked in the termination month.
df['z_First_Day_T_Month'] = df.apply(lambda x: '2020-04-01').astype('datetime64[M]')
df['T_Mth_Workdays'] = df.apply(lambda x: np.busday_count(x['z_First_Day_T_Month'],
                                                          x['Leaving_Date']))
However, the above process returns the following error:
Iterator operand 0 dtype could not be cast from dtype('<M8[ns]') to dtype('<M8[D]') according to the rule 'safe'
Please can I get some help to fix this issue? Thanks!
I did a bit of research and it seems that the datetime unit might be the problem. The [ns] unit has nanosecond precision, while np.busday_count expects dates with day precision, [D], which causes the error. Take a look at the NumPy documentation and check the Datetime Units section.
Numpy, TypeError: Could not be cast from dtype('<M8[us]') to dtype('<M8[D]')
Take a look at this post. It is the exact same problem as yours!
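A minimal sketch of the conversion, assuming the dates are already a pandas datetime column. The sample hire dates and the end date are made up for illustration; the key step is casting to datetime64[D] before calling np.busday_count.

```python
import numpy as np
import pandas as pd

# Hypothetical hire dates stored with pandas' default sub-day precision.
hire_dates = pd.to_datetime(pd.Series(["2020-03-02", "2020-03-23"]))

# Cast to day precision, which is what np.busday_count accepts.
start_days = hire_dates.to_numpy().astype("datetime64[D]")

# Count weekdays from each hire date up to, but not including, 1 April 2020.
march_workdays = np.busday_count(start_days, np.datetime64("2020-04-01"))

print(march_workdays.tolist())
```

Note that the end date is exclusive, so pass the day after the last day you want counted.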

Aligning datetime formats for comparison

I'm having trouble aligning two different dates. I have an Excel import which I turn into a DateTime in pandas and I would like to compare this DateTime with the current DateTime. The troubles are in the formatting of the imported DateTime.
Excel format of the date:
2020-07-06 16:06:00 (which is yyyy-dd-mm hh:mm:ss)
When I add the DateTime to my DataFrame it creates the datatype Object. After I convert it with pd.to_datetime it creates the format yyyy-mm-dd hh:mm:ss. It seems that the month and the day are getting mixed up.
Example code:
df = pd.read_excel('my path')
df['Arrival'] = pd.to_datetime(df['Arrival'], format='%Y-%d-%m %H:%M:%S')
print(df.dtypes)
Expected result:
2020-06-07 16:06:00
Actual result:
2020-07-06 16:06:00
How do I resolve this?
Gr,
Sempah
An ISO-8601 date/time is always yyyy-MM-dd, not yyyy-dd-MM. You've got the month and date positions switched around.
While localized date/time strings are inconsistent about the order of month and date, this particular format where the year comes first always starts with the biggest units (years) and decreases in unit size going right (month, date, hour, etc.)
It's solved. I think that I misunderstood the results. It was already working without my knowledge. Thanks for the help anyway.
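For anyone landing here later, a quick standard-library sketch of the year-month-day ordering, using the timestamp from the question. The comparison with the current time is only there to show that no reformatting is needed once the value is a real datetime.

```python
from datetime import datetime

# A year-first timestamp is read as year, month, day: this is 6 July 2020.
arrival = datetime.fromisoformat("2020-07-06 16:06:00")
print(arrival.month, arrival.day)

# Datetime objects compare directly, whatever their printed format.
is_past = arrival < datetime.now()
print(is_past)
```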

Python: Date conversion to year-weeknumber, issue at switch of year

I am trying to convert a dataframe column with a date and timestamp to a year-weeknumber format, i.e., 01-05-2017 03:44 = 2017-1. This is pretty easy, however, I am stuck at dates that are in a new year, yet their weeknumber is still the last week of the previous year. The same thing that happens here.
I did the following:
df['WEEK_NUMBER'] = df.date.dt.year.astype(str).str.cat(df.date.dt.week.astype(str), sep='-')
Where df['date'] is a very large column with date and times, ranging over multiple years.
A date which gives a problem is for example:
Timestamp('2017-01-01 02:11:27')
The output for my code will be 2017-52, while it should be 2016-52. Since the data covers multiple years, and weeknumbers and their corresponding dates change every year, I cannot simply subtract a few days.
Does anybody have an idea of how to fix this? Thanks!
Replace df.date.dt.year by this:
(df.date.dt.year- ((df.date.dt.week>50) & (df.date.dt.month==1)))
Basically, it means that you will subtract 1 from the year value if the week number is greater than 50 and the month is January.
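As an alternative sketch (not the approach above), pandas 1.1 and later expose dt.isocalendar(), which returns the ISO year together with the ISO week, so the year already matches the week number. The two sample timestamps are taken from the question, with the first read as 5 January 2017 (the reading that gives week 1); the rest is assumed for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {"date": pd.to_datetime(["2017-01-01 02:11:27", "2017-01-05 03:44:00"])}
)

# isocalendar() gives 'year', 'week' and 'day' columns following ISO 8601.
iso = df["date"].dt.isocalendar()
df["WEEK_NUMBER"] = iso["year"].astype(str) + "-" + iso["week"].astype(str)

print(df["WEEK_NUMBER"].tolist())
```

This also avoids dt.week, which newer pandas versions no longer provide.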

Is there an equivalent for dateutil relativedelta(weekday=FR) but for MONTHS?

I would like to parse dates in a script and I was wondering if there is an equivalent of the:
relativedelta(days=1, weekday=MO)
but for months?
For now, I extract the month number in my text and compare it to the document's creation date (and month). However, this is quite long and repetitive (and I have to do it for future tenses, present tenses and past tenses)...
Adding relativedelta(month=2) to a datetime object will give you the same date and time, except in February. If this creates a non-existent date, the date will be truncated at the last existing date, e.g.:
from datetime import datetime
from dateutil.relativedelta import relativedelta
print(datetime(2015, 3, 30) + relativedelta(month=2)) # 2015-02-28 00:00:00
As explained in the relativedelta documentation:
year, month, day, hour, minute, second, microsecond:
Absolute information (argument is singular); adding or subtracting a
relativedelta with absolute information does not perform an arithmetic
operation, but rather REPLACES the corresponding value in the
original datetime with the value(s) in relativedelta.
All the "singular" arguments are processed as "set this component of the thing I'm being added to/subtracted from to this value", whereas the plural versions of the same arguments say "Add/subtract this number to/from this component".
Note that the relativedelta documentation also lays out the order that each component is applied, but suffice to say that absolute values are applied before relative values, so relativedelta(month=3, months=2) will set the month to March, then add 2 months (so, basically, it's equivalent to relativedelta(month=5)).
The equivalent of weekday = MO (or weekday = calendar.MONDAY) can be month = 1 for January, month = 2 for February, and so on.
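A short sketch contrasting the singular (absolute) and plural (relative) arguments, reusing the base date from the example above. The variable names are mine.

```python
from datetime import datetime

from dateutil.relativedelta import relativedelta

base = datetime(2015, 3, 30)

# Singular: replace the month with February (day is clipped to the 28th).
absolute = base + relativedelta(month=2)

# Plural: add two months to the current month.
relative = base + relativedelta(months=2)

# Both: set the month to March first, then add two months.
combined = base + relativedelta(month=3, months=2)

print(absolute.date(), relative.date(), combined.date())
```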
These operations are handled most gracefully by the Python arrow library. Datetime objects can be converted to Arrow objects and vice versa if you need to continue using datetime, e.g. for compatibility with other modules.
