How to convert an int64 into a datetime? - python

I'm trying to convert the column Year (type: int64) into a date type so that I can use the Groupby function to group by decade.
I'm using the following code to convert the datatype:
import datetime as dt
crime["Date"]=pd.TimedeltaIndex(crime["Year"], unit='d')+dt.datetime(1960,1,1)
crime[["Year","Date"]].head(10)
Screenshot of output
The date it is returning to me is not correct - it isn't starting at the correct year and the day is increasing by the rows.
I want the year to start at 1960, and for each row the year to increase by 1.
I tried substituting unit='d' in the code above with unit='y' and I get the following result:
ValueError: Units 'M' and 'Y' are no longer supported, as they do not represent unambiguous timedelta value durations.

I think @kate's answer is what you want. I wrote my answer before that one came along, but it might still be worth something as an explanation of why unit='y' isn't supported, and why unit='d' isn't working for you either...
I wouldn't think this would be right:
TimedeltaIndex(crime["Year"], unit='d')
as I expect this interprets your year values as a count of days. There is a good reason why unit='y' isn't supported: years don't always have the same number of days in them, so a bare count of years is ambiguous as a duration. A count of years only has an exact meaning once you add it to an actual starting date.
The same holds true, even more so, for months: their lengths vary from 28 to 31 days, so a timedelta in months has no fixed length.
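A small standard-library sketch of the ambiguity described above (the variable names are illustrative, not from the question): treating "one year" as a fixed 365 days gives different calendar results depending on the starting year.

```python
from datetime import date, timedelta

# "One year" approximated as a fixed duration of 365 days.
one_year_as_days = timedelta(days=365)

# 1960 is a leap year (366 days), so 365 days later is still inside 1960.
after_leap_year = date(1960, 1, 1) + one_year_as_days

# 1961 has 365 days, so the same offset lands exactly on the next 1 January.
after_regular_year = date(1961, 1, 1) + one_year_as_days

print(after_leap_year)     # 1960-12-31
print(after_regular_year)  # 1962-01-01
```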

I would add the column in the following way:
crime['Date'] = crime['Year'].map(lambda x: dt.datetime(1960 + x,1,1))
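The line above treats Year as an offset from 1960. If the column instead already holds four-digit calendar years (an assumption, since the screenshot is not available), a sketch like the following may be closer to what is needed. The sample data and the Total and Decade columns are made up for illustration.

```python
import pandas as pd

# Made-up sample data; "Year" is assumed to hold calendar years as int64.
crime = pd.DataFrame(
    {"Year": [1960, 1961, 1969, 1970, 1971], "Total": [10, 20, 30, 40, 50]}
)

# Parse each year as 1 January of that year.
crime["Date"] = pd.to_datetime(crime["Year"].astype(str), format="%Y")

# For grouping by decade, integer arithmetic on the year is enough.
crime["Decade"] = (crime["Year"] // 10) * 10
totals_by_decade = crime.groupby("Decade")["Total"].sum()

print(crime[["Year", "Date", "Decade"]])
print(totals_by_decade)
```

Note that the decade grouping does not actually require the datetime conversion; the Date column is only needed if later steps want real dates.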

Related

Having an issue converting a set of very large integers into timedeltas

I am working with a dataframe that includes a column of integers where the units are the number of days since 0001-01-01. I need to convert these integers into current dates. When I attempt to use the pd.to_timedelta function to convert these integers into Timedeltas that I can then add to the start date, the results are not what I expect. Please run the following code for an example:
import pandas as pd
df = pd.DataFrame([735110,735111,735112,735114], columns=["days_since_0001-01-01"])
df['time_added'] = pd.to_timedelta(df['days_since_0001-01-01'], unit='d')
print(df[0:1])
#output: 735110 94598 days 01:16:18.871345152
As you can see, the timedelta result for the first row in days is 94,598, plus additional units for the hours/seconds/minutes etc. What I was expecting for that row was 735,110 days, plus 0s for the rest of the timestamp. Further, if I use a DataFrame with a smaller number of days, the output is as I would expect. I have come to the conclusion that to_timedelta cannot handle very large numbers of days; however, I do not know of an alternative method to do this. I could simply reduce the number of days by an arbitrary amount and increase the start date, but I would still need to know the proper amount by which to reduce the days integer and the amount by which to increase the start date. Any help would be appreciated.
Every year is 365 days, except for leap years, which are 366 days and occur every four years (in the Gregorian calendar, century years are leap years only when divisible by 400). So I just need to pick an arbitrary point within the acceptable timedelta range (e.g. 1850), calculate the number of regular years and leap years between 0001-01-01 and 1850-01-01, multiply the regular years by 365 and the leap years by 366, then subtract that total from my days_since_0001-01-01. Then I can set the start date to 1850-01-01 and add the remaining days to it to get the current date.
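As an alternative to counting leap years by hand, here is a minimal sketch using only the standard library. It assumes the integers count days elapsed since 0001-01-01 in the proleptic Gregorian calendar, which is what Python's datetime.date uses. The background to the overflow is that pandas stores timedeltas as 64-bit nanosecond counts, which cover only about 292 years (roughly 106,751 days), so 735,110 days does not fit.

```python
from datetime import date, timedelta

# Values from the question: days elapsed since 0001-01-01 (assumed convention).
days_since_epoch = [735110, 735111, 735112, 735114]

EPOCH = date(1, 1, 1)

# Python's timedelta stores whole days as integers, so there is no
# nanosecond overflow at this magnitude.
dates = [EPOCH + timedelta(days=n) for n in days_since_epoch]

for n, d in zip(days_since_epoch, dates):
    print(n, d.isoformat())
```

If the source system counts 0001-01-01 as day 1 rather than day 0, date.fromordinal(n) is the matching call, and the results shift by one day. The resulting list can be passed to pd.to_datetime once the dates are inside the range pandas supports.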

Pandas reads date from CSV incorrectly

I am very new to Python, and finding it very frustrating.
I have a CSV that I am importing, but it's reading the date column incorrectly.
In the Month column, I have the 1st of each month - so it should read (yyyy-mm-dd):
2020-01-01
2020-02-01
2020-03-01
etc
however, it's reading it as (yyyy-dd-mm)
2020-01-01
2020-01-02
2020-01-03
etc
I've tried several conversion functions from stackoverflow as well as other websites, but they either just don't work, or do nothing.
My import is as follows:
import pandas as pd

try:
    collections_data = pd.read_csv('./monthly_collections.csv')
    print("Collections Data imported successfully.")
except Exception as e:
    # 'error' was an undefined name; catch Exception and show the cause
    print("Error importing Collections Data!", e)
I have tried the parse_dates parameter on the import, but it doesn't help.
If I then try this:
temp = pd.to_datetime(collections_data['Collections Month'], format='%m/%d/%Y')
temp
then I get
which you can see, it is reading the months as the days - in other words, it is showing individual days of the month, instead of the 1st day of each month.
I'd greatly appreciate some help to get these dates corrected, as I need to do some date calculations on them, and also join two tables based on this date - which is going to be my next problem.
Kind Regards
Inferring date format
Some dates are ambiguous, while others aren't. Consider these dates:
2020-27-01
2020-12-14
2020-01-02
11-10-12
In examples #1 & #2 we can easily infer the date format. In example #1, the first four digits have to be the year (there's no 2020th month or 2020th day of a month), the following two digits have to be the day of the month (there's no 27th month and we already have year information) and the last two digits are the month (we already have year and day of month information). We can use a similar approach for example #2.
For example #3, is that the first day of the second month, or the second day of the first month? It's impossible to tell without more information. If, for instance, we had the following sequence of dates: '2020-22-01', '2020-25-01', '2020-01-02', it would be reasonable to infer that '2020-01-02' refers to the first day of the second month, otherwise we would not be able to parse the previous two dates.
In example #4, it's impossible to infer the date format: any pair of digits would make sense as a year, month or day. In such cases, pandas.read_csv() accepts the dayfirst kwarg (used together with parse_dates), and pandas.to_datetime() accepts dayfirst, yearfirst, or an explicit format= argument.
Your problem
Your dates are ambiguous: from what you've included in your question, it is not possible to infer whether they are in a day-first format (dd-mm) or a month-first format (mm-dd). pandas defaults to dayfirst=False, so an ambiguous date such as 01/02/2020 is read month-first, as the second day of the first month, unless you specify otherwise. See pandas.read_csv().
dayfirst : bool, default False
DD/MM format dates, international and European format.
This means that in order to parse 01/02 (DD/MM) or 01/02/2020 (European format) as the first day of the second month, you will need to specify pandas.read_csv(somefile.csv, ... dayfirst=True), along with parse_dates for the date column.
"I've tried several conversion functions from stackoverflow as well as other websites, but they either just don't work, or do nothing."
You haven't provided the code that you've used that didn't work, nor the code which you used which parsed your dates as month first. If you include an example of what you actually tried I can make a specific comment.
In your question you say that your date format is (yyyy-mm-dd), but you passed format='%m/%d/%Y', and in your screenshots you have '/' and '-' as your separator in different places. So I'm not sure what your original dates look like.
What you passed to the format kwarg means the first two digits are the zero-padded month (e.g. 04), followed by a '/', then the zero-padded day, followed by '/' and then the year as yyyy. If what you wrote at the beginning of your question is correct, you should have passed format='%Y-%m-%d' (see the strftime format codes).
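To make the effect of the format string concrete, here is a small sketch. The raw strings are hypothetical (the question does not show the file contents); they are written day-first, as 1 February, 1 March and 1 April 2020.

```python
import pandas as pd

# Hypothetical raw values as they might appear in the CSV (dd/mm/yyyy).
raw = pd.Series(["01/02/2020", "01/03/2020", "01/04/2020"])

# Month-first format: the leading "01" is taken as January every time.
month_first = pd.to_datetime(raw, format="%m/%d/%Y")

# Day-first format: the leading "01" is the day, the middle field the month.
day_first = pd.to_datetime(raw, format="%d/%m/%Y")

print(month_first.dt.strftime("%Y-%m-%d").tolist())
print(day_first.dt.strftime("%Y-%m-%d").tolist())
```

With the month-first format the result is consecutive days in January, which matches the symptom described in the question; the day-first format gives the first day of consecutive months.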
Try https://towardsdatascience.com/4-tricks-you-should-know-to-parse-date-columns-with-pandas-read-csv-27355bb2ad0e
Essentially, try the dayfirst optional input for the read_csv function.
You would set it to True and have
collections_data = pd.read_csv('./monthly_collections.csv', parse_dates=['Collections Month'], dayfirst=True)
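A self-contained sketch of that call, using an in-memory CSV instead of the real file. The column names and values are assumptions based on the question, and dayfirst only has an effect on columns that are actually parsed as dates, hence parse_dates.

```python
import io

import pandas as pd

# Stand-in for monthly_collections.csv; contents are made up for illustration.
csv_text = """Collections Month,Amount
01/01/2020,100
01/02/2020,150
01/03/2020,125
"""

collections_data = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=["Collections Month"],  # which column(s) to parse as dates
    dayfirst=True,                      # read 01/02/2020 as 1 February
)

print(collections_data.dtypes)
print(collections_data["Collections Month"].dt.month.tolist())
```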

Formatting date data in NumPy array

I would be really grateful for some advice. I had an exercise as written below:
The first column (index 0) contains year values as four digit numbers
in the format YYYY (2016, since all trips in our data set are from
2016). Use assignment to change these values to the YY format (16) in
the test_array ndarray.
I used a code to solve it:
test_array[:,0] = test_array[:,0]%100
But I'm sure there has to be a more universal and smarter way to get the same result with datetime or something else, and I can't find it. I tried different variations of this code, but I don't get what's wrong:
dt.datetime.strptime(str(test_array[:,0]), "%Y")
test_array[:,0] = dt.datetime.strftime("%y")
Could you help me with this, please?
Thank you
Converting the year from YYYY to YY format this way requires an intermediate datetime value, on which strftime can then be called (this assumes the data is in a pandas DataFrame df and that datetime was imported as dt, as in the question):
df.iloc[:, 0] = df.iloc[:, 0].apply(lambda x: dt.datetime(x, 1, 1).strftime('%y'))
Building the datetime needs three arguments: year, month and day. Only the year is known, so month and day are both set to 1. Note that dt.datetime is used here because the pd.datetime alias has been removed from recent pandas versions.
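Since the question works with a NumPy ndarray rather than a DataFrame, here is a sketch of the same idea applied directly to an array. The contents of test_array are made up; only the first column matters.

```python
import datetime as dt

import numpy as np

# Made-up stand-in for test_array: column 0 holds four-digit years.
test_array = np.array([[2016, 1, 1], [2016, 2, 14], [2016, 3, 30]])

# Go through datetime: build 1 January of each year, format it as two digits,
# then convert back to int so it fits the integer array.
two_digit_years = [
    int(dt.datetime(int(year), 1, 1).strftime("%y")) for year in test_array[:, 0]
]
test_array[:, 0] = two_digit_years

print(test_array)
```

For purely numeric data, the modulo approach from the question (test_array[:, 0] % 100) gives the same values with less work; the datetime route mainly pays off when other date formatting is needed as well.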

Add future date-time to pandas df

This is, I think, a rather simple question for which I have not been able to find a proper answer.
I have a pandas dataframe with the following characteristics
shape(frame)
Out[117]: (3652, 2)
Here 3652 refers to days within a decade (3652 since we have 2 leap years)
I would like to add a third column that shows date range between 2035-01-01 and 2044-12-31
Many thanks
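One possible approach is pd.date_range, sketched here with a hypothetical stand-in for the frame. One thing to check first: the span 2035-01-01 to 2044-12-31 contains three leap years (2036, 2040 and 2044), so it has 3653 days, one more than the 3652 rows. Fixing the start date and letting the row count set the end avoids a length mismatch.

```python
import pandas as pd

# Hypothetical stand-in for the asker's frame: 3652 rows, 2 columns.
frame = pd.DataFrame({"a": range(3652), "b": range(3652)})

# Option 1: fix the start date and let the number of rows decide the end.
frame["date"] = pd.date_range(start="2035-01-01", periods=len(frame), freq="D")

# Option 2: fix both endpoints, but compare lengths before assigning.
full_range = pd.date_range(start="2035-01-01", end="2044-12-31", freq="D")

print(len(full_range), len(frame))
print(frame["date"].iloc[0], frame["date"].iloc[-1])
```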

Python: Date conversion to year-weeknumber, issue at switch of year

I am trying to convert a dataframe column with a date and timestamp to a year-weeknumber format, i.e., 01-05-2017 03:44 = 2017-1. This is pretty easy, however, I am stuck at dates that are in a new year, yet their weeknumber is still the last week of the previous year. The same thing that happens here.
I did the following:
df['WEEK_NUMBER'] = df.date.dt.year.astype(str).str.cat(df.date.dt.week.astype(str), sep='-')
Where df['date'] is a very large column with date and times, ranging over multiple years.
A date which gives a problem is for example:
Timestamp('2017-01-01 02:11:27')
The output for my code will be 2017-52, while it should be 2016-52. Since the data covers multiple years, and weeknumbers and their corresponding dates change every year, I cannot simply subtract a few days.
Does anybody have an idea of how to fix this? Thanks!
Replace df.date.dt.year by this:
(df.date.dt.year- ((df.date.dt.week>50) & (df.date.dt.month==1)))
Basically, it means that you will subtract 1 from the year value if the week number is greater than 50 and the month is January.
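An alternative sketch, assuming a pandas version that provides Series.dt.isocalendar() (1.1 or later): it returns the ISO year alongside the ISO week, so no manual correction is needed. It also covers the opposite edge case, where late-December dates belong to week 1 of the following year. The sample timestamps other than the one from the question are made up.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "date": pd.to_datetime(
            [
                "2017-01-01 02:11:27",  # Sunday, ISO week 52 of 2016
                "2017-01-05 03:44:00",  # ISO week 1 of 2017
                "2018-12-31 10:00:00",  # Monday, ISO week 1 of 2019
            ]
        )
    }
)

# isocalendar() yields columns "year", "week" and "day" following ISO 8601.
iso = df["date"].dt.isocalendar()
df["WEEK_NUMBER"] = iso["year"].astype(str) + "-" + iso["week"].astype(str)

print(df)
```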
