How to convert two columns from decimal years to date - python

I'm new to Python and I have a problem.
I have two columns of data in decimal years in a .txt file, and I want to transform each number in the two columns into a date (yyyy-mm-dd):
2014.16020 2019.07190
2000.05750 2019.10750
2001.82610 2019.10750
2010.36280 2019.07190
2005.24570 2019.10750
2015.92610 2019.10750
2003.43600 2014.37100
and then take the difference between the two dates in each row to obtain the number of days between them.
For example, the result should look like:
1825
3285
2920
3283
etc.

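Give the proper paths for your input and output files; the code below does the rest.
import pandas as pd
df = pd.read_csv('your_file.txt', delimiter=' ', header=None)
def to_date(col):
    # parse_dates cannot read decimal years: use start of year + fraction of that year's length
    year = col.astype(int)
    start = pd.to_datetime(year.astype(str), format='%Y')
    end = pd.to_datetime((year + 1).astype(str), format='%Y')
    return (start + (col - year) * (end - start)).dt.floor('D')
df['date_difference'] = (to_date(df[1]) - to_date(df[0])).dt.days
df.to_csv('your_file_result.txt', header=False, sep=' ', index=False)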

I put my explanations in the comments in the code below.
from datetime import datetime  # yes, they named a class the same as the module
x = '''2014.16020 2019.07190
2000.05750 2019.10750
2001.82610 2019.10750
2010.36280 2019.07190
2005.24570 2019.10750
2015.92610 2019.10750
2003.43600 2014.37100'''
# split input into lines; the assumption here is that there is one pair of dates per line
lines = x.splitlines()
# set up a container (list) for outputs
deltas = []
# process line by line
for line in lines:
    # split line into separate dates
    inputs = line.split()
    dates = []
    for text in inputs:
        # convert text to number
        date_decimal = float(text)
        # year is the integer part of the input
        date_year = int(date_decimal)
        # the fraction of the year is what is left after we subtract the year
        year_fraction = date_decimal - date_year
        # a little oversimplified here with int and assuming all years have 365 days;
        # the +1 is needed because %j counts days of the year from 1, not 0
        days = int(year_fraction * 365) + 1
        # now convert the year and days into string and then into date (there is probably a better way to do this - without the string step)
        date = datetime.strptime("{}-{}".format(date_year, days), "%Y-%j")
        # see https://docs.python.org/3/library/datetime.html#strftime-strptime-behavior for format explanation
        dates.append(date)
    deltas.append(dates[1] - dates[0])
# now print outputs
for delta in deltas:
    print(delta.days)
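The 365-day simplification above can be avoided by scaling the fraction by the actual length of each year. The following is only a minimal sketch, not part of the answer above: the helper names decimal_year_to_date and days_between are made up for illustration, and it assumes the fractional part means "fraction of that calendar year already elapsed".

```python
from datetime import datetime


def decimal_year_to_date(value):
    """Convert a decimal year such as 2014.1602 to a date (hypothetical helper)."""
    year = int(value)
    start = datetime(year, 1, 1)
    end = datetime(year + 1, 1, 1)
    # (end - start) is 365 or 366 days, so leap years are handled automatically
    return (start + (value - year) * (end - start)).date()


def days_between(first, second):
    """Number of days from the first decimal year to the second (hypothetical helper)."""
    return (decimal_year_to_date(second) - decimal_year_to_date(first)).days


print(decimal_year_to_date(2000.5))
print(days_between(2000.0, 2001.0))
```

Whether the time of day should be truncated (as here, via .date()) or rounded depends on how the decimal years were produced, so treat that choice as an assumption too.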

Related

Combining Year and DayOfYear, H:M:S columns into date time object

I have a time column with the format XXXHHMMSS where XXX is the Day of Year. I also have a year column. I want to merge both these columns into one date time object.
Before I had detached XXX into a new column but this was making it more complicated.
I've converted the two columns to strings
points['UTC_TIME'] = points['UTC_TIME'].astype(str)
points['YEAR_'] = points['YEAR_'].astype(str)
Then I have the following line:
points['Time'] = pd.to_datetime(points['YEAR_'] * 1000 + points['UTC_TIME'], format='%Y%j%H%M%S')
I'm getting this error: ValueError: time data '137084552' does not match format '%Y%j%H%M%S' (match)
Here is a photo of my columns and a link to the data
It works fine for me if you combine both columns as strings, e.g.:
import pandas as pd
df = pd.DataFrame({'YEAR_': [2002, 2002, 2002],
                   'UTC_TIME': [99082552, 135082552, 146221012]})
pd.to_datetime(df['YEAR_'].astype(str) + df['UTC_TIME'].astype(str).str.zfill(9),
               format="%Y%j%H%M%S")
# 0 2002-04-09 08:25:52
# 1 2002-05-15 08:25:52
# 2 2002-05-26 22:10:12
# dtype: datetime64[ns]
Note: since %j expects a zero-padded day of year, you might need to zero-fill; see the first row in the example above.
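If the zero-padding feels fragile, the same result can be reached with integer arithmetic instead of string formats. This is a sketch under the assumption that UTC_TIME is stored as an integer laid out as DDDHHMMSS; the intermediate names (doy, hh, mm, ss, result) are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'YEAR_': [2002, 2002, 2002],
                   'UTC_TIME': [99082552, 135082552, 146221012]})

t = df['UTC_TIME']
doy = t // 1_000_000        # day of year (1-based)
hh = t // 10_000 % 100      # hours
mm = t // 100 % 100         # minutes
ss = t % 100                # seconds

# Start at January 1 of each year, then add the offsets as timedeltas
result = (pd.to_datetime(df['YEAR_'].astype(str), format='%Y')
          + pd.to_timedelta(doy - 1, unit='D')
          + pd.to_timedelta(hh, unit='h')
          + pd.to_timedelta(mm, unit='min')
          + pd.to_timedelta(ss, unit='s'))
print(result)
```

Because the day of year is extracted numerically, a two-digit day such as 99 needs no padding.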

Split based on _ and find the difference between dates

I am trying to find the difference between the two dates below. They are in the format "dd_mm_yyyy". I split the two strings on _ to extract the day, month and year.
previous_date = "filename_03_03_2021"
current_date = "filename_09_03_2021"
previous_array = previous_date.split("_")
I'm not sure what to do after that to combine them into a date and find the difference between the dates in days.
Any leads/suggestions would be appreciated.
You could index into the list after the split, e.g. previous_array[1], to get the values and pass them to date.
But instead of using split, you might use a pattern with 3 capture groups to make the match a bit more specific, extract the numbers, then subtract the dates and read the days value.
You might also make the date-like pattern stricter, for example by limiting the valid day and month ranges.
import re
from datetime import date
previous_date = "filename_03_03_2021"
current_date = "filename_09_03_2021"
pattern = r"filename_(\d{2})_(\d{2})_(\d{4})"
pMatch = re.match(pattern, previous_date)
cMatch = re.match(pattern, current_date)
pDate = date(int(pMatch.group(3)), int(pMatch.group(2)), int(pMatch.group(1)))
cDate = date(int(cMatch.group(3)), int(cMatch.group(2)), int(cMatch.group(1)))
print((cDate - pDate).days)
Output
6
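The split-based route mentioned at the start of this answer can be sketched as follows. The helper name date_from_name is hypothetical, and the sketch assumes every name ends in _dd_mm_yyyy.

```python
from datetime import datetime


def date_from_name(name):
    """Parse the trailing _dd_mm_yyyy part of a name (hypothetical helper)."""
    # rsplit from the right so underscores earlier in the name do not matter
    day, month, year = name.rsplit("_", 3)[-3:]
    return datetime.strptime(f"{day}_{month}_{year}", "%d_%m_%Y").date()


previous_date = "filename_03_03_2021"
current_date = "filename_09_03_2021"

days = (date_from_name(current_date) - date_from_name(previous_date)).days
print(days)
```

Using strptime here has the side benefit of rejecting impossible dates such as 31_02_2021 with a ValueError.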

Convert Timestamp to Date only

I've been looking through every thread I can find, and the only one relevant to this type of formatting issue is for Java:
How parse 2013-03-13T20:59:31+0000 date string to Date
I've got a column with values like 201604 and 201605 that I need to turn into date values like 2016-04-01 and 2016-05-01. To accomplish this, I've done what is below.
#Create Number to build full date
df['DAY_NBR'] = '01'
#Convert Max and Min date to string to do date transformation
df['MAXDT'] = df['MAXDT'].astype(str)
df['MINDT'] = df['MINDT'].astype(str)
#Add the day number to the max date month and year
df['MAXDT'] = df['MAXDT'] + df['DAY_NBR']
#Add the day number to the min date month and year
df['MINDT'] = df['MINDT'] + df['DAY_NBR']
#Convert Max and Min date to integer values
df['MAXDT'] = df['MAXDT'].astype(int)
df['MINDT'] = df['MINDT'].astype(int)
#Convert Max date to datetime
df['MAXDT'] = pd.to_datetime(df['MAXDT'], format='%Y%m%d')
#Convert Min date to datetime
df['MINDT'] = pd.to_datetime(df['MINDT'], format='%Y%m%d')
To be honest, I can work with this output, but it's a little messy because the unique values for the two columns are...
MAXDT Values
['2016-07-01T00:00:00.000000000' '2017-09-01T00:00:00.000000000'
'2018-06-01T00:00:00.000000000' '2017-07-01T00:00:00.000000000'
'2017-03-01T00:00:00.000000000' '2018-12-01T00:00:00.000000000'
'2017-12-01T00:00:00.000000000' '2019-01-01T00:00:00.000000000'
'2018-09-01T00:00:00.000000000' '2018-10-01T00:00:00.000000000'
'2016-04-01T00:00:00.000000000' '2018-03-01T00:00:00.000000000'
'2017-05-01T00:00:00.000000000' '2018-08-01T00:00:00.000000000'
'2017-02-01T00:00:00.000000000' '2016-12-01T00:00:00.000000000'
'2018-01-01T00:00:00.000000000' '2018-02-01T00:00:00.000000000'
'2017-06-01T00:00:00.000000000' '2018-11-01T00:00:00.000000000'
'2018-05-01T00:00:00.000000000' '2019-11-01T00:00:00.000000000'
'2016-06-01T00:00:00.000000000' '2017-10-01T00:00:00.000000000'
'2016-08-01T00:00:00.000000000' '2018-04-01T00:00:00.000000000'
'2016-03-01T00:00:00.000000000' '2016-10-01T00:00:00.000000000'
'2016-11-01T00:00:00.000000000' '2019-12-01T00:00:00.000000000'
'2016-09-01T00:00:00.000000000' '2017-08-01T00:00:00.000000000'
'2016-05-01T00:00:00.000000000' '2017-01-01T00:00:00.000000000'
'2017-11-01T00:00:00.000000000' '2018-07-01T00:00:00.000000000'
'2017-04-01T00:00:00.000000000' '2016-01-01T00:00:00.000000000'
'2016-02-01T00:00:00.000000000' '2019-02-01T00:00:00.000000000'
'2019-07-01T00:00:00.000000000' '2019-10-01T00:00:00.000000000'
'2019-09-01T00:00:00.000000000' '2019-03-01T00:00:00.000000000'
'2019-05-01T00:00:00.000000000' '2019-04-01T00:00:00.000000000'
'2019-08-01T00:00:00.000000000' '2019-06-01T00:00:00.000000000'
'2020-02-01T00:00:00.000000000' '2020-01-01T00:00:00.000000000']
MINDT Values
['2016-04-01T00:00:00.000000000' '2017-07-01T00:00:00.000000000'
'2016-02-01T00:00:00.000000000' '2017-01-01T00:00:00.000000000'
'2017-02-01T00:00:00.000000000' '2018-12-01T00:00:00.000000000'
'2017-08-01T00:00:00.000000000' '2018-04-01T00:00:00.000000000'
'2017-10-01T00:00:00.000000000' '2019-01-01T00:00:00.000000000'
'2018-05-01T00:00:00.000000000' '2018-09-01T00:00:00.000000000'
'2018-10-01T00:00:00.000000000' '2016-01-01T00:00:00.000000000'
'2016-03-01T00:00:00.000000000' '2017-11-01T00:00:00.000000000'
'2017-05-01T00:00:00.000000000' '2018-07-01T00:00:00.000000000'
'2018-06-01T00:00:00.000000000' '2017-12-01T00:00:00.000000000'
'2016-10-01T00:00:00.000000000' '2018-02-01T00:00:00.000000000'
'2017-06-01T00:00:00.000000000' '2018-08-01T00:00:00.000000000'
'2018-03-01T00:00:00.000000000' '2018-11-01T00:00:00.000000000'
'2016-08-01T00:00:00.000000000' '2016-06-01T00:00:00.000000000'
'2018-01-01T00:00:00.000000000' '2016-07-01T00:00:00.000000000'
'2016-11-01T00:00:00.000000000' '2016-09-01T00:00:00.000000000'
'2017-04-01T00:00:00.000000000' '2016-05-01T00:00:00.000000000'
'2017-09-01T00:00:00.000000000' '2016-12-01T00:00:00.000000000'
'2017-03-01T00:00:00.000000000']
I'm trying to build a loop that runs through these dates, and it works, but I don't want to have an index with all of these irrelevant zeros and a T in it. How can I convert these empty timestamp values to just the date that is in yyyy-mm-dd format?
Thank you!
Unfortunately, I believe pandas does not support a day-resolution datetime dtype: values are stored as datetime64[ns] (newer versions also allow second, millisecond, and microsecond resolutions), so the time part is always present. Even if you attempt to store them as datetime64[D], pandas will not keep that dtype.
It's possible to store these values as strings instead, but the simplest solution is likely to strip the time part when you loop over them (i.e., using df['MAXDT'].to_numpy().astype('datetime64[D]') and looping over the resulting NumPy array), or to reformat each value using datetime.
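A short sketch of those options, using two made-up sample values in place of the real MAXDT column (the variable names as_strings, as_dates and as_days are illustrative):

```python
import pandas as pd

# Sample data standing in for the real column (an assumption for this demo)
df = pd.DataFrame({'MAXDT': [201604, 201605]})

# Parsing year+month directly makes the helper DAY_NBR column unnecessary
df['MAXDT'] = pd.to_datetime(df['MAXDT'].astype(str), format='%Y%m')

# Option 1: strings in yyyy-mm-dd form
as_strings = df['MAXDT'].dt.strftime('%Y-%m-%d')

# Option 2: Python datetime.date objects (the column becomes object dtype)
as_dates = df['MAXDT'].dt.date

# Option 3: a NumPy array with day resolution, outside of pandas
as_days = df['MAXDT'].to_numpy().astype('datetime64[D]')

for day in as_days:
    print(day)
```

Keeping the column itself as a datetime type and converting only at the point of display or iteration preserves the ability to do date arithmetic on it.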

Python calculate the number of year in date column

I've recently started coding with Python, and I'm struggling to calculate the number of years between the current date and a given date.
Dataframe
I would like to calculate the number of years for each column.
I tried this but it's not working:
def Number_of_years(d1,d2):
    if d1 is not None:
        return relativedelta(d2,d1).years
for col in df.select_dtypes(include=['datetime64[ns]']):
    df[col]=Number_of_years(df[col],date.today())
Can anyone help me find a solution to this?
I see that the dates are in day/month/year format.
Given that this format is the same for all the cells, you can parse the dates using the datetime module like so:
from datetime import datetime  # import the datetime class
import numpy as np
def numberOfYears(element):
    # parse the date string according to the fixed format
    date = datetime.strptime(element, '%d/%m/%Y')
    # return the difference in calendar years
    return datetime.today().year - date.year
# apply the function element-wise with np.vectorize
function = np.vectorize(numberOfYears)
# This returns a NumPy array containing the differences in years.
# Call this for each column, and you should be good.
difference = function(df.Date_creation)
Your code is basically right, but you're operating on a pandas Series, so you can't just call relativedelta on it directly; apply the function to each element instead:
from datetime import date
from dateutil.relativedelta import relativedelta
def number_of_years(d1, d2):
    return relativedelta(d2, d1).years
for col in df.select_dtypes(include=['datetime64[ns]']):
    df[col] = df[col].apply(lambda d: number_of_years(d, date.today()))
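Note that subtracting calendar years and counting completed years can differ by one when the anniversary has not yet been reached. A vectorised sketch of the completed-years variant, without dateutil, might look like this; the function name full_years, the sample dates and the fixed reference date are assumptions chosen so that the result is reproducible.

```python
import pandas as pd


def full_years(dates, today):
    """Completed years between each date and `today` (hypothetical helper)."""
    today = pd.Timestamp(today)
    years = today.year - dates.dt.year
    # Subtract one where this year's anniversary is still in the future
    before_anniversary = (dates.dt.month > today.month) | (
        (dates.dt.month == today.month) & (dates.dt.day > today.day)
    )
    return years - before_anniversary.astype(int)


df = pd.DataFrame({
    "Date_creation": pd.to_datetime(["15/06/2010", "14/06/2010"], format="%d/%m/%Y")
})
df["years"] = full_years(df["Date_creation"], "2020-06-14")
print(df)
```

In real use the second argument would be pd.Timestamp.today(); a fixed date is used here only to keep the example deterministic.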

Pandas converting date with string in

I'm starting with Python, pandas and matplotlib. I'm working with data with over a million entries. I'm trying to change the date format. In the CSV file the date format is 23-JUN-11. I would like to use the dates later to plot the amount of donations for each candidate. How do I convert the dates to a format that pandas can read?
Here is the link to cut file 149 entries
My code:
%matplotlib
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
First candidate
reader_bachmann = pd.read_csv('P00000001-ALL.csv' ,converters={'cand_id': lambda x: str(x)[1:]},parse_dates=True, squeeze=True, low_memory=False, nrows=411 )
date_frame = pd.DataFrame(reader_bachmann, columns = ['contb_receipt_dt'])
Data slice
s = date_frame.iloc[:,0]
date_slice = pd.Series([s])
date_strip = date_slice.str.replace('JUN','6')
Trying to convert to new date format
date = pd.to_datetime(s, format='%d%b%Y')
print(date_slice)
Here is the error message
ValueError: could not convert string to float: '05-JUL-11'
You need to use a different date format string:
format='%d-%b-%y'
Why?
The error message gives a clue as to what is wrong:
ValueError: could not convert string to float: '05-JUL-11'
The format string controls the conversion, and is currently:
format='%d%b%Y'
And the fields needed are:
%y - year without a century (range 00 to 99)
%b - abbreviated month name
%d - day of the month (01 to 31)
What is missing are the - characters that separate the fields in your data string, and the lowercase y for a two-digit year instead of the current uppercase Y for a four-digit year.
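As a quick check of the corrected format string, here is a small sketch using two sample values taken from the question and its error message rather than the full CSV file:

```python
import pandas as pd

# Two sample values in the same layout as the contb_receipt_dt column
s = pd.Series(['23-JUN-11', '05-JUL-11'])

# %d = day, %b = abbreviated month name, %y = two-digit year, separated by '-'
dates = pd.to_datetime(s, format='%d-%b-%y')
print(dates)
```

Month-name matching depends on the locale, so this assumes English month abbreviations.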
As an alternative, you can use dateutil.parser to parse strings containing dates directly. I have created a dummy DataFrame for the demo.
import pandas as pd
l = []
for i in range(100):
    l.append('23-JUN-11')
B = pd.DataFrame({'Date':l})
Now let's import dateutil.parser and apply it to our date column:
import dateutil.parser
B['Date2'] = B['Date'].apply(lambda x : dateutil.parser.parse(x))
B.head()
Out[106]:
        Date      Date2
0  23-JUN-11 2011-06-23
1  23-JUN-11 2011-06-23
2  23-JUN-11 2011-06-23
3  23-JUN-11 2011-06-23
4  23-JUN-11 2011-06-23
