How do I extract dates from an Excel sheet? - python

I am trying to extract dates from an Excel sheet using the pandas library.
data = pd.read_excel (import_file_path)
df = pd.DataFrame(data,columns = ['birthday'])
This works but I don't really know how to work with DataFrames and I just need a list/array of the ages, so I tried to convert it into a numpy array:
array = df.to_numpy()
This works fine as well but elements of the array look like:
[datetime.datetime(1983, 6, 4, 0, 0)]
But I can't use the methods provided by datetime to convert the dates.
What would be the best approach to get a list/array of ages eventually?
Update:
Birthday
1 2002-03-15 00:00:00
2 1999-04-17 00:00:00
3 1993-06-04 00:00:00
4 1997-07-04 00:00:00
5 1983-08-09 00:00:00
6 2000-01-10 00:00:00
7 1996-08-20 00:00:00
8 2003-11-06 00:00:00

Assuming your date column is called Birthday, something like the following should work:
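import numpy as np
import pandas as pd
df = pd.DataFrame({'Birthday' : pd.date_range(start='01/01/88',end='02/02/95',freq='M')}) # month-end dates; pandas 2.2+ spells this freq 'ME'
df['Today'] = pd.Timestamp(2019, 6, 13) # pd.datetime was removed in pandas 2.0
df['Years'] = (df['Today'] - df['Birthday']).dt.days / 365.2425 # 365.2425 days is the year length behind np.timedelta64(1, 'Y'), a unit pandas 2.x no longer accepts in this division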
print(df.head(5))
Birthday Today Years
0 1988-01-31 2019-06-13 31.365463
1 1988-02-29 2019-06-13 31.286063
2 1988-03-31 2019-06-13 31.201188
3 1988-04-30 2019-06-13 31.119051
4 1988-05-31 2019-06-13 31.034176
Then simply convert the column to a NumPy array:
a = np.array(df['Years'])
print(a)
array([31.36546267, 31.28606337, 31.20118825, 31.11905104, 31.03417592,
30.95203871, 30.8671636 , 30.78228848, 30.70015127, 30.61527615,
30.53313894, 30.44826382, 30.36338871, 30.28672731, 30.20185219,
30.11971498, 30.03483987, 29.95270266, 29.86782754, 29.78295242]
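If whole-year ages are what you are after rather than fractional years, a minimal sketch could look like the one below. It assumes the column already has a pandas datetime dtype; the helper name ages_in_years and the fixed reference date are made up for illustration and are not part of the original answer.

```python
import pandas as pd

def ages_in_years(birthdays: pd.Series, today: pd.Timestamp) -> pd.Series:
    """Whole-year ages on `today` (hypothetical helper, not a pandas function)."""
    # True where this year's birthday has already happened (or is today).
    had_birthday = (birthdays.dt.month < today.month) | (
        (birthdays.dt.month == today.month) & (birthdays.dt.day <= today.day)
    )
    # Subtract one year for everyone whose birthday is still to come this year.
    return today.year - birthdays.dt.year - (~had_birthday).astype(int)

# Sample birthdays taken from the question's update.
df = pd.DataFrame({"Birthday": pd.to_datetime([
    "2002-03-15", "1999-04-17", "1993-06-04", "1997-07-04",
    "1983-08-09", "2000-01-10", "1996-08-20", "2003-11-06",
])})
ages = ages_in_years(df["Birthday"], pd.Timestamp(2019, 6, 13)).tolist()
print(ages)
```

Calling .tolist() (or .to_numpy()) on the result gives the plain list/array the question asks for.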

Ok there was a row with irregular data in it, which messed up the conversion.
Handling of the types works fine now, thanks!

Related

how to enter manually a Python dataframe with daily dates in a correct format

I would like to (manually) create a dataframe in Python with daily dates (in column 'date'), as in the code below.
But the code does not produce the correct format for the dates; they come out as plain numbers instead (the desired format is shown below).
Could you please advise how I can correct the code so that the 'date' column is entered in the desired format?
Thanks in advance!
------------------------------------------------------
desired format for date column
2021-03-22 3
2021-04-07 3
2021-04-18 3
2021-05-12 0
------------------------------------------------------
df1 = pd.DataFrame({"date": [2021-3-22, 2021-4-7, 2021-4-18, 2021-5-12],
"x": [3, 3, 3, 0 ]})
df1
date x
0 1996 3
1 2010 3
2 1999 3
3 2004 0
Python wants to interpret the numbers in the sequence 2021-3-22 as a series of mathematical operations 2021 minus 3 minus 22.
If you want each item to be stored as a string that resembles a date, you need to write it as a string literal (str) by enclosing it in quotes, as shown below.
import pandas as pd
df1 = pd.DataFrame({"date": ['2021-3-22', '2021-4-7', '2021-4-18', '2021-5-12'],
"x": [3, 3, 3, 0 ]})
The results for the date column, as shown here indicate that the date column contains elements of the object datatype which encompasses str in pandas. Notice that the strings were created exactly as shown (2021-3-22 instead of 2021-03-22).
0 2021-3-22
1 2021-4-7
2 2021-4-18
3 2021-5-12
Name: date, dtype: object
If, however, you want them stored as datetime objects so that you can do datetime manipulations on them (e.g. determine the number of days between two dates, or filter by a specific month or year), then you need to convert the values to datetime objects.
This will do that:
df1['date'] = pd.to_datetime(df1['date'])
The results of this conversion are Pandas datetime objects which enable nanosecond precision (I differentiate this from Python datetime objects which are limited to microsecond precision).
0 2021-03-22
1 2021-04-07
2 2021-04-18
3 2021-05-12
Name: date, dtype: datetime64[ns]
Notice the displayed results are now formatted just as you would expect of datetimes (2021-03-22 instead of 2021-3-22).
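To illustrate the kind of datetime manipulations mentioned above, here is a small sketch that filters by month and computes the number of days between consecutive rows. The variable names april and gaps are only for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({
    "date": pd.to_datetime(["2021-3-22", "2021-4-7", "2021-4-18", "2021-5-12"]),
    "x": [3, 3, 3, 0],
})

# Keep only the rows whose date falls in April.
april = df1[df1["date"].dt.month == 4]

# Days elapsed since the previous row (the first row has no predecessor, so NaN).
gaps = df1["date"].diff().dt.days

print(april)
print(gaps)
```

Neither operation is possible while the column still holds strings.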
Alternatively, create the column as datetimes from the start by passing the date strings through pd.to_datetime (see the pandas.to_datetime documentation for more info):
df1 = pd.DataFrame({"date": pd.to_datetime(["2021-3-22", "2021-4-7", "2021-4-18", "2021-5-12"]),
"x": [3, 3, 3, 0 ]})
FWIW, I often use pd.read_csv(io.StringIO(text)) to copy/paste tabular-looking data into a DataFrame (for example, from SO questions).
Example:
import io
import re
import pandas as pd
def df_read(txt, **kwargs):
    txt = '\n'.join([s.strip() for s in txt.splitlines()])
    return pd.read_csv(io.StringIO(re.sub(r' +', '\t', txt)), sep='\t', **kwargs)
txt = """
date value
2021-03-22 3
2021-04-07 3
2021-04-18 3
2021-05-12 0
"""
df = df_read(txt, parse_dates=['date'])
>>> df
date value
0 2021-03-22 3
1 2021-04-07 3
2 2021-04-18 3
3 2021-05-12 0
>>> df.dtypes
date datetime64[ns]
value int64
dtype: object

How to change str to date when year data inconsistent?

I've got a dataframe with a column named birthdates. The values are all strings; most are saved as %d.%m.%Y, some as %d.%m.%y.
How can I make this work?
df["birthdates_clean"] = pd.to_datetime(df["birthdates"], format = "%d.%m.%Y")
If this can't work, do I need to filter the rows? How would I do it?
Thanks for taking time to answer!
I am not sure what the expected output is, but you can let to_datetime parse the dates automatically:
df = pd.DataFrame({"birthdates": ['01.01.2000', '01.02.00', '02.03.99',
'02.03.22', '01.01.71', '01.01.72']})
# as datetime
df["birthdates_clean"] = pd.to_datetime(df["birthdates"], dayfirst=True)
# as custom string
df["birthdates_clean2"] = (pd.to_datetime(df["birthdates"], dayfirst=True)
.dt.strftime('%d.%m.%Y')
)
NB: the pivot for two-digit years is currently at 71/72: 71 gets evaluated as 2071 and 72 as 1972.
output:
birthdates birthdates_clean birthdates_clean2
0 01.01.2000 2000-01-01 01.01.2000
1 01.02.00 2000-02-01 01.02.2000
2 02.03.99 1999-03-02 02.03.1999
3 02.03.22 2022-03-02 02.03.2022
4 01.01.71 2071-01-01 01.01.2071
5 01.01.72 1972-01-01 01.01.1972
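Two caveats, as far as I can tell: pandas 2.x infers one format from the first value, so mixed inputs like these may need format="mixed" there, and a birthdate in the future (2071) is almost certainly wrong. A rough sketch that sidesteps both is to try each explicit format in turn and push any future date back by a century. The helper name parse_birthdates, the extra '15.08.30' row and the fixed reference date are assumptions for illustration. Note that with an explicit %y the pivot is the strptime one (00-68 become 20xx, 69-99 become 19xx), not the 71/72 one above.

```python
import pandas as pd

def parse_birthdates(raw: pd.Series, today: pd.Timestamp) -> pd.Series:
    """Parse %d.%m.%Y or %d.%m.%y strings (hypothetical helper)."""
    # Rows that do not match a format become NaT instead of raising.
    four_digit = pd.to_datetime(raw, format="%d.%m.%Y", errors="coerce")
    two_digit = pd.to_datetime(raw, format="%d.%m.%y", errors="coerce")
    parsed = four_digit.fillna(two_digit)
    # A birthdate cannot lie in the future, so move such dates back 100 years.
    in_future = parsed > today
    return parsed.where(~in_future, parsed - pd.DateOffset(years=100))

df = pd.DataFrame({"birthdates": ["01.01.2000", "01.02.00", "02.03.99",
                                  "02.03.22", "01.01.71", "01.01.72",
                                  "15.08.30"]})
clean = parse_birthdates(df["birthdates"], pd.Timestamp("2022-06-01"))
df["birthdates_clean"] = clean
print(df)
```

In real use you would pass pd.Timestamp.today() as the reference date; it is fixed here only so the result is reproducible.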

Is there any Python code to help me replace the years of every date by 2022

I have a pandas dataframe column named disbursal_date which is a datetime:
disbursal_date
2009-01-28
2008-01-03
2008-07-15
and so on...
I want to keep the date and month part and replace the years by 2022 for all values.
I tried using df['disbursal_date'].map(lambda x: x.replace(year=2022)) but this didn't work for me.
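Both map and apply can run a Python function on a dataframe column, so the choice between them is unlikely to be the problem.
More likely the column's dtype is object or string rather than a pandas datetime, or the result was never assigned back to the dataframe.
Below is sample code that converts the column first and then replaces the year with 2022; it works fine for me.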
df = pd.DataFrame(['2009-01-28', '2008-01-03', '2008-07-15'],columns=['disbursal_old'])
df['disbursal_old'] = df['disbursal_old'].astype('datetime64[ns]')
df['disbursal_new'] = df['disbursal_old'].apply(lambda x : x.replace(year=2022))
print(df['disbursal_new'])
0 2022-01-28
1 2022-01-03
2 2022-07-15
Name: disbursal_new, dtype: datetime64[ns]
The below code gives the difference between the years.
df['disbursal_diff_year'] = df['disbursal_new'].dt.year - df['disbursal_old'].dt.year
print(df)
disbursal_old disbursal_new disbursal_diff_year
0 2009-01-28 2022-01-28 13
1 2008-01-03 2022-01-03 14
2 2008-07-15 2022-07-15 14
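One thing to watch out for: x.replace(year=2022) raises a ValueError for a 29 February date, because 2022 is not a leap year. A possible workaround, sketched below, rebuilds the dates from their parts and lets invalid ones become NaT. The helper name set_year and the extra leap-day row are assumptions for illustration, and the sketch assumes the column contains no missing values.

```python
import pandas as pd

def set_year(dates: pd.Series, year: int) -> pd.Series:
    """Return `dates` with the year replaced (hypothetical helper)."""
    # pd.to_datetime can assemble dates from year/month/day columns.
    parts = pd.DataFrame({
        "year": year,
        "month": dates.dt.month,
        "day": dates.dt.day,
    })
    # errors="coerce" turns impossible dates (Feb 29 in a non-leap year) into NaT.
    return pd.to_datetime(parts, errors="coerce")

df = pd.DataFrame({"disbursal_date": pd.to_datetime(
    ["2009-01-28", "2008-01-03", "2008-07-15", "2008-02-29"])})
result = set_year(df["disbursal_date"], 2022)
print(result)
```

Whether NaT is the right outcome for leap days depends on your data; mapping them to 28 February instead would be another reasonable choice.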

How to work around the date range limit in Pandas for plotting?

Sorry if this question has been asked before, but I can't seem to find one that describes my current issue.
Basically, I have a large climate dataset that is not bound to "real" dates. The dataset starts at "year one" and goes to "year 9999". These dates are stored as strings such as Jan-01, Feb-01, Mar-01 etc, where the number indicates the year. When trying to convert this column to date time objects, I get an out of range error. (My reading into this suggests this is due to a 64bit limit on the possible datetime timestamps that can exist)
What is a good way to work around this problem/process the date information so I can effectively plot the associated data vs these dates, over this ~10,000 year period?
Thanks
The cftime library was created specifically for this purpose, and xarray has a convenient xr.cftime_range function that makes creating such a range easy:
In [3]: import xarray as xr, pandas as pd
In [4]: date_range = xr.cftime_range('0001-01-01', '9999-01-01', freq='D')
In [5]: type(date_range)
Out[5]: xarray.coding.cftimeindex.CFTimeIndex
This creates a CFTimeIndex object which plays nicely with pandas:
In [8]: df = pd.DataFrame({"date": date_range, "vals": range(len(date_range))})
In [9]: df
Out[9]:
date vals
0 0001-01-01 00:00:00 0
1 0001-01-02 00:00:00 1
2 0001-01-03 00:00:00 2
3 0001-01-04 00:00:00 3
4 0001-01-05 00:00:00 4
... ... ...
3651692 9998-12-28 00:00:00 3651692
3651693 9998-12-29 00:00:00 3651693
3651694 9998-12-30 00:00:00 3651694
3651695 9998-12-31 00:00:00 3651695
3651696 9999-01-01 00:00:00 3651696
[3651697 rows x 2 columns]
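If the dates are only needed as an x-axis for plotting, a lighter-weight alternative to cftime might be to skip datetime objects altogether and convert each label to a fractional year. This is only a sketch: it assumes the labels look like 'Jan-01' with an English month abbreviation followed by the year number, as described in the question, and the helper name to_decimal_year is made up.

```python
# English month abbreviations mapped to month numbers 1..12.
MONTHS = {name: i for i, name in enumerate(
    "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split(), start=1)}

def to_decimal_year(label: str) -> float:
    """Convert a label such as 'Feb-01' to a fractional year (hypothetical helper)."""
    month_name, year = label.split("-")
    # Each month advances the value by one twelfth of a year.
    return int(year) + (MONTHS[month_name] - 1) / 12

labels = ["Jan-01", "Feb-01", "Dec-01", "Jan-9999"]
x = [to_decimal_year(s) for s in labels]
print(x)
```

The resulting floats can be passed straight to a plotting library as x values, at the cost of treating every month as equally long.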

How to get the number of business days between two dates in pandas

I have the following column in a dataframe. I would like to add a column to the end of this dataframe that holds the number of business days between each date and today (6/24).
The Bday() function does not seem to have this capability.
Date
2019-6-21
2019-6-20
2019-6-14
I am looking for a result that looks like following:
Date Business days
2019-6-21 1
2019-6-20 2
2019-6-14 6
Is there an easy way to do this, other than doing individual manipulations or using the datetime library?
Use np.busday_count:
# df['Date'] = pd.to_datetime(df['Date']) # if needed
np.busday_count(df['Date'].dt.date, np.datetime64('today'))
# array([1, 2, 6])
df['bdays'] = np.busday_count(df['Date'].dt.date, np.datetime64('today'))
df
Date bdays
0 2019-06-21 1
1 2019-06-20 2
2 2019-06-14 6
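Because np.datetime64('today') changes every day, the numbers above only reproduce on 2019-06-24. Here is a self-contained sketch with the end date pinned to that day; the variable names start and end are just for illustration. np.busday_count counts the start date but not the end date, and by default treats Monday to Friday as business days with no holidays.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Date": pd.to_datetime(["2019-06-21", "2019-06-20", "2019-06-14"])})

# busday_count works on day-resolution dates, so drop the nanosecond precision.
start = df["Date"].to_numpy().astype("datetime64[D]")
end = np.datetime64("2019-06-24")  # fixed "today" so the result is reproducible

# Counts business days in the half-open interval [start, end).
df["bdays"] = np.busday_count(start, end)
print(df)
```

If holidays matter, np.busday_count also takes a holidays argument listing dates to exclude.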
