Reading CSV file in Pandas with historical dates - python

I'm trying to read a file in with dates in the (UK) format 13/01/1800, however some of the dates are before 1667, which cannot be represented by the nanosecond timestamp (see http://pandas.pydata.org/pandas-docs/stable/gotchas.html#gotchas-timestamp-limits). I understand from that page I need to create my own PeriodIndex to cover the range I need (see http://pandas.pydata.org/pandas-docs/stable/timeseries.html#timeseries-oob) but I can't understand how I convert the string in the csv reader to a date in this periodindex.
So far I have:
span = pd.period_range('1000-01-01', '2100-01-01', freq='D')
df_earliest= pd.read_csv("objects.csv", index_col=0, names=['Object Id', 'Earliest Date'], parse_dates=[1], infer_datetime_format=True, dayfirst=True)
How do I apply the span to the date reader/converter so I can create a PeriodIndex / DateTimeIndex column in the dataframe ?

you can try to do it this way:
fn = r'D:\temp\.data\36987699.csv'
def dt_parse(s):
d,m,y = s.split('/')
return pd.Period(year=int(y), month=int(m), day=int(d), freq='D')
df = pd.read_csv(fn, parse_dates=[0], date_parser=dt_parse)
Input file:
Date,col1
13/01/1800,aaa
25/12/1001,bbb
01/03/1267,ccc
Test:
In [16]: df
Out[16]:
Date col1
0 1800-01-13 aaa
1 1001-12-25 bbb
2 1267-03-01 ccc
In [17]: df.dtypes
Out[17]:
Date object
col1 object
dtype: object
In [18]: df['Date'].dt.year
Out[18]:
0 1800
1 1001
2 1267
Name: Date, dtype: int64
PS you may want to add try ... catch block in the dt_parse() function for catching ValueError: exceptions - result of int()...

Related

Pandas - Problem reading a datetime64 object from a json file

I'm trying to save a Pandas data frame as a JSON file. One of the columns has the type datetime64[ns]. When I print the column the correct date is displayed. However, when I save it as a JSON file and read it back later the date value changes even after I change the column type back to datetime64[ns]. Here is a sample app which shows the issue I am running into:
#!/usr/bin/python3
import json
import pandas as pd
s = pd.Series(['1/1/2000'])
in_df = pd.DataFrame(s)
in_df[0] = pd.to_datetime(in_df[0], format='%m/%d/%Y')
print("in_df\n")
print(in_df)
print("\nin_df dtypes\n")
print(in_df.dtypes)
in_df.to_json("test.json")
out_df = pd.read_json("test.json")
out_df[0] = out_df[0].astype('datetime64[ns]')
print("\nout_df\n")
print(out_df)
print("\nout_df dtypes\n")
print(out_df.dtypes)
Here is the output:
in_df
0
0 2000-01-01
in_df dtypes
0 datetime64[ns]
dtype: object
out_df
0
0 1970-01-01 00:15:46.684800 <--- Why don't I get 2000-1-1 here?
out_df dtypes
0 datetime64[ns]
dtype: object
I'm expecting the get the original date displayed (2000-1-1) when I read back the JSON file. What am I doing wrong with my conversion? Thanks!
df = pd.read_json("test.json")
df[0] = pd.to_datetime(df[0], unit='ms')
print("\ndf\n")
print(df)
print("\ndf dtypes\n")
print(df.dtypes)
will give you
df
0
0 2000-01-01
df dtypes
0 datetime64[ns]
dtype: object
This should work for all millisecond json columns you need

Pandas convert partial column Index to Datetime

DataFrame below contains housing price dataset from 1996 to 2016.
Other than the first 6 columns, other columns need to be converted to Datetime type.
I tried to run the following code:
HousingPrice.columns[6:] = pd.to_datetime(HousingPrice.columns[6:])
but I got the error:
TypeError: Index does not support mutable operations
I wish to convert some columns in the columns Index to Datetime type, but not all columns.
The pandas index is immutable, so you can't do that.
However, you can access and modify the column index with array, see doc here.
HousingPrice.columns.array[6:] = pd.to_datetime(HousingPrice.columns[6:])
should work.
Note that this would change the column index only. In order to convert the columns values, you can do this :
date_cols = HousingPrice.columns[6:]
HousingPrice[date_cols] = HousingPrice[date_cols].apply(pd.to_datetime, errors='coerce', axis=1)
EDIT
Illustrated example:
data = {'0ther_col': [1,2,3], '1996-04': ['1996-04','1996-05','1996-06'], '1995-05':['1996-02','1996-08','1996-10']}
print('ORIGINAL DATAFRAME')
df = pd.DataFrame.from_records(data)
print(df)
print("\nDATE COLUMNS")
date_cols = df.columns[-2:]
print(df.dtypes)
print('\nCASTING DATE COLUMNS TO DATETIME')
df[date_cols] = df[date_cols].apply(pd.to_datetime, errors='coerce', axis=1)
print(df.dtypes)
print('\nCASTING DATE COLUMN INDEXES TO DATETIME')
print("OLD INDEX -", df.columns)
df.columns.array[-2:] = pd.to_datetime(df[date_cols].columns)
print("NEW INDEX -",df.columns)
print('\nFINAL DATAFRAME')
print(df)
yields:
ORIGINAL DATAFRAME
0ther_col 1995-05 1996-04
0 1 1996-02 1996-04
1 2 1996-08 1996-05
2 3 1996-10 1996-06
DATE COLUMNS
0ther_col int64
1995-05 object
1996-04 object
dtype: object
CASTING DATE COLUMNS TO DATETIME
0ther_col int64
1995-05 datetime64[ns]
1996-04 datetime64[ns]
dtype: object
CASTING DATE COLUMN INDEXES TO DATETIME
OLD INDEX - Index(['0ther_col', '1995-05', '1996-04'], dtype='object')
NEW INDEX - Index(['0ther_col', 1995-05-01 00:00:00, 1996-04-01 00:00:00], dtype='object')
FINAL DATAFRAME
0ther_col 1995-05-01 00:00:00 1996-04-01 00:00:00
0 1 1996-02-01 1996-04-01
1 2 1996-08-01 1996-05-01
2 3 1996-10-01 1996-06-01

Pandas: how to change only one column which is in a series contain same column name [duplicate]

I have a Dataframe, df, with the following column:
df['ArrivalDate'] =
...
936 2012-12-31
938 2012-12-29
965 2012-12-31
966 2012-12-31
967 2012-12-31
968 2012-12-31
969 2012-12-31
970 2012-12-29
971 2012-12-31
972 2012-12-29
973 2012-12-29
...
The elements of the column are pandas.tslib.Timestamp.
I want to just include the year and month. I thought there would be simple way to do it, but I can't figure it out.
Here's what I've tried:
df['ArrivalDate'].resample('M', how = 'mean')
I got the following error:
Only valid with DatetimeIndex or PeriodIndex
Then I tried:
df['ArrivalDate'].apply(lambda(x):x[:-2])
I got the following error:
'Timestamp' object has no attribute '__getitem__'
Any suggestions?
Edit: I sort of figured it out.
df.index = df['ArrivalDate']
Then, I can resample another column using the index.
But I'd still like a method for reconfiguring the entire column. Any ideas?
If you want new columns showing year and month separately you can do this:
df['year'] = pd.DatetimeIndex(df['ArrivalDate']).year
df['month'] = pd.DatetimeIndex(df['ArrivalDate']).month
or...
df['year'] = df['ArrivalDate'].dt.year
df['month'] = df['ArrivalDate'].dt.month
Then you can combine them or work with them just as they are.
The df['date_column'] has to be in date time format.
df['month_year'] = df['date_column'].dt.to_period('M')
You could also use D for Day, 2M for 2 Months etc. for different sampling intervals, and in case one has time series data with time stamp, we can go for granular sampling intervals such as 45Min for 45 min, 15Min for 15 min sampling etc.
You can directly access the year and month attributes, or request a datetime.datetime:
In [15]: t = pandas.tslib.Timestamp.now()
In [16]: t
Out[16]: Timestamp('2014-08-05 14:49:39.643701', tz=None)
In [17]: t.to_pydatetime() #datetime method is deprecated
Out[17]: datetime.datetime(2014, 8, 5, 14, 49, 39, 643701)
In [18]: t.day
Out[18]: 5
In [19]: t.month
Out[19]: 8
In [20]: t.year
Out[20]: 2014
One way to combine year and month is to make an integer encoding them, such as: 201408 for August, 2014. Along a whole column, you could do this as:
df['YearMonth'] = df['ArrivalDate'].map(lambda x: 100*x.year + x.month)
or many variants thereof.
I'm not a big fan of doing this, though, since it makes date alignment and arithmetic painful later and especially painful for others who come upon your code or data without this same convention. A better way is to choose a day-of-month convention, such as final non-US-holiday weekday, or first day, etc., and leave the data in a date/time format with the chosen date convention.
The calendar module is useful for obtaining the number value of certain days such as the final weekday. Then you could do something like:
import calendar
import datetime
df['AdjustedDateToEndOfMonth'] = df['ArrivalDate'].map(
lambda x: datetime.datetime(
x.year,
x.month,
max(calendar.monthcalendar(x.year, x.month)[-1][:5])
)
)
If you happen to be looking for a way to solve the simpler problem of just formatting the datetime column into some stringified representation, for that you can just make use of the strftime function from the datetime.datetime class, like this:
In [5]: df
Out[5]:
date_time
0 2014-10-17 22:00:03
In [6]: df.date_time
Out[6]:
0 2014-10-17 22:00:03
Name: date_time, dtype: datetime64[ns]
In [7]: df.date_time.map(lambda x: x.strftime('%Y-%m-%d'))
Out[7]:
0 2014-10-17
Name: date_time, dtype: object
If you want the month year unique pair, using apply is pretty sleek.
df['mnth_yr'] = df['date_column'].apply(lambda x: x.strftime('%B-%Y'))
Outputs month-year in one column.
Don't forget to first change the format to date-time before, I generally forget.
df['date_column'] = pd.to_datetime(df['date_column'])
SINGLE LINE: Adding a column with 'year-month'-paires:
('pd.to_datetime' first changes the column dtype to date-time before the operation)
df['yyyy-mm'] = pd.to_datetime(df['ArrivalDate']).dt.strftime('%Y-%m')
Accordingly for an extra 'year' or 'month' column:
df['yyyy'] = pd.to_datetime(df['ArrivalDate']).dt.strftime('%Y')
df['mm'] = pd.to_datetime(df['ArrivalDate']).dt.strftime('%m')
Extracting the Year say from ['2018-03-04']
df['Year'] = pd.DatetimeIndex(df['date']).year
The df['Year'] creates a new column. While if you want to extract the month just use .month
You can first convert your date strings with pandas.to_datetime, which gives you access to all of the numpy datetime and timedelta facilities. For example:
df['ArrivalDate'] = pandas.to_datetime(df['ArrivalDate'])
df['Month'] = df['ArrivalDate'].values.astype('datetime64[M]')
#KieranPC's solution is the correct approach for Pandas, but is not easily extendible for arbitrary attributes. For this, you can use getattr within a generator comprehension and combine using pd.concat:
# input data
list_of_dates = ['2012-12-31', '2012-12-29', '2012-12-30']
df = pd.DataFrame({'ArrivalDate': pd.to_datetime(list_of_dates)})
# define list of attributes required
L = ['year', 'month', 'day', 'dayofweek', 'dayofyear', 'weekofyear', 'quarter']
# define generator expression of series, one for each attribute
date_gen = (getattr(df['ArrivalDate'].dt, i).rename(i) for i in L)
# concatenate results and join to original dataframe
df = df.join(pd.concat(date_gen, axis=1))
print(df)
ArrivalDate year month day dayofweek dayofyear weekofyear quarter
0 2012-12-31 2012 12 31 0 366 1 4
1 2012-12-29 2012 12 29 5 364 52 4
2 2012-12-30 2012 12 30 6 365 52 4
Thanks to jaknap32, I wanted to aggregate the results according to Year and Month, so this worked:
df_join['YearMonth'] = df_join['timestamp'].apply(lambda x:x.strftime('%Y%m'))
Output was neat:
0 201108
1 201108
2 201108
There is two steps to extract year for all the dataframe without using method apply.
Step1
convert the column to datetime :
df['ArrivalDate']=pd.to_datetime(df['ArrivalDate'], format='%Y-%m-%d')
Step2
extract the year or the month using DatetimeIndex() method
pd.DatetimeIndex(df['ArrivalDate']).year
df['Month_Year'] = df['Date'].dt.to_period('M')
Result :
Date Month_Year
0 2020-01-01 2020-01
1 2020-01-02 2020-01
2 2020-01-03 2020-01
3 2020-01-04 2020-01
4 2020-01-05 2020-01
df['year_month']=df.datetime_column.apply(lambda x: str(x)[:7])
This worked fine for me, didn't think pandas would interpret the resultant string date as date, but when i did the plot, it knew very well my agenda and the string year_month where ordered properly... gotta love pandas!
Then I tried:
df['ArrivalDate'].apply(lambda(x):x[:-2])
I think here the proper input should be string.
df['ArrivalDate'].astype(str).apply(lambda(x):x[:-2])

Extract Day, Month and Hour from Timestamp string in Python [duplicate]

I have a Dataframe, df, with the following column:
df['ArrivalDate'] =
...
936 2012-12-31
938 2012-12-29
965 2012-12-31
966 2012-12-31
967 2012-12-31
968 2012-12-31
969 2012-12-31
970 2012-12-29
971 2012-12-31
972 2012-12-29
973 2012-12-29
...
The elements of the column are pandas.tslib.Timestamp.
I want to just include the year and month. I thought there would be simple way to do it, but I can't figure it out.
Here's what I've tried:
df['ArrivalDate'].resample('M', how = 'mean')
I got the following error:
Only valid with DatetimeIndex or PeriodIndex
Then I tried:
df['ArrivalDate'].apply(lambda(x):x[:-2])
I got the following error:
'Timestamp' object has no attribute '__getitem__'
Any suggestions?
Edit: I sort of figured it out.
df.index = df['ArrivalDate']
Then, I can resample another column using the index.
But I'd still like a method for reconfiguring the entire column. Any ideas?
If you want new columns showing year and month separately you can do this:
df['year'] = pd.DatetimeIndex(df['ArrivalDate']).year
df['month'] = pd.DatetimeIndex(df['ArrivalDate']).month
or...
df['year'] = df['ArrivalDate'].dt.year
df['month'] = df['ArrivalDate'].dt.month
Then you can combine them or work with them just as they are.
The df['date_column'] has to be in date time format.
df['month_year'] = df['date_column'].dt.to_period('M')
You could also use D for Day, 2M for 2 Months etc. for different sampling intervals, and in case one has time series data with time stamp, we can go for granular sampling intervals such as 45Min for 45 min, 15Min for 15 min sampling etc.
You can directly access the year and month attributes, or request a datetime.datetime:
In [15]: t = pandas.tslib.Timestamp.now()
In [16]: t
Out[16]: Timestamp('2014-08-05 14:49:39.643701', tz=None)
In [17]: t.to_pydatetime() #datetime method is deprecated
Out[17]: datetime.datetime(2014, 8, 5, 14, 49, 39, 643701)
In [18]: t.day
Out[18]: 5
In [19]: t.month
Out[19]: 8
In [20]: t.year
Out[20]: 2014
One way to combine year and month is to make an integer encoding them, such as: 201408 for August, 2014. Along a whole column, you could do this as:
df['YearMonth'] = df['ArrivalDate'].map(lambda x: 100*x.year + x.month)
or many variants thereof.
I'm not a big fan of doing this, though, since it makes date alignment and arithmetic painful later and especially painful for others who come upon your code or data without this same convention. A better way is to choose a day-of-month convention, such as final non-US-holiday weekday, or first day, etc., and leave the data in a date/time format with the chosen date convention.
The calendar module is useful for obtaining the number value of certain days such as the final weekday. Then you could do something like:
import calendar
import datetime
df['AdjustedDateToEndOfMonth'] = df['ArrivalDate'].map(
lambda x: datetime.datetime(
x.year,
x.month,
max(calendar.monthcalendar(x.year, x.month)[-1][:5])
)
)
If you happen to be looking for a way to solve the simpler problem of just formatting the datetime column into some stringified representation, for that you can just make use of the strftime function from the datetime.datetime class, like this:
In [5]: df
Out[5]:
date_time
0 2014-10-17 22:00:03
In [6]: df.date_time
Out[6]:
0 2014-10-17 22:00:03
Name: date_time, dtype: datetime64[ns]
In [7]: df.date_time.map(lambda x: x.strftime('%Y-%m-%d'))
Out[7]:
0 2014-10-17
Name: date_time, dtype: object
If you want the month year unique pair, using apply is pretty sleek.
df['mnth_yr'] = df['date_column'].apply(lambda x: x.strftime('%B-%Y'))
Outputs month-year in one column.
Don't forget to first change the format to date-time before, I generally forget.
df['date_column'] = pd.to_datetime(df['date_column'])
SINGLE LINE: Adding a column with 'year-month'-paires:
('pd.to_datetime' first changes the column dtype to date-time before the operation)
df['yyyy-mm'] = pd.to_datetime(df['ArrivalDate']).dt.strftime('%Y-%m')
Accordingly for an extra 'year' or 'month' column:
df['yyyy'] = pd.to_datetime(df['ArrivalDate']).dt.strftime('%Y')
df['mm'] = pd.to_datetime(df['ArrivalDate']).dt.strftime('%m')
Extracting the Year say from ['2018-03-04']
df['Year'] = pd.DatetimeIndex(df['date']).year
The df['Year'] creates a new column. While if you want to extract the month just use .month
You can first convert your date strings with pandas.to_datetime, which gives you access to all of the numpy datetime and timedelta facilities. For example:
df['ArrivalDate'] = pandas.to_datetime(df['ArrivalDate'])
df['Month'] = df['ArrivalDate'].values.astype('datetime64[M]')
#KieranPC's solution is the correct approach for Pandas, but is not easily extendible for arbitrary attributes. For this, you can use getattr within a generator comprehension and combine using pd.concat:
# input data
list_of_dates = ['2012-12-31', '2012-12-29', '2012-12-30']
df = pd.DataFrame({'ArrivalDate': pd.to_datetime(list_of_dates)})
# define list of attributes required
L = ['year', 'month', 'day', 'dayofweek', 'dayofyear', 'weekofyear', 'quarter']
# define generator expression of series, one for each attribute
date_gen = (getattr(df['ArrivalDate'].dt, i).rename(i) for i in L)
# concatenate results and join to original dataframe
df = df.join(pd.concat(date_gen, axis=1))
print(df)
ArrivalDate year month day dayofweek dayofyear weekofyear quarter
0 2012-12-31 2012 12 31 0 366 1 4
1 2012-12-29 2012 12 29 5 364 52 4
2 2012-12-30 2012 12 30 6 365 52 4
Thanks to jaknap32, I wanted to aggregate the results according to Year and Month, so this worked:
df_join['YearMonth'] = df_join['timestamp'].apply(lambda x:x.strftime('%Y%m'))
Output was neat:
0 201108
1 201108
2 201108
There is two steps to extract year for all the dataframe without using method apply.
Step1
convert the column to datetime :
df['ArrivalDate']=pd.to_datetime(df['ArrivalDate'], format='%Y-%m-%d')
Step2
extract the year or the month using DatetimeIndex() method
pd.DatetimeIndex(df['ArrivalDate']).year
df['Month_Year'] = df['Date'].dt.to_period('M')
Result :
Date Month_Year
0 2020-01-01 2020-01
1 2020-01-02 2020-01
2 2020-01-03 2020-01
3 2020-01-04 2020-01
4 2020-01-05 2020-01
df['year_month']=df.datetime_column.apply(lambda x: str(x)[:7])
This worked fine for me, didn't think pandas would interpret the resultant string date as date, but when i did the plot, it knew very well my agenda and the string year_month where ordered properly... gotta love pandas!
Then I tried:
df['ArrivalDate'].apply(lambda(x):x[:-2])
I think here the proper input should be string.
df['ArrivalDate'].astype(str).apply(lambda(x):x[:-2])

how to convert sting of different date format in to single date format in python?

i have some of my date as 26-07-10 and others as 4/8/2010 as string type in a csv. i want them to be in single format like 4/8/2010 so that i can parse them and group them each year. Is there a function in python or pandas help me?
You can parse these date forms using parse_dates param of read_csv note however for ambiguous forms it may fail for instance if you gave month first forms mixed with day first:
In [7]:
t="""date
26-07-10
4/8/2010"""
df = pd.read_csv(io.StringIO(t), parse_dates=[0])
df
Out[7]:
date
0 2010-07-26
1 2010-04-08
You can alter the displayed format by changing the string format using dt.strftime:
In [10]:
df['date'].dt.strftime('%d/%m/%Y')
Out[10]:
0 26/07/2010
1 08/04/2010
Name: date, dtype: object
Really though it's better to keep the column as a datetime you can then groupby on year:
In [11]:
t="""date,val
26-07-10,23
4/8/2010,5567"""
df = pd.read_csv(io.StringIO(t), parse_dates=[0])
df
Out[11]:
date val
0 2010-07-26 23
1 2010-04-08 5567
In [12]:
df.groupby(df['date'].dt.year).mean()
Out[12]:
val
date
2010 2795
You can try using parse-date parameter of pd.read_csv() as mentioned by #EdChum
Alternatively, you could typecast them to a standard format, like that of datetime.date as follows:
import io
import datetime
t=u"""date
26-07-10
4/8/2010"""
df = pd.read_csv(io.StringIO(t), parse_dates=[0])
df.date.astype(datetime.date)
df
out:
date
0 2010-07-26
1 2010-04-08

Categories

Resources