Pandas - Problem reading a datetime64 object from a json file - python

I'm trying to save a Pandas data frame as a JSON file. One of the columns has the type datetime64[ns]. When I print the column the correct date is displayed. However, when I save it as a JSON file and read it back later the date value changes even after I change the column type back to datetime64[ns]. Here is a sample app which shows the issue I am running into:
#!/usr/bin/python3
import json
import pandas as pd
s = pd.Series(['1/1/2000'])
in_df = pd.DataFrame(s)
in_df[0] = pd.to_datetime(in_df[0], format='%m/%d/%Y')
print("in_df\n")
print(in_df)
print("\nin_df dtypes\n")
print(in_df.dtypes)
in_df.to_json("test.json")
out_df = pd.read_json("test.json")
out_df[0] = out_df[0].astype('datetime64[ns]')
print("\nout_df\n")
print(out_df)
print("\nout_df dtypes\n")
print(out_df.dtypes)
Here is the output:
in_df
0
0 2000-01-01
in_df dtypes
0 datetime64[ns]
dtype: object
out_df
0
0 1970-01-01 00:15:46.684800 <--- Why don't I get 2000-1-1 here?
out_df dtypes
0 datetime64[ns]
dtype: object
I'm expecting to get the original date (2000-1-1) displayed when I read back the JSON file. What am I doing wrong with my conversion? Thanks!

By default, to_json stores datetime values as Unix epoch milliseconds. When you read the file back, the column comes in as plain integers, and astype('datetime64[ns]') interprets those integers as nanoseconds, which is why you land a few minutes past 1970-01-01. Parse the column back with the correct unit instead:
df = pd.read_json("test.json")
df[0] = pd.to_datetime(df[0], unit='ms')
print("\ndf\n")
print(df)
print("\ndf dtypes\n")
print(df.dtypes)
will give you
df
0
0 2000-01-01
df dtypes
0 datetime64[ns]
dtype: object
This should work for any JSON column that was stored as epoch milliseconds.
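As an alternative (a minimal sketch, assuming a reasonably recent pandas), you can ask to_json to write ISO-8601 strings instead of epoch milliseconds, so the dates survive the round trip as human-readable text:
import pandas as pd

in_df = pd.DataFrame({0: pd.to_datetime(['1/1/2000'], format='%m/%d/%Y')})

# Write the datetime column as ISO-8601 strings instead of epoch milliseconds.
in_df.to_json("test.json", date_format='iso')

out_df = pd.read_json("test.json")
# ISO strings parse unambiguously, so no unit= is needed here.
out_df[0] = pd.to_datetime(out_df[0])
print(out_df)
print(out_df.dtypes)
The trade-off is a slightly larger file, but you no longer need to remember which unit the epoch values were written in.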

Related

Pandas convert partial column Index to Datetime

The DataFrame below contains a housing price dataset from 1996 to 2016.
All columns other than the first 6 need to be converted to Datetime type.
I tried to run the following code:
HousingPrice.columns[6:] = pd.to_datetime(HousingPrice.columns[6:])
but I got the error:
TypeError: Index does not support mutable operations
I wish to convert some of the columns in the column Index to Datetime type, but not all of them.
The pandas index is immutable, so you can't do that.
However, you can access and modify the underlying values of the column index through its array attribute (see the pandas documentation for Index.array).
HousingPrice.columns.array[6:] = pd.to_datetime(HousingPrice.columns[6:])
should work.
Note that this changes the column index only. In order to convert the column values as well, you can do this:
date_cols = HousingPrice.columns[6:]
HousingPrice[date_cols] = HousingPrice[date_cols].apply(pd.to_datetime, errors='coerce', axis=1)
EDIT
Illustrated example:
import pandas as pd

data = {'0ther_col': [1,2,3], '1996-04': ['1996-04','1996-05','1996-06'], '1995-05':['1996-02','1996-08','1996-10']}
print('ORIGINAL DATAFRAME')
df = pd.DataFrame.from_records(data)
print(df)
print("\nDATE COLUMNS")
date_cols = df.columns[-2:]
print(df.dtypes)
print('\nCASTING DATE COLUMNS TO DATETIME')
df[date_cols] = df[date_cols].apply(pd.to_datetime, errors='coerce', axis=1)
print(df.dtypes)
print('\nCASTING DATE COLUMN INDEXES TO DATETIME')
print("OLD INDEX -", df.columns)
df.columns.array[-2:] = pd.to_datetime(df[date_cols].columns)
print("NEW INDEX -",df.columns)
print('\nFINAL DATAFRAME')
print(df)
yields:
ORIGINAL DATAFRAME
0ther_col 1995-05 1996-04
0 1 1996-02 1996-04
1 2 1996-08 1996-05
2 3 1996-10 1996-06
DATE COLUMNS
0ther_col int64
1995-05 object
1996-04 object
dtype: object
CASTING DATE COLUMNS TO DATETIME
0ther_col int64
1995-05 datetime64[ns]
1996-04 datetime64[ns]
dtype: object
CASTING DATE COLUMN INDEXES TO DATETIME
OLD INDEX - Index(['0ther_col', '1995-05', '1996-04'], dtype='object')
NEW INDEX - Index(['0ther_col', 1995-05-01 00:00:00, 1996-04-01 00:00:00], dtype='object')
FINAL DATAFRAME
0ther_col 1995-05-01 00:00:00 1996-04-01 00:00:00
0 1 1996-02-01 1996-04-01
1 2 1996-08-01 1996-05-01
2 3 1996-10-01 1996-06-01
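If writing through columns.array feels fragile, a minimal alternative sketch (applied to the same df, in place of the .array assignment above) is to rebuild the column index as a whole:
# Keep the first column label as-is, replace the last two with Timestamps,
# and assign a brand-new (object-dtype) index back to df.columns.
df.columns = list(df.columns[:-2]) + list(pd.to_datetime(df.columns[-2:]))
print(df.columns)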

Convert SQL timestamp column to date format column of Python dataframe

I have a data upload in MS Excel format.
This file has a column with dates in "dd.mm.yyyy 00:00:00" format.
Reading file with code:
df = pd.read_excel('data_from_db.xlsx')
I receive a frame where the dates column has "object" type. Then I convert this column to date format with the command:
df['Date_Column'] = pd.to_datetime(df['Date_Column'])
That gives me "datetime64[ns]" type.
But this command does not always work correctly. I encounter rows with muddled data:
some rows come out in "yyyy.mm.dd" format,
others in "yyyy.dd.mm".
How should I correctly convert an Excel column in "dd.mm.yyyy 00:00:00" format to a pandas dataframe column with date type and "dd.mm.yyyy" format?
P.S. I also noticed an oddity: some values in the raw date column have str type, others float. I can't wrap my head around it, because the raw table is an upload from a database.
Without specifying a format, pd.to_datetime has to guess from the data how a date string should be interpreted. With the default parameters this fails for the second and third rows of your data:
In [5]: date_of_hire = pd.Series(['18.01.2018 0:00:00',
'01.02.2018 0:00:00',
'06.11.2018 0:00:00'])
In [6]: pd.to_datetime(date_of_hire)
Out[6]:
0 2018-01-18
1 2018-01-02
2 2018-06-11
dtype: datetime64[ns]
The quickest solution would be to pass dayfirst=True:
In [7]: pd.to_datetime(date_of_hire, dayfirst=True)
Out[7]:
0 2018-01-18
1 2018-02-01
2 2018-11-06
dtype: datetime64[ns]
If you know the complete format of your data, you can specify it directly. This only works if the format matches exactly; if a row lacks the time component, for example, the conversion will fail.
In [8]: pd.to_datetime(date_of_hire, format='%d.%m.%Y %H:%M:%S')
Out[8]:
0 2018-01-18
1 2018-02-01
2 2018-11-06
dtype: datetime64[ns]
If you have little information about the date format other than that it is consistent, pandas can infer the format from the data beforehand:
In [9]: pd.to_datetime(date_of_hire, infer_datetime_format=True)
Out[9]:
0 2018-01-18
1 2018-02-01
2 2018-11-06
dtype: datetime64[ns]
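If you also need the column shown as "dd.mm.yyyy" (as asked), one option is a small sketch like this: keep the parsed datetime64 column for calculations and render a string view with dt.strftime only for display (the strftime result is a plain object column, not a date dtype):
import pandas as pd

date_of_hire = pd.Series(['18.01.2018 0:00:00',
                          '01.02.2018 0:00:00',
                          '06.11.2018 0:00:00'])

# Keep the parsed datetime64 column for computations...
parsed = pd.to_datetime(date_of_hire, format='%d.%m.%Y %H:%M:%S')
# ...and render a "dd.mm.yyyy" string view only for display.
print(parsed.dt.strftime('%d.%m.%Y'))
# 0    18.01.2018
# 1    01.02.2018
# 2    06.11.2018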

Python weekday drop from DataFrame

I'm trying to drop the weekdays from a dataframe (financial time series), and I keep getting the following error:
"AttributeError: 'Series' object has no attribute 'weekday'"
Here is my code:
df = df[df.date.weekday() < 5]
df = df.drop(df.date.weekday() < 5)
I tried a few others but nothing seemed to work.
I looked at dtypes and this is what I get:
Unnamed: 0 int64
close float32
date object
high float64
low float64
open float64
quoteVolume float64
volume float64
weightedAverage float64
dtype: object
So date is an object, but I can't transform it to datetime. I tried these:
df['date'] = df.date.astype('date')
df['date'] = df.date.astype('datetime')
both gave me the error:
TypeError: data type "date" not understood
The time format of the Series is 2016-09-23 17:00:00, i.e. yyyy-MM-dd hh:mm:ss.
Use pd.to_datetime:
import pandas as pd
df = df[pd.to_datetime(df.date).dt.weekday < 5]
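For illustration, a minimal self-contained sketch with made-up data (the comments mark the weekday of each sample date):
import pandas as pd

df = pd.DataFrame({
    'date': ['2016-09-23 17:00:00',   # Friday
             '2016-09-24 17:00:00',   # Saturday
             '2016-09-26 17:00:00'],  # Monday
    'close': [1.0, 2.0, 3.0],
})

# weekday is Monday=0 ... Sunday=6, so "< 5" keeps Monday-Friday rows
# and drops the Saturday row.
df = df[pd.to_datetime(df.date).dt.weekday < 5]
print(df)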

Reading CSV file in Pandas with historical dates

I'm trying to read in a file with dates in the (UK) format 13/01/1800; however, some of the dates are before 1667, which cannot be represented by the nanosecond timestamp (see http://pandas.pydata.org/pandas-docs/stable/gotchas.html#gotchas-timestamp-limits). I understand from that page that I need to create my own PeriodIndex to cover the range I need (see http://pandas.pydata.org/pandas-docs/stable/timeseries.html#timeseries-oob), but I can't work out how to convert the string from the csv reader to a date in this PeriodIndex.
So far I have:
span = pd.period_range('1000-01-01', '2100-01-01', freq='D')
df_earliest= pd.read_csv("objects.csv", index_col=0, names=['Object Id', 'Earliest Date'], parse_dates=[1], infer_datetime_format=True, dayfirst=True)
How do I apply the span to the date reader/converter so I can create a PeriodIndex / DateTimeIndex column in the dataframe ?
You can try to do it this way:
import pandas as pd

fn = r'D:\temp\.data\36987699.csv'

def dt_parse(s):
    # Build a Period, which is not limited to the Timestamp nanosecond range.
    d, m, y = s.split('/')
    return pd.Period(year=int(y), month=int(m), day=int(d), freq='D')

df = pd.read_csv(fn, parse_dates=[0], date_parser=dt_parse)
Input file:
Date,col1
13/01/1800,aaa
25/12/1001,bbb
01/03/1267,ccc
Test:
In [16]: df
Out[16]:
Date col1
0 1800-01-13 aaa
1 1001-12-25 bbb
2 1267-03-01 ccc
In [17]: df.dtypes
Out[17]:
Date object
col1 object
dtype: object
In [18]: df['Date'].dt.year
Out[18]:
0 1800
1 1001
2 1267
Name: Date, dtype: int64
P.S. You may want to add a try/except block in the dt_parse() function to catch the ValueError exceptions that int() can raise on malformed input.
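A sketch of what that could look like, returning NaT for values that fail to parse (assuming silently skipping bad rows is acceptable for your data):
import pandas as pd

def dt_parse(s):
    try:
        d, m, y = s.split('/')
        return pd.Period(year=int(y), month=int(m), day=int(d), freq='D')
    except (ValueError, AttributeError):
        # Malformed or non-string values become NaT instead of aborting the read.
        return pd.NaT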

How to convert strings of different date formats into a single date format in python?

I have some of my dates as 26-07-10 and others as 4/8/2010, stored as strings in a CSV. I want them in a single format like 4/8/2010 so that I can parse them and group them by year. Is there a function in python or pandas to help me?
You can parse these date forms using the parse_dates param of read_csv. Note, however, that it may fail for ambiguous forms, for instance if you mix month-first forms with day-first ones:
In [7]:
import io
import pandas as pd

t="""date
26-07-10
4/8/2010"""
df = pd.read_csv(io.StringIO(t), parse_dates=[0])
df
Out[7]:
date
0 2010-07-26
1 2010-04-08
You can alter the displayed format by changing the string format using dt.strftime:
In [10]:
df['date'].dt.strftime('%d/%m/%Y')
Out[10]:
0 26/07/2010
1 08/04/2010
Name: date, dtype: object
Really though, it's better to keep the column as a datetime; you can then groupby on year:
In [11]:
t="""date,val
26-07-10,23
4/8/2010,5567"""
df = pd.read_csv(io.StringIO(t), parse_dates=[0])
df
Out[11]:
date val
0 2010-07-26 23
1 2010-04-08 5567
In [12]:
df.groupby(df['date'].dt.year).mean()
Out[12]:
val
date
2010 2795
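If the mixed forms really are ambiguous, one defensive sketch (assuming both of your patterns are day-first and no other formats occur) is to parse each format explicitly and combine the results:
import io
import pandas as pd

t = """date,val
26-07-10,23
4/8/2010,5567"""
df = pd.read_csv(io.StringIO(t))

# Parse each known format explicitly; rows that don't match become NaT and
# are filled in from the other attempt.
d1 = pd.to_datetime(df['date'], format='%d-%m-%y', errors='coerce')
d2 = pd.to_datetime(df['date'], format='%d/%m/%Y', errors='coerce')
df['date'] = d1.fillna(d2)
print(df)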
You can try using the parse_dates parameter of pd.read_csv() as mentioned by @EdChum.
Alternatively, you could typecast them to a standard format, like datetime.date, as follows:
import io
import datetime
t=u"""date
26-07-10
4/8/2010"""
df = pd.read_csv(io.StringIO(t), parse_dates=[0])
df.date.astype(datetime.date)
df
out:
date
0 2010-07-26
1 2010-04-08
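A related sketch, in case what you specifically want are plain datetime.date objects rather than a datetime64 column: parse first, then extract the date part with .dt.date (note the resulting column dtype is object):
import io
import pandas as pd

t = u"""date
26-07-10
4/8/2010"""
df = pd.read_csv(io.StringIO(t), parse_dates=[0])

# .dt.date extracts plain datetime.date objects; the column dtype becomes object.
df['date'] = df['date'].dt.date
print(df)
print(type(df['date'].iloc[0]))  # <class 'datetime.date'>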
