Parsing Pandas df Column of mixed data into Datetime - python

df = pd.DataFrame('23.Jan.2020 01.Mar.2017 5663:33 20.May.2021 626'.split())
I want to convert the date-like elements to datetime and, for the numbers, to return the original value.
I have tried
t=pd.to_datetime(df[0], format='%d.%b.%Y', errors='ignore')
which just returns the original df with no change. I have also tried changing errors to 'coerce', which converts the date-like elements, but the numbers become NaT
t=pd.to_datetime(df[0], format='%d.%b.%Y', errors='coerce')
Then I attempted to return the original df value where t is NaT, and otherwise substitute the new datetime from t:
df.where(t.isnull(), other=t, axis=1)
This works for returning the original df value where t is NaT, but it doesn't carry over the datetime values.

Maybe this is what you want?
dt = pd.Series('23.Jan.2020 01.Mar.2017 5663:33 20.May.2021 626'.split())
res = pd.to_datetime(dt, format="%d.%b.%Y", errors='coerce').fillna(dt)
This way the resulting elements in the series have the correct types:
>>> res.map(type)
0 <class 'pandas._libs.tslibs.timestamps.Timesta...
1 <class 'pandas._libs.tslibs.timestamps.Timesta...
2 <class 'str'>
3 <class 'pandas._libs.tslibs.timestamps.Timesta...
4 <class 'str'>
dtype: object
PS: I used a Series because it's easier to pass to to_datetime, and to Series.fillna.
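If you need the result back in the original DataFrame rather than in a standalone Series, assigning the same expression to the column should work (a minimal sketch of the same idea):
df[0] = pd.to_datetime(df[0], format='%d.%b.%Y', errors='coerce').fillna(df[0])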

This will combine the two field types in the way you have specified:
import pandas as pd
df = pd.DataFrame('23.Jan.2020 01.Mar.2017 5663:33 20.May.2021 626'.split())
mod = pd.to_datetime(df[0], format='%d.%b.%Y', errors='coerce')
ndf = pd.concat([df, mod], axis=1)
ndf.columns = ['original', 'modified']
def funk(col1, col2):
    # keep the original value when the parsed value is NaT
    return col1 if pd.isnull(col2) else col2
ndf.apply(lambda x: funk(x.original, x.modified), axis=1)
# 0 2020-01-23 00:00:00
# 1 2017-03-01 00:00:00
# 2 5663:33
# 3 2021-05-20 00:00:00
# 4 626
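A more concise variant of the same idea (a sketch using Series.combine_first, which fills the NaT positions in the parsed column from the original one):
mod.combine_first(df[0])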

Related

Set the correct datetime format with pandas

I have trouble setting the correct datetime format with Pandas; I do not understand why my command does not work. Any solution?
date = ['01/10/2014 00:03:20']
value = [33.24]
df = pd.DataFrame({'value':value,'index':date})
df.index = pd.to_datetime(df.index,format='%d/%m/%y %H:%M:%S')
Solution for DatetimeIndex:
date = ['01/10/2014 00:03:20']
value = [33.24]
#create index by date list
df = pd.DataFrame({'value':value},index=date)
#use %Y to match YYYY; %y matches two-digit YY year formats
df.index = pd.to_datetime(df.index,format='%d/%m/%Y %H:%M:%S')
print (df)
                     value
2014-10-01 00:03:20  33.24
If you want 'index' as a regular column instead, it is necessary to use [] to select the column and avoid selecting the RangeIndex:
df = pd.DataFrame({'value':value,'index':date})
df['index'] = pd.to_datetime(df['index'],format='%d/%m/%Y %H:%M:%S')
print (df)
   value               index
0  33.24 2014-10-01 00:03:20
Calling a column 'index' is a bit confusing, changed it to 'index_date'.
import pandas as pd
date = ['01/10/2014 00:03:20']
value = [33.24]
df = pd.DataFrame({'value':value,'index_date':date})
df['index_date'] = pd.to_datetime(df["index_date"], errors="coerce")
Output of df:
   value          index_date
0  33.24 2014-01-10 00:03:20
And if you run df.dtypes
value                float64
index_date    datetime64[ns]
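Note that without an explicit format, '01/10/2014' above was parsed month-first (January 10) rather than as October 1. If day-first parsing is intended, passing a format (or dayfirst=True) avoids the silent swap:
df['index_date'] = pd.to_datetime(df['index_date'], format='%d/%m/%Y %H:%M:%S', errors='coerce')
# or, less strictly:
df['index_date'] = pd.to_datetime(df['index_date'], dayfirst=True, errors='coerce')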

Selecting columns by column NAME dtype

import pandas as pd
import numpy as np
cols = ['string',pd.Timestamp('2017-10-13'), 'anotherstring', pd.Timestamp('2017-10-14')]
pd.DataFrame(np.random.rand(5,4), columns=cols)
How can I get back just the 2nd and 4th column (whose names have type 'datetime.datetime')? The dtypes of the column contents are exactly the same, so select_dtypes doesn't help.
Use type with map:
df = df.loc[:, df.columns.map(type) == pd.Timestamp]
print (df)
2017-10-13 00:00:00 2017-10-14 00:00:00
0 0.894932 0.502015
1 0.080334 0.155712
2 0.600152 0.206344
3 0.008913 0.919534
4 0.280229 0.951434
Details:
print (df.columns.map(type))
Index([<class 'str'>,
       <class 'pandas._libs.tslib.Timestamp'>,
       <class 'str'>,
       <class 'pandas._libs.tslib.Timestamp'>], dtype='object')
print (df.columns.map(type) == pd.Timestamp)
[False True False True]
Alternative solution:
df1 = df.loc[:, [isinstance(i, pd.Timestamp) for i in df.columns]]
print (df1)
2017-10-13 00:00:00 2017-10-14 00:00:00
0 0.818283 0.128299
1 0.570288 0.458400
2 0.857426 0.395963
3 0.595765 0.306861
4 0.196899 0.438231
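If you only need the matching labels rather than the sub-DataFrame, the same isinstance test can produce a plain list of column names (a small sketch of the idea above):
ts_cols = [c for c in df.columns if isinstance(c, pd.Timestamp)]
df_ts = df[ts_cols]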

When to apply(pd.to_numeric) and when to astype(np.float64) in python?

I have a pandas DataFrame object named xiv which has a column of int64 Volume measurements.
In[]: xiv['Volume'].head(5)
Out[]:
0 252000
1 484000
2 62000
3 168000
4 232000
Name: Volume, dtype: int64
I have read other posts (like this and this) that suggest the following solutions. But when I use either approach, it doesn't appear to change the dtype of the underlying data:
In[]: xiv['Volume'] = pd.to_numeric(xiv['Volume'])
In[]: xiv['Volume'].dtypes
Out[]:
dtype('int64')
Or...
In[]: xiv['Volume'] = pd.to_numeric(xiv['Volume'])
Out[]: ###omitted for brevity###
In[]: xiv['Volume'].dtypes
Out[]:
dtype('int64')
In[]: xiv['Volume'] = xiv['Volume'].apply(pd.to_numeric)
In[]: xiv['Volume'].dtypes
Out[]:
dtype('int64')
I've also tried making a separate pandas Series and using the methods listed above on that Series, then reassigning to the xiv['Volume'] object, which is a pandas.core.series.Series object.
I have, however, found a solution to this problem using the numpy package's float64 type - this works but I don't know why it's different.
In[]: xiv['Volume'] = xiv['Volume'].astype(np.float64)
In[]: xiv['Volume'].dtypes
Out[]:
dtype('float64')
Can someone explain how to accomplish with the pandas library what the numpy library seems to do easily with its float64 class; that is, convert the column in the xiv DataFrame to a float64 in place?
If you already have a numeric dtype (int8|16|32|64, float64, boolean) you can convert it to another "numeric" dtype using the Pandas .astype() method.
Demo:
In [90]: df = pd.DataFrame(np.random.randint(10**5,10**7,(5,3)),columns=list('abc'), dtype=np.int64)
In [91]: df
Out[91]:
a b c
0 9059440 9590567 2076918
1 5861102 4566089 1947323
2 6636568 162770 2487991
3 6794572 5236903 5628779
4 470121 4044395 4546794
In [92]: df.dtypes
Out[92]:
a int64
b int64
c int64
dtype: object
In [93]: df['a'] = df['a'].astype(float)
In [94]: df.dtypes
Out[94]:
a float64
b int64
c int64
dtype: object
It won't work for object (string) dtypes that can't be converted to numbers:
In [95]: df.loc[1, 'b'] = 'XXXXXX'
In [96]: df
Out[96]:
a b c
0 9059440.0 9590567 2076918
1 5861102.0 XXXXXX 1947323
2 6636568.0 162770 2487991
3 6794572.0 5236903 5628779
4 470121.0 4044395 4546794
In [97]: df.dtypes
Out[97]:
a float64
b object
c int64
dtype: object
In [98]: df['b'].astype(float)
...
skipped
...
ValueError: could not convert string to float: 'XXXXXX'
So here we want to use the pd.to_numeric() method:
In [99]: df['b'] = pd.to_numeric(df['b'], errors='coerce')
In [100]: df
Out[100]:
a b c
0 9059440.0 9590567.0 2076918
1 5861102.0 NaN 1947323
2 6636568.0 162770.0 2487991
3 6794572.0 5236903.0 5628779
4 470121.0 4044395.0 4546794
In [101]: df.dtypes
Out[101]:
a float64
b float64
c int64
dtype: object
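As an extra note (not part of the original answer): pd.to_numeric also accepts a downcast parameter when you want the smallest dtype that can hold the values:
pd.to_numeric(df['c'], downcast='integer')   # e.g. int32 instead of int64 when the values fit
pd.to_numeric(df['a'], downcast='float')     # e.g. float32 instead of float64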
I don't have a technical explanation for this, but I have noticed that pd.to_numeric() raises the following error when converting the string 'nan':
In [10]: df = pd.DataFrame({'value': 'nan'}, index=[0])
In [11]: pd.to_numeric(df.value)
Traceback (most recent call last):
File "<ipython-input-11-98729d13e45c>", line 1, in <module>
pd.to_numeric(df.value)
File "C:\Users\joshua.lee\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\tools\numeric.py", line 133, in to_numeric
coerce_numeric=coerce_numeric)
File "pandas/_libs/src\inference.pyx", line 1185, in pandas._libs.lib.maybe_convert_numeric
ValueError: Unable to parse string "nan" at position 0
whereas astype(float) does not:
df.value.astype(float)
Out[12]:
0 NaN
Name: value, dtype: float64
You can use this:
pd.to_numeric(df.value, errors='coerce').fillna(0, downcast='infer')
It will use zero in place of nan.
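For example (a quick sketch; newer pandas versions deprecate the downcast argument of fillna, in which case .fillna(0).astype('int64') achieves the same):
s = pd.Series(['nan', '1', '2'])
pd.to_numeric(s, errors='coerce').fillna(0, downcast='infer')
# 0    0
# 1    1
# 2    2
# dtype: int64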
I observed that I was able to convert object (str) to float first, and then float to int64.
df = pd.DataFrame(np.random.randint(10**5, 10**7, (5, 3)), columns=list('abc'), dtype=np.int64)
df['a'] = df['a'].astype('str')
df.dtypes
df['a'] = df['a'].astype('float')
df['a'] = df['a'].astype('int64')
Worked fine.
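If the column may contain missing values, note that plain int64 cannot hold NaN; pandas' nullable integer dtype (capital-I 'Int64') can (an extra note, not part of the original answer):
s = pd.Series(['1', '2', None])
s.astype('float').astype('Int64')
# 0       1
# 1       2
# 2    <NA>
# dtype: Int64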
I think I have an explanation that buttresses what the others gave. In summary, and as I will show below, pd.to_numeric(arg, errors='coerce') can handle values that cannot be converted to numeric, such as '50a', by converting them to NaN. You can then drop the null values. DataFrame.astype(), by contrast, does not have that ability.
In practice, I use pd.to_numeric(arg, errors='coerce') first, especially when the DataFrame column or series might hold values that cannot be converted to numeric, since it turns those values into NaN. I then drop the NaN if desired, and finally use DataFrame.astype() to convert to the exact numeric data type I want, such as float64, int32, int64, etc. (see the combined sketch after the examples below).
See examples below:
bio = {'Age': [56, 57, '50a'], 'Name': ['YOU', 'ME', 'HIM']}
df = pd.DataFrame(bio)
>>> df
Age Name
0 56 YOU
1 57 ME
2 50a HIM
>>> df['Age'] = df['Age'].astype(int)
.......
.......
ValueError: invalid literal for int() with base 10: '50a'
# Even when errors='ignore' is passed, the change is not made
>>> df['Age'] = df['Age'].astype(int, errors='ignore')
>>> df
Age Name
0 56 YOU
1 57 ME
2 50a HIM
Observe what will happen when I use pd.to_numeric(arg, errors='coerce')
>>> df['Age'] = pd.to_numeric(df['Age']) #Used without the coerce
........
........
ValueError: Unable to parse string "50a" at position 2
# When used with the parameter errors='coerce', it changes invalid values to NaN.
# You can then use fillna(0), optionally followed by astype(int), to turn the NaN into 0
>>> df['Age'] = pd.to_numeric(df['Age'], errors='coerce')
>>> df
Age Name
0 56.0 YOU
1 57.0 ME
2 NaN HIM
# You can then drop nulls if you desire
In summary, both work hand in hand for specific purposes, especially when handling nulls.
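Putting the pieces together, the coerce-then-cast pipeline described above looks roughly like this:
df['Age'] = pd.to_numeric(df['Age'], errors='coerce')  # invalid values -> NaN
df = df.dropna(subset=['Age'])                         # drop rows that failed to parse
df['Age'] = df['Age'].astype('int64')                  # now safe to cast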

Remove dtype datetime NaT

I am preparing a pandas df for output, and would like to remove the NaN and NaT in the table, and leave those table locations blank. An example would be
mydataframesample
col1 col2 timestamp
a b 2014-08-14
c NaN NaT
would become
col1 col2 timestamp
a b 2014-08-14
c
Most of the values are dtype object, with the timestamp column being datetime64[ns]. In order to fix this, I attempted to use pandas' mydataframesample.fillna(' ') to effectively leave a space in the location. However, this doesn't work with the datetime types. In order to get around this, I'm trying to convert the timestamp column back to object or string type.
Is it possible to remove the NaN/NaT without doing the type conversion? If not, how do I do the type conversion (tried str() and astype(str) but difficulty with datetime being the original format)?
I had the same issue. This does it all in place using the pandas apply function. It should be the fastest method.
import pandas as pd
df['timestamp'] = df['timestamp'].apply(lambda x: x.strftime('%Y-%m-%d') if not pd.isnull(x) else '')
If your timestamp field is not yet in datetime format, then:
import pandas as pd
df['timestamp'] = pd.to_datetime(df['timestamp']).apply(lambda x: x.strftime('%Y-%m-%d') if not pd.isnull(x) else '')
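If the column is already datetime64, the .dt accessor avoids the per-row lambda (a sketch; .dt.strftime leaves NaN for NaT, which fillna then blanks out):
df['timestamp'] = df['timestamp'].dt.strftime('%Y-%m-%d').fillna('')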
This won't win any speed awards, but if the DataFrame is not too long, reassignment using a list comprehension will do the job:
df1['date'] = [d.strftime('%Y-%m-%d') if not pd.isnull(d) else '' for d in df1['date']]
import numpy as np
import pandas as pd
Timestamp = pd.Timestamp
nan = np.nan
NaT = pd.NaT
df1 = pd.DataFrame({
    'col1': list('ac'),
    'col2': ['b', nan],
    'date': (Timestamp('2014-08-14'), NaT)
})
df1['col2'] = df1['col2'].fillna('')
df1['date'] = [d.strftime('%Y-%m-%d') if not pd.isnull(d) else '' for d in df1['date']]
print(df1)
yields
  col1 col2        date
0    a    b  2014-08-14
1    c
#unutbu's answer will work fine, but if you don't want to modify the DataFrame, you could do something like this: to_html takes a parameter for how NaN is represented; to handle the NaT you need to pass a custom formatting function.
date_format = lambda d : pd.to_datetime(d).strftime('%Y-%m-%d') if not pd.isnull(d) else ''
df1.to_html(na_rep='', formatters={'date': date_format})
If all you want to do is convert to a string:
In [37]: df1.to_csv(None,sep=' ')
Out[37]: ' col1 col2 date\n0 a b "2014-08-14 00:00:00"\n1 c \n'
To replace missing values with a string
In [36]: df1.to_csv(None,sep=' ',na_rep='missing_value')
Out[36]: ' col1 col2 date\n0 a b "2014-08-14 00:00:00"\n1 c missing_value missing_value\n'

Reindexing pandas timeseries from object dtype to datetime dtype

I have a time-series that is not recognized as a DatetimeIndex despite being indexed by standard YYYY-MM-DD strings with valid dates. Coercing them to a valid DatetimeIndex seems to be inelegant enough to make me think I'm doing something wrong.
I read in (someone else's lazily formatted) data that contains invalid datetime values and remove these invalid observations.
In [1]: df = pd.read_csv('data.csv',index_col=0)
In [2]: print df['2008-02-27':'2008-03-02']
Out[2]:
count
2008-02-27 20
2008-02-28 0
2008-02-29 27
2008-02-30 0
2008-02-31 0
2008-03-01 0
2008-03-02 17
In [3]: def clean_timestamps(df):
            # remove invalid dates like '2008-02-30' and '2009-04-31'
            # (assumes `import datetime` earlier in the session)
            to_drop = list()
            for d in df.index:
                try:
                    datetime.date(int(d[0:4]), int(d[5:7]), int(d[8:10]))
                except ValueError:
                    to_drop.append(d)
            df2 = df.drop(to_drop, axis=0)
            return df2
In [4]: df2 = clean_timestamps(df)
In [5]: print df2['2008-02-27':'2008-03-02']
Out[5]:
count
2008-02-27 20
2008-02-28 0
2008-02-29 27
2008-03-01 0
2008-03-02 17
This new index is still only recognized as an 'object' dtype rather than a DatetimeIndex.
In [6]: df2.index
Out[6]: Index([2008-01-01, 2008-01-02, 2008-01-03, ..., 2012-11-27, 2012-11-28,
2012-11-29], dtype=object)
Reindexing produces NaNs because they're different dtypes.
In [7]: i = pd.date_range(start=min(df2.index),end=max(df2.index))
In [8]: df3 = df2.reindex(index=i,columns=['count'])
In [9]: df3['2008-02-27':'2008-03-02']
Out[9]:
count
2008-02-27 NaN
2008-02-28 NaN
2008-02-29 NaN
2008-03-01 NaN
2008-03-02 NaN
I create a fresh dataframe with the appropriate index, dump the data to a dictionary, then populate the new dataframe based on the dictionary values (skipping missing values).
In [10]: df3 = pd.DataFrame(columns=['count'],index=i)
In [11]: values = dict(df2['count'])
In [12]: for d in i:
             try:
                 df3.set_value(index=d, col='count', value=values[d.isoformat()[0:10]])
             except KeyError:
                 pass
In [13]: print df3['2008-02-27':'2008-03-02']
Out[13]:
count
2008-02-27 20
2008-02-28 0
2008-02-29 27
2008-03-01 0
2008-03-02 17
In [14]: df3.index
Out[14]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2008-01-01 00:00:00, ..., 2012-11-29 00:00:00]
Length: 1795, Freq: D, Timezone: None
This last part of setting values based on lookups to a dictionary keyed by strings seems especially hacky and makes me think I've missed something important.
You could use pd.to_datetime:
In [1]: import pandas as pd
In [2]: pd.to_datetime('2008-02-27')
Out[2]: datetime.datetime(2008, 2, 27, 0, 0)
This allows you to "clean" the index (or similarly a column) by applying it to the Series:
df.index = pd.to_datetime(df.index)
or
df['date_col'] = df['date_col'].apply(pd.to_datetime)
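For the invalid dates in this question (e.g. '2008-02-30'), errors='coerce' can also replace the manual clean_timestamps step entirely (a sketch, not from the original answer):
df.index = pd.to_datetime(df.index, errors='coerce')  # invalid dates become NaT
df = df[df.index.notnull()]                           # drop them
df = df.reindex(pd.date_range(df.index.min(), df.index.max()))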
