Remove dtype datetime NaT - python

I am preparing a pandas df for output, and would like to remove the NaN and NaT in the table, and leave those table locations blank. An example would be
mydataframesample
col1 col2 timestamp
a b 2014-08-14
c NaN NaT
would become
col1 col2 timestamp
a b 2014-08-14
c
Most of the values are dtype object, with the timestamp column being datetime64[ns]. To fix this, I attempted to use pandas' mydataframesample.fillna(' ') to effectively leave a space in those locations. However, this doesn't work with the datetime type. To get around this, I'm trying to convert the timestamp column back to object or string type.
Is it possible to remove the NaN/NaT without doing the type conversion? If not, how do I do the type conversion? (I tried str() and astype(str), but had difficulty because the original format is datetime.)

I had the same issue: this does it all in place using the pandas apply function, and should be the fastest method.
import pandas as pd
df['timestamp'] = df['timestamp'].apply(lambda x: x.strftime('%Y-%m-%d') if not pd.isnull(x) else '')
If your timestamp field is not yet in datetime format, then:
import pandas as pd
df['timestamp'] = pd.to_datetime(df['timestamp']).apply(lambda x: x.strftime('%Y-%m-%d') if not pd.isnull(x) else '')
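A vectorized sketch (assuming the column is already datetime64): Series.dt.strftime turns NaT into NaN, which fillna can then blank out.
import pandas as pd
# NaT becomes NaN after dt.strftime, so fillna('') leaves those cells empty
df['timestamp'] = df['timestamp'].dt.strftime('%Y-%m-%d').fillna('')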

This won't win any speed awards, but if the DataFrame is not too long, reassignment using a list comprehension will do the job:
df1['date'] = [d.strftime('%Y-%m-%d') if not pd.isnull(d) else '' for d in df1['date']]
import numpy as np
import pandas as pd
Timestamp = pd.Timestamp
nan = np.nan
NaT = pd.NaT
df1 = pd.DataFrame({
    'col1': list('ac'),
    'col2': ['b', nan],
    'date': (Timestamp('2014-08-14'), NaT)
})
df1['col2'] = df1['col2'].fillna('')
df1['date'] = [d.strftime('%Y-%m-%d') if not pd.isnull(d) else '' for d in df1['date']]
print(df1)
yields
col1 col2 date
0 a b 2014-08-14
1 c

@unutbu's answer will work fine, but if you don't want to modify the DataFrame, you could do something like this. to_html takes a parameter for how NaN is represented; to handle NaT, you need to pass a custom formatting function.
date_format = lambda d: pd.to_datetime(d).strftime('%Y-%m-%d') if not pd.isnull(d) else ''
df1.to_html(na_rep='', formatters={'date': date_format})

If all you want to do is convert to a string:
In [37]: df1.to_csv(None, sep=' ')
Out[37]: ' col1 col2 date\n0 a b "2014-08-14 00:00:00"\n1 c \n'
To replace missing values with a string
In [36]: df1.to_csv(None, sep=' ', na_rep='missing_value')
Out[36]: ' col1 col2 date\n0 a b "2014-08-14 00:00:00"\n1 c missing_value missing_value\n'
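If you only want the date portion without the time, to_csv also accepts a date_format argument alongside na_rep (a sketch reusing df1 from above; the exact whitespace may differ):
df1.to_csv(None, sep=' ', na_rep='', date_format='%Y-%m-%d')
# roughly: ' col1 col2 date\n0 a b 2014-08-14\n1 c  \n'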

Related

How to convert string date column to timestamp in a new column in Python Pandas

I have the following example dataframe:
d = {'col1': ["2022-05-16T12:31:00Z", "2021-01-11T11:32:00Z"]}
df = pd.DataFrame(data=d)
df
col1
0 2022-05-16T12:31:00Z
1 2021-01-11T11:32:00Z
I need a second column (say col2) which will have the corresponding timestamp value for each date string in col1.
How can I do that without using a for loop?
Maybe try this?
import pandas as pd
import numpy as np
d = {'col1': ["2022-05-16T12:31:00Z", "2021-01-11T11:32:00Z"]}
df = pd.DataFrame(data=d)
df['col2'] = pd.to_datetime(df['col1'])
df['col2'] = df.col2.values.astype(np.int64) // 10 ** 9
df
Let us try to_datetime
df['col2'] = pd.to_datetime(df['col1'])
df
Out[614]:
col1 col2
0 2022-05-16T12:31:00Z 2022-05-16 12:31:00+00:00
1 2021-01-11T11:32:00Z 2021-01-11 11:32:00+00:00
Update
st = pd.to_datetime('1970-01-01T00:00:00Z')
df['unix'] = (pd.to_datetime(df['col1']) - st).dt.total_seconds()
Out[632]:
0 1.652704e+09
1 1.610365e+09
Name: col1, dtype: float64

Parsing Pandas df Column of mixed data into Datetime

df = pd.DataFrame('23.Jan.2020 01.Mar.2017 5663:33 20.May.2021 626'.split())
I want to convert the date-like elements to datetime and, for numbers, to return the original value.
I have tried
t=pd.to_datetime(df[0], format='%d.%b.%Y', errors='ignore')
which just returns the original df with no change. I have also tried changing errors to 'coerce', which does the conversion for date-like elements, but the numbers are dropped:
t=pd.to_datetime(df[0], format='%d.%b.%Y', errors='coerce')
Then I attempt to return the original df value if NaT, else substitute with the new datetime from t
df.where(t.isnull(), other=t, axis=1)
This works for returning the original df value where NaT, but it doesn't carry over the new datetime values.
Maybe this is what you want?
dt = pd.Series('23.Jan.2020 01.Mar.2017 5663:33 20.May.2021 626'.split())
res = pd.to_datetime(dt, format="%d.%b.%Y", errors='coerce').fillna(dt)
This way the resulting elements in the series have the correct types:
>>> res.map(type)
0 <class 'pandas._libs.tslibs.timestamps.Timesta...
1 <class 'pandas._libs.tslibs.timestamps.Timesta...
2 <class 'str'>
3 <class 'pandas._libs.tslibs.timestamps.Timesta...
4 <class 'str'>
dtype: object
PS: I used a Series because it's easier to pass to to_datetime, and to Series.fillna.
This will combine the two field types in the way you have specified:
import pandas as pd
df = pd.DataFrame('23.Jan.2020 01.Mar.2017 5663:33 20.May.2021 626'.split())
mod = pd.to_datetime(df[0], format='%d.%b.%Y', errors='coerce')
ndf = pd.concat([df, mod], axis=1)
ndf.columns = ['original', 'modified']
def funk(col1, col2):
    return col1 if pd.isnull(col2) else col2

ndf.apply(lambda x: funk(x.original, x.modified), axis=1)
# 0 2020-01-23 00:00:00
# 1 2017-03-01 00:00:00
# 2 5663:33
# 3 2021-05-20 00:00:00
# 4 626

Easiest way to determine whether column in pandas Dataframe contains DATE or DATETIME information

I have the following DF:
col1 col2
1 2017-01-03 2018-03-30 08:01:32
2 2017-01-04 2018-03-30 08:02:32
If I do df.dtypes, I get the following output:
col1 datetime64[ns]
col2 datetime64[ns]
dtype: object
However, col1 contains only date information (DATE), whereas col2 contains both date and time information (DATETIME).
What's the easiest way to determine whether a column contains DATE or DATETIME information?
Data generation:
import pandas as pd
# Generate the df
col1 = ["2017-01-03", "2017-01-04"]
col2 = ["2018-03-30 08:01:32", "2018-03-30 08:02:32"]
df = pd.DataFrame({"col1": col1, "col2": col2})
df["col1"] = pd.to_datetime(df["col1"])
df["col2"] = pd.to_datetime(df["col2"])
According to this SO Question, the following function could do the job:
def check_col(col):
    try:
        dt = pd.to_datetime(df[col])
        if (dt.dt.floor('d') == dt).all():
            return "It's a DATE field"
        else:
            return "It's a DATETIME field"
    except Exception:
        return "could not parse to pandas datetime"
However, isn't there a more straightforward way?
You can try this:
def col_has_time(col):
    dt = pd.to_datetime(df[col])
    # Series values need the .dt accessor; True when any value
    # carries a time component beyond midnight
    return not (dt.dt.floor('d') == dt).all()
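A quick check against the example frame above (a sketch using the function as defined):
for c in ['col1', 'col2']:
    print(c, 'has time information:', col_has_time(c))
# col1 has time information: False
# col2 has time information: True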

What pandas function changes the column type in an "inline" manner?

I know that the following commands could help change the column type:
df['date'] = str(df['date'])
df['A'] = pd.to_datetime(df['A'])
df['A'] = df.A.astype(np.datetime64)
But do you know a better way to change the column type in an inline manner, so that it fits in one line chained with other commands such as groupby, dropna, etc.? For example:
df\
#.function to cast df.A to np.datetime64 \
.groupby('C') \
.apply(lambda x: x.set_index('A').resample('1M').sum())
You can use assign:
df.assign(A=pd.to_datetime(df['A']))
df = pd.DataFrame({'A': ['20150101', '20140702'], 'B': [1, 2]})
df
Out:
A B
0 20150101 1
1 20140702 2
df.assign(A=pd.to_datetime(df['A']))
Out:
A B
0 2015-01-01 1
1 2014-07-02 2
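To tie this back to the chained pipeline in the question, assign also accepts a callable, so the conversion can sit inline before the aggregation (a sketch, assuming columns A and C exist as in the question):
(df.assign(A=lambda d: pd.to_datetime(d['A']))
   .groupby('C')
   .apply(lambda x: x.set_index('A').resample('1M').sum()))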

Convert categorical data in pandas dataframe

I have a dataframe with this type of data (too many columns):
col1 int64
col2 int64
col3 category
col4 category
col5 category
Columns look like this:
Name: col3, dtype: category
Categories (8, object): [B, C, E, G, H, N, S, W]
I want to convert all the values in each column to integer like this:
[1, 2, 3, 4, 5, 6, 7, 8]
I solved this for one column by this:
dataframe['c'] = pandas.Categorical.from_array(dataframe.col3).codes
Now I have two columns in my dataframe: the old col3 and the new c, and I need to drop the old columns.
That's bad practice. It works, but my dataframe has too many columns and I don't want to do this manually.
How can I do this more cleverly?
First, to convert a Categorical column to its numerical codes, you can do this more easily with: dataframe['c'].cat.codes.
Furthermore, it is possible to select all columns with a certain dtype in a dataframe automatically using select_dtypes. This way, you can apply the above operation to multiple, automatically selected columns.
First making an example dataframe:
In [75]: df = pd.DataFrame({'col1':[1,2,3,4,5], 'col2':list('abcab'), 'col3':list('ababb')})
In [76]: df['col2'] = df['col2'].astype('category')
In [77]: df['col3'] = df['col3'].astype('category')
In [78]: df.dtypes
Out[78]:
col1 int64
col2 category
col3 category
dtype: object
Then, by using select_dtypes to select the columns and applying .cat.codes to each of them, you get the following result:
In [80]: cat_columns = df.select_dtypes(['category']).columns
In [81]: cat_columns
Out[81]: Index([u'col2', u'col3'], dtype='object')
In [83]: df[cat_columns] = df[cat_columns].apply(lambda x: x.cat.codes)
In [84]: df
Out[84]:
col1 col2 col3
0 1 0 0
1 2 1 1
2 3 2 0
3 4 0 1
4 5 1 1
This works for me:
pandas.factorize(['B', 'C', 'D', 'B'])[0]
Output:
[0, 1, 2, 0]
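pd.factorize also returns the unique values, so the mapping back to labels comes for free (a small sketch):
codes, uniques = pd.factorize(['B', 'C', 'D', 'B'])
# codes   -> array([0, 1, 2, 0])
# uniques -> array(['B', 'C', 'D'], dtype=object)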
If your concern was only that you were making an extra column and deleting it later, just don't use a new column in the first place.
dataframe = pd.DataFrame({'col1':[1,2,3,4,5], 'col2':list('abcab'), 'col3':list('ababb')})
dataframe.col3 = pd.Categorical.from_array(dataframe.col3).codes
You are done. Now, as Categorical.from_array is deprecated, use Categorical directly:
dataframe.col3 = pd.Categorical(dataframe.col3).codes
If you also need the mapping back from index to label, there is an even better way for the same:
dataframe.col3, mapping_index = pd.Series(dataframe.col3).factorize()
Check below:
print(dataframe)
print(mapping_index.get_loc("c"))
Here, multiple columns need to be converted. One approach I used is:
for col_name in df.columns:
    if df[col_name].dtype == 'object':
        df[col_name] = df[col_name].astype('category')
        df[col_name] = df[col_name].cat.codes
This converts all string/object-type columns to categorical, then applies codes to each category.
What I do is replace the values, like this:
df['col'].replace(to_replace=['category_1', 'category_2', 'category_3'], value=[1, 2, 3], inplace=True)
In this way, if the col column has categorical values, they get replaced by the numerical values.
For converting categorical data in column C of dataset data, we need to do the following:
from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()  # initializing an object of class LabelEncoder
data['C'] = labelencoder.fit_transform(data['C'])  # fitting and transforming the desired categorical column
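If you later need the original labels back, the fitted encoder keeps the mapping (a sketch, reusing the labelencoder from above):
data['C'] = labelencoder.inverse_transform(data['C'])  # restores the original categories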
To convert all the columns in the Dataframe to numerical data:
df2 = df2.apply(lambda x: pd.factorize(x)[0])
Answers here seem outdated. Pandas now has a factorize() function and you can create categories as:
df.col.factorize()
Function signature:
pandas.factorize(values, sort=False, na_sentinel=-1, size_hint=None)
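Note that sort=True assigns codes by sorted label order instead of order of appearance (a small sketch):
codes, uniques = pd.factorize(['b', 'a', 'c', 'b'], sort=True)
# codes   -> array([1, 0, 2, 1])
# uniques -> array(['a', 'b', 'c'], dtype=object)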
One of the simplest ways to convert a categorical variable into dummy/indicator variables is to use get_dummies, provided by pandas.
Say, for example, we have data in which sex is a categorical value (male & female) and you need to convert it into a dummy/indicator variable. Here is how to do it:
training_data = pd.read_csv("../titanic/train.csv")
features = ["Age", "Sex"]  # here Sex is a categorical value
X_train = pd.get_dummies(training_data[features])
print(X_train)
Age Sex_female Sex_male
20 0 1
33 1 0
40 1 0
22 1 0
54 0 1
You can use .replace as follows:
df['col3'] = df['col3'].replace(['B', 'C', 'E', 'G', 'H', 'N', 'S', 'W'], [1, 2, 3, 4, 5, 6, 7, 8])
or .map (note that the dictionary keys must be the existing values, mapping to the new codes):
df['col3'] = df['col3'].map({'B': 1, 'C': 2, 'E': 3, 'G': 4, 'H': 5, 'N': 6, 'S': 7, 'W': 8})
categorical_columns = ['sex', 'class', 'deck', 'alone']
for column in categorical_columns:
    df[column] = pd.factorize(df[column])[0]
factorize turns each unique categorical value in a column into a distinct integer (starting from 0).
@Quickbeam2k1, see below:
import sys
import numpy as np
import pandas as pd

dataset = pd.read_csv('Data2.csv')
np.set_printoptions(threshold=sys.maxsize)  # print arrays in full
X = dataset.iloc[:, :].values
Using sklearn
from sklearn.preprocessing import LabelEncoder
labelencoder_X=LabelEncoder()
X[:,0] = labelencoder_X.fit_transform(X[:,0])
You can do it with less code, like below:
f = pd.DataFrame({'col1': [1, 2, 3, 4, 5], 'col2': list('abcab'), 'col3': list('ababb')})
f['col1'] = f['col1'].astype('category').cat.codes
f['col2'] = f['col2'].astype('category').cat.codes
f['col3'] = f['col3'].astype('category').cat.codes
f
Just use manual matching:
mapping = {'Non-Travel': 0, 'Travel_Rarely': 1, 'Travel_Frequently': 2}  # avoid shadowing the built-in dict
df['BusinessTravel'] = df['BusinessTravel'].map(mapping)
For a certain column, if you don't care about the ordering, use this
df['col1_num'] = df['col1'].apply(lambda x: np.where(df['col1'].unique()==x)[0][0])
If you care about the ordering, specify the categories as a list and use this:
df['col1_num'] = df['col1'].apply(lambda x: ['first', 'second', 'third'].index(x))
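If the ordering matters, an ordered Categorical is an alternative that avoids the per-row list lookup (a sketch, assuming col1 only contains values from the list):
order = ['first', 'second', 'third']
df['col1_num'] = pd.Categorical(df['col1'], categories=order, ordered=True).codes  # values outside order get code -1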
