Pandas: How to read ill-formatted time data? - python

The time in my dataframe consists of 2 columns: date and HrMn, like this:
How can I read them into time, and plot a time series plot? (There are other value columns, for example, speed).
I think I can get away with time.strptime('19900125'+'1200','%Y%m%d%H%M')
But the problem is that, when read from the csv, HrMn at 0000 would be parsed as 0, so
time.strptime('19900125'+'0','%Y%m%d%H%M') will fail.
UPDATE:
My current approach:
# When reading the data, parse HrMn as string
df = pd.read_csv(uipath,header=0, skipinitialspace=True, dtype={'HrMn': str})
df['time']=df.apply(lambda x:datetime.strptime("{0} {1}".format(x['date'],x['HrMn']), "%Y%m%d %H%M"),axis=1)# df.temp_date
df.index= df['time']
# Then parse it again as int
df['HrMn'] = df['HrMn'].astype(int)

You can use pd.to_datetime after you've transformed it into a string that looks like a date:
def to_date_str(r):
    d = r.date[: 4] + '-' + r.date[4: 6] + '-' + r.date[6: 8]
    d += ' '+ r.HrMn[: 2] + ':' + r.HrMn[2: 4]
    return d
>>> pd.to_datetime(df[['date', 'HrMn']].apply(to_date_str, axis=1))
0 1990-01-25 12:00:00
dtype: datetime64[ns]
Edit
As @EdChum comments, you can do this even more simply as
pd.to_datetime(df.date.astype(str) + df.HrMn)
which string-concatenates the columns.
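If HrMn has already been read as an integer (so 0000 became 0), the concatenation needs zero-padding first. A minimal sketch, assuming a small made-up frame that reuses the question's column names (the values and the speed column are illustrative only); str.zfill restores the four-digit HHMM form before parsing:

```python
import pandas as pd

# Hypothetical sample data: HrMn was read as an int, so "0000" became 0.
df = pd.DataFrame(
    {"date": [19900125, 19900125], "HrMn": [0, 1230], "speed": [3.2, 4.1]}
)

# Restore the zero-padded HHMM form, then parse date and time in one pass.
stamp = df["date"].astype(str) + df["HrMn"].astype(str).str.zfill(4)
df.index = pd.to_datetime(stamp, format="%Y%m%d%H%M")

print(df.index)
```

With a DatetimeIndex in place, plotting a value column such as df["speed"] against time should work directly, and HrMn can stay an integer.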

You may parse the dates directly while reading the CSV, where HrMn is zero padded as HHMM, i.e. a value of 0 will represent 00:00:
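from datetime import datetime  # pd.datetime was removed in pandas 2.0

df = pd.read_csv(
    uipath,
    header=0,
    skipinitialspace=True,
    dtype={'HrMn': str},
    parse_dates={'datetime': ['date', 'HrMn']},
    # Zero-pad HrMn to four digits so that 0 is parsed as 00:00
    date_parser=lambda x, y: datetime.strptime('{0}{1:04.0f}'.format(x, int(y)),
                                               '%Y%m%d%H%M'),
    index_col='datetime'
)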

I don't get why you call it "ill formatted"; that format is actually quite common and pandas can parse it as is. Just specify which columns you want to parse as timestamps.
df = pd.read_csv(uipath, skipinitialspace=True,
parse_dates=[['date', 'HrMn']])

Related

How to add seconds in a datetime

I need to add seconds in YYYY-MM-DD-HH-MM-SS. My code works perfectly for one data point but not for the whole set. The data.txt consists of 7 columns and around 200 rows.
import numpy as np
import pandas as pd
from datetime import datetime, timedelta
df = pd.read_csv('data.txt',sep='\t',header=None)
a = np.array(list(df[0]))
b = np.array(list(df[1]))
c = np.array(list(df[2]))
d = np.array(list(df[3]))
e = np.array(list(df[4]))
f = np.array(list(df[5]))
g = np.array(list(df[6]))
t1=datetime(year=a, month=b, day=c, hour=d, minute=e, second=f)
t = t1 + timedelta(seconds=g)
print(t)
You can pass the parameter names to read_csv to set the column names in the first step, then convert the first 6 columns to datetimes with to_datetime and add the seconds, converted to timedeltas with to_timedelta:
names = ["year","month","day","hour","minute","second","new"]
df = pd.read_csv('data.txt',sep='\t',names=names)
# Pass only the six date/time columns; to_datetime rejects extra keys such as "new"
df['out'] = pd.to_datetime(df[names[:6]]) + pd.to_timedelta(df["new"], unit='s')
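To try this approach without the original data.txt, here is a minimal sketch on two made-up rows (the values are assumptions chosen so the added seconds roll over a minute and a day boundary):

```python
import io
import pandas as pd

# Made-up two-row sample standing in for data.txt (tab-separated, no header).
raw = "2019\t3\t14\t23\t59\t50\t15\n2020\t1\t1\t0\t0\t0\t90\n"

names = ["year", "month", "day", "hour", "minute", "second", "new"]
df = pd.read_csv(io.StringIO(raw), sep="\t", names=names)

# Assemble datetimes from the six component columns only, then add the offset.
df["out"] = pd.to_datetime(df[names[:6]]) + pd.to_timedelta(df["new"], unit="s")
print(df["out"])
```

The first row should land on the next day, since 23:59:50 plus 15 seconds crosses midnight.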
Use apply with axis=1 to apply a function to every row of the dataframe.
df.apply(lambda x: datetime(year=x[0],
                            month=x[1],
                            day=x[2],
                            hour=x[3],
                            minute=x[4],
                            second=x[5]) + timedelta(seconds=int(x[6])), axis=1)
Generating a test dataset; the conversion is then simple to do on pandas Series:
s = 20
df = pd.DataFrame(np.array([np.random.randint(2015,2020,s),np.random.randint(1,12,s),np.random.randint(1,28,s),
                            np.random.randint(0,23,s), np.random.randint(0,59,s), np.random.randint(0,59,s),
                            np.random.randint(0,200,s)]).T,
                  columns=["year","month","day","hour","minute","second","add"])
pd.to_datetime(df.loc[:,["year","month","day","hour","minute","second"]]) + df["add"].apply(lambda s: pd.Timedelta(seconds=s))
Without using apply():
pd.to_datetime(df.loc[:,["year","month","day","hour","minute","second"]]) + pd.to_timedelta(df["add"], unit="s")

Convert DataFrame column date from 2/3/2007 format to 20070223 with python

I have a dataframe with 'Date' and 'Value', where the Date is in format m/d/yyyy. I need to convert to yyyymmdd.
df2= df[["Date", "Transaction"]]
I know datetime can do this for me, but I can't get it to accept my format.
example data files:
6/15/2006,-4.27,
6/16/2006,-2.27,
6/19/2006,-6.35,
You first need to convert to datetime, using pd.to_datetime, then you can format it as you wish using strftime:
>>> df
Date Transaction
0 6/15/2006 -4.27
1 6/16/2006 -2.27
2 6/19/2006 -6.35
df['Date'] = pd.to_datetime(df['Date'],format='%m/%d/%Y').dt.strftime('%Y%m%d')
>>> df
Date Transaction
0 20060615 -4.27
1 20060616 -2.27
2 20060619 -6.35
You can say:
df['Date']=df['Date'].dt.strftime('%Y%m%d')
The dt accessor's strftime method is your friend here.
Note: if you haven't converted the column to pandas datetime yet, do:
df['Date']=pd.to_datetime(df['Date']).dt.strftime('%Y%m%d')
Output:
Date Transaction
0 20060615 -4.27
1 20060616 -2.27
2 20060619 -6.35
For a raw python solution, you could try something along the following (assuming datafile is a string).
datafile="6/15/2006,-4.27,\n6/16/2006,-2.27,\n6/19/2006,-6.35"
def zeroPad(s, desiredLen):
    # Left-pad s with zeros until it reaches desiredLen characters
    while len(s) < desiredLen:
        s = "0" + s
    return s
def convToYYYYMMDD(datafile):
    datafile = ''.join(datafile.split('\n')) # remove \n's, they're unreliable and not needed
    datafile = datafile.split(',') # split by comma so '1,2' becomes ['1','2']
    out = []
    for i in range(0, len(datafile)):
        if i % 2 == 0:
            tmp = datafile[i].split('/')
            yyyymmdd = zeroPad(tmp[2], 4) + zeroPad(tmp[0], 2) + zeroPad(tmp[1], 2)
            out.append(yyyymmdd)
        else:
            out.append(datafile[i])
    return out
print(convToYYYYMMDD(datafile))
This outputs: ['20060615', '-4.27', '20060616', '-2.27', '20060619', '-6.35'].
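The manual padding can also be left to the standard library. A minimal sketch, assuming each input line has the m/d/yyyy date as its first comma-separated field; the helper name to_yyyymmdd is made up for this example:

```python
from datetime import datetime

def to_yyyymmdd(text):
    """Convert an m/d/yyyy date string to yyyymmdd."""
    # strptime accepts one- or two-digit months and days; strftime zero-pads.
    return datetime.strptime(text, "%m/%d/%Y").strftime("%Y%m%d")

rows = ["6/15/2006,-4.27,", "6/16/2006,-2.27,", "6/19/2006,-6.35,"]
converted = [to_yyyymmdd(row.split(",")[0]) for row in rows]
print(converted)
```

Unlike string slicing, this also rejects invalid dates such as 13/40/2006 with a ValueError.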

python datetime convert, dates may contains whitespaces

I have a .csv file with a date column, and the date looks like this.
date
2016年 4月 1日 <-- there are whitespaces in this row
...
2016年10月10日
The date format is Japanese date format. I'm trying to convert this column to 'YYYY-MM-DD', and the python code I'm using is below.
data['date'] = [datetime.datetime.strptime(d, '%Y年%m月%d日').date() for d in data['date']]
There is one problem: the date column in the .csv may contain whitespace when the month or day is a single digit, and my code fails when there is whitespace.
Any solutions?
In pandas it is best to avoid list comprehensions when a vectorized solution exists, both for performance and because list comprehensions do not handle NaNs.
Remove the whitespace with str.replace and the regex \s+ (one or more whitespace characters), convert with pandas.to_datetime, and finally take .dt.date to get dates. From pandas 2.0 on, str.replace matches literally by default, so pass regex=True:
data['date'] = (pd.to_datetime(data['date'].str.replace(r'\s+', '', regex=True), format='%Y年%m月%d日')
                .dt.date)
Performance:
The plot was created with perfplot:
def list_compr(df):
    df['date1'] = [datetime.datetime.strptime(d.replace(" ", ""), '%Y年%m月%d日').date() for d in df['date']]
    return df
def vector(df):
    df['date2'] = (pd.to_datetime(df['date'].str.replace(r'\s+', '', regex=True), format='%Y年%m月%d日').dt.date)
    return df
def make_df(n):
    df = pd.DataFrame({'date':['2016年 4月 1日','2016年10月10日']})
    df = pd.concat([df] * n, ignore_index=True)
    return df
perfplot.show(
    setup=make_df,
    kernels=[list_compr, vector],
    n_range=[2**k for k in range(2, 13)],
    logx=True,
    logy=True,
    equality_check=False, # rows may appear in different order
    xlabel='len(df)')
I don't know Python actually, but wouldn't something like replacing d in strptime with d.replace(" ", "") do the trick?
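That suggestion does work for the two sample dates. A minimal sketch using only the standard library; note that it swaps d.replace(" ", "") for "".join(d.split()), an assumption made here so that tabs and full-width spaces are removed as well:

```python
import datetime

raw_dates = ["2016年 4月 1日", "2016年10月10日"]

# Drop every whitespace character before handing the string to strptime.
parsed = [
    datetime.datetime.strptime("".join(d.split()), "%Y年%m月%d日").date()
    for d in raw_dates
]
print([p.isoformat() for p in parsed])
```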

Adding a specified value to each in a pandas data frame

I am iterating over the available rows, but that doesn't seem to be the best way to do it -- it takes forever.
Is there a special way to do this in Pandas?
INIT_TIME = datetime.datetime.strptime(date + ' ' + time, "%Y-%B-%d %H:%M:%S")
#NEED TO ADD DATA FROM THAT COLUMN
df = pd.read_csv(dataset_path, delimiter=',',skiprows=range(0,1),names=['TCOUNT','CORE','COUNTER','EMPTY','NAME','TSTAMP','MULT','STAMPME'])
df = df.drop('MULT',1)
df = df.drop('EMPTY',1)
df = df.drop('TSTAMP', 1)
for index, row in df.iterrows():
    TMP_TIME = INIT_TIME + datetime.timedelta(seconds=row['TCOUNT'])
    df['STAMPME'] = TMP_TIME.strftime("%s")
In addition, the datetime I am adding is in the following format
2017-05-11 11:12:37.100192 1494493957
2017-05-11 11:12:37.200541 1494493957
and therefore the unix timestamp is same (and it is correct), but is there a better way to represent it?
Assuming the datetimes are correctly reflecting what you're trying to do, with respect to Pandas you should be able to do:
df['STAMPME'] = df['TCOUNT'].apply(lambda x: (datetime.timedelta(seconds=x) + INIT_TIME).strftime("%s"))
As the iterrows() documentation notes, you should not modify the DataFrame you are iterating over. If you need to iterate row by row (as opposed to using the apply method) you can use another data object, e.g. a list, to retain the values you're calculating, and then create a new column from that.
Also, for future reference, the itertuples() method is faster than iterrows(), although each row is a namedtuple, so you access values by attribute or position (row.TCOUNT or row[x]) rather than by key (row['name']).
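The row-by-row variant described above might look like the following sketch. The INIT_TIME value and the small frame are made-up stand-ins for the question's data; only the TCOUNT column name comes from the original:

```python
import datetime
import pandas as pd

# Hypothetical stand-ins for the question's INIT_TIME and TCOUNT column.
INIT_TIME = datetime.datetime(2017, 5, 11, 11, 12, 37)
df = pd.DataFrame({"TCOUNT": [0.1, 0.2, 1.5], "CORE": [0, 1, 0]})

# Collect results in a plain list instead of writing into df while iterating.
stamps = []
for row in df.itertuples(index=False):
    # itertuples yields namedtuples, so columns are reachable by attribute.
    stamps.append(INIT_TIME + datetime.timedelta(seconds=row.TCOUNT))

# Assign the whole column once, after the loop has finished.
df["STAMPME"] = stamps
print(df)
```

This keeps the loop free of side effects on df, at the cost of still being slower than a fully vectorized expression.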
I'd rewrite your code like this
INIT_TIME = datetime.datetime.strptime(date + ' ' + time, "%Y-%B-%d %H:%M:%S")
INIT_TIME = pd.to_datetime(INIT_TIME)
df = pd.read_csv(
dataset_path, delimiter=',',skiprows=range(0,1),
names=['TCOUNT','CORE','COUNTER','EMPTY','NAME','TSTAMP','MULT','STAMPME']
)
df = df.drop(columns=['MULT', 'EMPTY', 'TSTAMP'])  # positional axis argument was removed in pandas 2.0
df['STAMPME'] = pd.to_timedelta(df['TCOUNT'], 's') + INIT_TIME
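On the question of a better representation than whole Unix seconds: one option is to keep the column as datetimes and, where a number is needed, divide by a one-second Timedelta to get fractional epoch seconds. A sketch under the assumption that the naive timestamps are meant as UTC; init_time and tcount are illustrative values, not the asker's data:

```python
import pandas as pd

# Hypothetical values standing in for INIT_TIME and the TCOUNT column.
init_time = pd.Timestamp("2017-05-11 11:12:37")
tcount = pd.Series([0.25, 0.5, 2.0])

# Vectorized addition: one timestamp per row, with sub-second precision.
stamps = init_time + pd.to_timedelta(tcount, unit="s")

# Epoch seconds as floats keep the fractional part that "%s" would drop.
epoch = (stamps - pd.Timestamp("1970-01-01")) / pd.Timedelta(seconds=1)
print(epoch)
```

If the original times are local rather than UTC, the epoch values will be off by the UTC offset, so localize the timestamps first in that case.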

pandas: read_csv excluding only certain rows

I'm trying to import a csv file that looks like this
Irrelevant row
"TIMESTAMP","RECORD","Site","Logger","Avg_70mSE_Avg","Avg_60mS_Avg",
"TS","RN","","","metres/second","metres/second",
"","","Smp","Smp","Avg","Avg",
"2010-05-18 12:30:00",0,"Sisters",5068,5.162,4.996
"2010-05-18 12:40:00",1,"Sisters",5068,5.683,5.571
The second row is the header but rows 0, 2, 3 are irrelevant. My code at the moment is:
parse = lambda x: datetime.strptime(x, '%Y-%m-%d %H:%M:%S')
df = pd.read_csv('data.csv', header=1, index_col=['TIMESTAMP'],
parse_dates=['TIMESTAMP'], date_parser = parse)
The problem is that since rows 2 and 3 don't have correct dates I get an error (or at least I think this is the error).
Would it be possible to exclude these rows, using something like skiprows, but for rows that are not in the beginning of the file? Or do you have any other suggestions?
You can use the skiprows keyword to ignore the rows:
pd.read_csv('data.csv', skiprows=[0, 2, 3],
index_col=['TIMESTAMP'], parse_dates=['TIMESTAMP'])
Which for your sample data gives:
RECORD Site Logger Avg_70mSE_Avg Avg_60mS_Avg
TIMESTAMP
2010-05-18 12:30:00 0 Sisters 5068 5.162 4.996
2010-05-18 12:40:00 1 Sisters 5068 5.683 5.571
The first parsed row (1) becomes the header and read_csv's default parser correctly parses the timestamp column.
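The same call can be tried without a file by inlining the question's sample. This sketch also shows that skiprows accepts a callable, which is an alternative worth knowing when the rows to skip follow a rule rather than a fixed list:

```python
import io
import pandas as pd

# The sample from the question, inlined so the sketch runs without data.csv.
raw = '''Irrelevant row
"TIMESTAMP","RECORD","Site","Logger","Avg_70mSE_Avg","Avg_60mS_Avg",
"TS","RN","","","metres/second","metres/second",
"","","Smp","Smp","Avg","Avg",
"2010-05-18 12:30:00",0,"Sisters",5068,5.162,4.996
"2010-05-18 12:40:00",1,"Sisters",5068,5.683,5.571
'''

# Skip file lines 0, 2 and 3 (0-based); line 1 then serves as the header.
df = pd.read_csv(io.StringIO(raw), skiprows=[0, 2, 3],
                 index_col=["TIMESTAMP"], parse_dates=["TIMESTAMP"])

# Equivalent form with a callable that receives each 0-based line number.
df2 = pd.read_csv(io.StringIO(raw), skiprows=lambda i: i in {0, 2, 3},
                  index_col=["TIMESTAMP"], parse_dates=["TIMESTAMP"])

print(df.index)
```

Because the header line ends with a trailing comma, pandas may add an extra unnamed, empty column; it can be dropped afterwards if it gets in the way.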
