I have come across a CSV file that contains a date column formatted in the following manner: xx:xx.x, here's a couple of the data present in the column marked as date:
07:33.0
34:53.0
06:30.0
30:09.0
02:18.0
My question is what type of formatting is this? And how can I convert it to a proper date format using Python?
It looks like times without hours.
You can create timedeltas by prepending the missing hours field ('00:') and calling to_timedelta:
df['col'] = pd.to_timedelta('00:' + df['col'])
print(df)
col
0 0 days 00:07:33
1 0 days 00:34:53
2 0 days 00:06:30
3 0 days 00:30:09
4 0 days 00:02:18
Or convert to datetimes with to_datetime; a default date is added:
df['col'] = pd.to_datetime(df['col'], format='%M:%S.%f')
print(df)
col
0 1900-01-01 00:07:33
1 1900-01-01 00:34:53
2 1900-01-01 00:06:30
3 1900-01-01 00:30:09
4 1900-01-01 00:02:18
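A minimal self-contained sketch of both conversions on the sample values (the DataFrame construction here is just for illustration):

```python
import pandas as pd

df = pd.DataFrame({'col': ['07:33.0', '34:53.0', '06:30.0']})

# As durations: prepend the missing hours field and parse as timedeltas
td = pd.to_timedelta('00:' + df['col'])
print(td.dt.total_seconds().tolist())  # [453.0, 2093.0, 390.0]

# As datetimes: parse minutes:seconds.fraction; pandas fills in 1900-01-01
dt = pd.to_datetime(df['col'], format='%M:%S.%f')
print(dt.dt.minute.tolist())  # [7, 34, 6]
```

Prefer the timedelta form if the values are durations rather than clock times, since timedeltas support arithmetic like summing and averaging directly.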
I have a dataset that has mixed data types in the Date column.
For example, the column looks like this:
ID Date
1 2019-01-01
2 2019-01-02
3 2019-11-01
4 40993
5 40577
6 39949
When I just try to convert the column using pd.to_datetime, I get an error message "mixed datetimes and integers in passed array".
I would really appreciate it if someone could help me out with this! Ideally, it would be nice to have all rows in 'yyyy-mm-dd' format.
Thank you!
I'm guessing those are in Excel serial date format?
Convert Excel style date with pandas
import pandas as pd
import xlrd

def read_date(date):
    try:
        # Integers are Excel serial dates (days since 1899-12-30, datemode 0)
        return xlrd.xldate.xldate_as_datetime(int(date), 0)
    except ValueError:
        # Anything non-numeric is already a date string
        return pd.to_datetime(date)

df['New Date'] = df['Date'].apply(read_date)
df
Out[1]:
ID Date New Date
0 1 2019-01-01 2019-01-01
1 2 2019-01-02 2019-01-02
2 3 2019-11-01 2019-11-01
3 4 40993 2012-03-25
4 5 40577 2011-02-03
5 6 39949 2009-05-16
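If you would rather avoid the xlrd dependency, the same conversion can be sketched in pandas alone, since Excel serial dates (datemode 0) count days from 1899-12-30 and to_datetime accepts an origin. A minimal sketch on sample data (column names as in the question):

```python
import pandas as pd

df = pd.DataFrame({'Date': ['2019-01-01', '2019-01-02', 40993, 40577]})

# Numeric entries are Excel serial days since 1899-12-30;
# everything else is already a date string.
num = pd.to_numeric(df['Date'], errors='coerce')
df['New Date'] = pd.to_datetime(num, unit='D', origin='1899-12-30')
mask = num.isna()
df.loc[mask, 'New Date'] = pd.to_datetime(df.loc[mask, 'Date'])
print(df['New Date'].dt.strftime('%Y-%m-%d').tolist())
# ['2019-01-01', '2019-01-02', '2012-03-25', '2011-02-03']
```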
I have a dataframe like this; how do I sort it?
df = pd.DataFrame({'Date':['Oct20','Nov19','Jan19','Sep20','Dec20']})
Date
0 Oct20
1 Nov19
2 Jan19
3 Sep20
4 Dec20
I'm familiar with sorting a list of date strings:
a.sort(key=lambda date: datetime.strptime(date, "%d-%b-%y"))
Any thoughts? Should I split the column?
First convert the column to datetimes, get the positions of the sorted values with Series.argsort, and use them to reorder the rows with DataFrame.iloc:
df = df.iloc[pd.to_datetime(df['Date'], format='%b%y').argsort()]
print(df)
Date
2 Jan19
1 Nov19
3 Sep20
0 Oct20
4 Dec20
Details:
print(pd.to_datetime(df['Date'], format='%b%y'))
0 2020-10-01
1 2019-11-01
2 2019-01-01
3 2020-09-01
4 2020-12-01
Name: Date, dtype: datetime64[ns]
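On pandas 1.1 and later, sort_values accepts a key callable, which avoids the argsort/iloc indirection. A sketch with the same data:

```python
import pandas as pd

df = pd.DataFrame({'Date': ['Oct20', 'Nov19', 'Jan19', 'Sep20', 'Dec20']})

# key is applied to the column before comparing, so rows are ordered
# chronologically while keeping the original strings
out = df.sort_values('Date', key=lambda s: pd.to_datetime(s, format='%b%y'))
print(out['Date'].tolist())  # ['Jan19', 'Nov19', 'Sep20', 'Oct20', 'Dec20']
```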
I have a large dataframe (several million rows) where one of my columns is a timestamp (labeled 'Timestamp') in the format "hh:mm:ss" e.g. "07:00:04". I want to drop the rows where the hour is NOT between or equal to 7 and 21.
I have tried converting the timestamps to strings and slicing them, but I could not get it working, and I believe there should be a more effective way.
# Create list of opening hours (these should not be dropped)
opening_hour = 7
closing_hour = 21
trading_hours = []
for hour in range(closing_hour - opening_hour + 1):
    add_hour = opening_hour + hour
    trading_hours.append(add_hour)
My dataframe looks something like this:
Date Timestamp Close
0 20180102 07:05:00 12925.979
1 20180102 21:05:02 12925.479
2 20180102 22:05:04 12925.280
3 20180102 23:55:06 12925.479
4 20180102 06:05:07 12925.780
5 20180103 07:05:07 12925.780
[...]
I want to drop the rows with index 2, 3 and 4 (there are several thousand), so the result should be something like:
Date Timestamp Close
0 20180102 07:05:00 12925.979
1 20180102 21:05:02 12925.479
2 20180103 07:05:07 12925.780
[...]
First you can give your DataFrame a proper DatetimeIndex as follows:
dtidx = pd.DatetimeIndex(df['Date'].astype(str) + ' ' + df['Timestamp'].astype(str))
df.index = dtidx
and then use between_time to keep the times whose hour is between 7 and 21 inclusive:
df.between_time('07:00', '21:59:59')
# returns
Date Timestamp Close
2018-01-02 07:05:00 20180102 07:05:00 12926
2018-01-02 21:05:02 20180102 21:05:02 12925.5
2018-01-03 07:05:07 20180103 07:05:07 12925.8
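Putting the two steps together on the sample rows (a minimal, self-contained sketch; the DataFrame is reconstructed from the question):

```python
import pandas as pd

df = pd.DataFrame({
    'Date': [20180102, 20180102, 20180102, 20180102, 20180102, 20180103],
    'Timestamp': ['07:05:00', '21:05:02', '22:05:04', '23:55:06',
                  '06:05:07', '07:05:07'],
    'Close': [12925.979, 12925.479, 12925.280, 12925.479,
              12925.780, 12925.780],
})

# Build a DatetimeIndex from the two columns, then filter by time of day
df.index = pd.DatetimeIndex(df['Date'].astype(str) + ' ' + df['Timestamp'])
kept = df.between_time('07:00', '21:59:59')
print(kept['Timestamp'].tolist())  # ['07:05:00', '21:05:02', '07:05:07']
```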
Since you mentioned slicing, and someone has already shown how to do that, I would like to introduce extracting the hour with dt.hour.
First convert your date column from strings to datetimes:
df['date'] = pd.to_datetime(df['date'])
You can now easily extract the hour part using dt.hour:
df['hour'] = df['date'].dt.hour
You can also extract year, month, second, and so on in a similar way.
Now you can do normal filtering as you would do with other dataframes:
df[(df.hour >= 7) & (df.hour <= 21)]
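A minimal runnable version of this approach, assuming the time lives in a 'Timestamp' column as in the question:

```python
import pandas as pd

df = pd.DataFrame({
    'Timestamp': ['07:05:00', '21:05:02', '22:05:04', '23:55:06', '06:05:07'],
    'Close': [12925.979, 12925.479, 12925.280, 12925.479, 12925.780],
})

# Parse the time strings, extract the hour, and filter with a boolean mask
hour = pd.to_datetime(df['Timestamp'], format='%H:%M:%S').dt.hour
kept = df[(hour >= 7) & (hour <= 21)]
print(kept['Timestamp'].tolist())  # ['07:05:00', '21:05:02']
```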
I prefer the other answers, which work with proper timestamp data types, but since you mentioned trying and failing with string slicing, it might be helpful to see a string-slicing solution that does work:
df['Hour'] = df['Timestamp'].str.slice(0, 2).astype(int)
df[(df['Hour'] >= 7) & (df['Hour'] <= 21)]
The first line creates a new integer column from the slice of the string which represents the hour, and the second line filters on said new column.
Date Timestamp Close Hour
0 20180102 07:05:00 12925.979 7
1 20180102 21:05:02 12925.479 21
5 20180103 07:05:07 12925.780 7
My guess would be to use DataFrame.between_time.
df['Timestamp'] = pd.to_datetime(df['Timestamp'])
df.set_index('Timestamp').between_time('07:00:00', '21:59:59')
Timestamp Date Close
2019-07-22 07:05:00 20180102 12925.979
2019-07-22 21:05:02 20180102 12925.479
2019-07-22 07:05:07 20180103 12925.78
I have a column in my dataframe which I want to convert to a Timestamp. However, it is in a bit of a strange format that I am struggling to manipulate. The column is in the format HHMMSS, but does not include the leading zeros.
For example for a time that should be '00:03:15' the dataframe has '315'. I want to convert the latter to a Timestamp similar to the former. Here is an illustration of the column:
message_time
25
35
114
1421
...
235347
235959
Thanks
Use Series.str.zfill to add the leading zeros, then to_datetime:
s = df['message_time'].astype(str).str.zfill(6)
df['message_time'] = pd.to_datetime(s, format='%H%M%S')
print(df)
message_time
0 1900-01-01 00:00:25
1 1900-01-01 00:00:35
2 1900-01-01 00:01:14
3 1900-01-01 00:14:21
4 1900-01-01 23:53:47
5 1900-01-01 23:59:59
In my opinion it is better here to create timedeltas with to_timedelta:
s = df['message_time'].astype(str).str.zfill(6)
df['message_time'] = pd.to_timedelta(s.str[:2] + ':' + s.str[2:4] + ':' + s.str[4:])
print(df)
message_time
0 00:00:25
1 00:00:35
2 00:01:14
3 00:14:21
4 23:53:47
5 23:59:59
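The practical payoff of the timedelta form is duration arithmetic. A short sketch with the sample values:

```python
import pandas as pd

s = pd.Series([25, 35, 114, 1421, 235347, 235959]).astype(str).str.zfill(6)
td = pd.to_timedelta(s.str[:2] + ':' + s.str[2:4] + ':' + s.str[4:])

# Timedeltas support arithmetic directly, unlike 1900-anchored datetimes
print(td.dt.total_seconds().tolist())
# [25.0, 35.0, 74.0, 861.0, 86027.0, 86399.0]
print(td.sum())
```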
I am currently trying to reproduce this: convert numeric sas date to datetime in Pandas, but I get the following error:
"Python int too large to convert to C long"
Here is an example of my dates:
0 1.416096e+09
1 1.427069e+09
2 1.433635e+09
3 1.428624e+09
4 1.433117e+09
Name: dates, dtype: float64
Any ideas?
Here is a slightly hacky solution. If the date column is called 'date', try
df['date'] = pd.to_datetime(df['date'] - 315619200, unit = 's')
Here 315619200 is the number of seconds between Jan 1, 1960 (the SAS epoch) and Jan 1, 1970 (the Unix epoch).
You get
0 2004-11-15 00:00:00
1 2005-03-22 00:03:20
2 2005-06-05 23:56:40
3 2005-04-09 00:00:00
4 2005-05-31 00:03:20
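An equivalent spelling that avoids hard-coding the offset is to pass the SAS epoch through to_datetime's origin parameter (available since pandas 0.20). A sketch on the sample values:

```python
import pandas as pd

s = pd.Series([1.416096e+09, 1.427069e+09, 1.433635e+09])

# origin moves the reference epoch, so no hand-computed constant is needed
converted = pd.to_datetime(s, unit='s', origin='1960-01-01')
print(converted[0])  # 2004-11-15 00:00:00
```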