I have a dataframe as below
customerid mydate
123 2016-08-15 18:22:40
234 2017-08-15 42:34.04
234 39:35.01
The mydate column is mixed: some values have a date with a time, others only a time. The mydate column has object dtype, but I want to convert it to datetime as below
df['mydate'] = pd.to_datetime(df['mydate'], format='%Y-%m-%d %H:%M:%S')
but I get the error below
ValueError: time data 39:35.01 doesn't match format specified
You can use:
#get today's date at midnight
today = pd.Timestamp.today().floor('D')
#split the values on whitespace
s = df['mydate'].str.split()
#convert the first part to datetime where possible; replace missing dates with today
df['date'] = pd.to_datetime(s.str[0], errors='coerce').fillna(today)
#where there is no second part, fall back to the first part
s1 = s.str[1].combine_first(s.str[0])
#if the third character from the end is '.', prepend zero hours, then convert to timedeltas
df['td'] = pd.to_timedelta(s1.mask(s1.str[-3] == '.', '00:'+ s1))
#add the timedeltas to the dates
df['datefinal'] = df['date'] + df['td']
print (df)
customerid mydate date td \
0 123 2016-08-15 18:22:40 2016-08-15 18:22:40
1 234 2017-08-15 42:34.04 2017-08-15 00:42:34.040000
2 234 39:35.01 2018-11-21 00:39:35.010000
datefinal
0 2016-08-15 18:22:40.000
1 2017-08-15 00:42:34.040
2 2018-11-21 00:39:35.010
Your mydate column has non-datetime values, so you need to cross-check the integrity of the data source. And even if the format were right, with the information provided such a conversion isn't feasible, neither in Python nor in any other language (how would you know on which date that particular time was recorded?).
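To cross-check the data as suggested, one option is to list the rows that do not match the expected format. This is only a minimal sketch: the sample frame is rebuilt from the question, and the names parsed and bad_rows are made up for illustration.

```python
import pandas as pd

# Sample data rebuilt from the question
df = pd.DataFrame({
    "customerid": [123, 234, 234],
    "mydate": ["2016-08-15 18:22:40", "2017-08-15 42:34.04", "39:35.01"],
})

# errors="coerce" turns values that do not match the format into NaT
parsed = pd.to_datetime(df["mydate"], format="%Y-%m-%d %H:%M:%S", errors="coerce")

# Rows that had a value but could not be parsed with the expected format
bad_rows = df[parsed.isna() & df["mydate"].notna()]
print(bad_rows)
```

The rows listed in bad_rows are the ones to check against the data source before deciding how to fill in the missing date part.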
I'm trying to convert all data in a column from the below to dates.
Event Date
2020-07-16 00:00:00
31/03/2022, 26/11/2018, 31/01/2028
This is just a small section of the data - there are more columns/rows.
I've tried to split out the cells with multiple values using the below:
df["Event Date"] = df["Event Date"].str.replace(' ', '')
df["Event Date"] = df["Event Date"].str.split(",")
df= df.explode("Event Date")
The issue with this is it sets any cell without a ',' e.g. '2020-07-16 00:00:00' to NaN.
Is there any way to separate the values with a ',' and set the entire column to date types?
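You can use a combination of split and explode to separate the dates, and then use infer_datetime_format to convert the mixed date types (the column is named dates in this example). Note that infer_datetime_format is deprecated as of pandas 2.0, where strict format inference is the default.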
df = df.assign(dates=df['dates'].str.split(',')).explode('dates')
df
Out[18]:
dates
0 2020-07-16 00:00:00
1 31/03/2022
1 26/11/2018
1 31/01/2028
df.dates = pd.to_datetime(df.dates, infer_datetime_format=True)
df.dates
Out[20]:
0 2020-07-16
1 2022-03-31
1 2018-11-26
1 2028-01-31
Name: dates, dtype: datetime64[ns]
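If format inference is not available or not trusted, a more explicit variant is to parse each known format separately and combine the results. This is a sketch under the assumption that only the two formats shown in the question occur; the names parts, iso, dmy and dates are illustrative.

```python
import pandas as pd

# Sample data rebuilt from the question
df = pd.DataFrame({"Event Date": ["2020-07-16 00:00:00",
                                  "31/03/2022, 26/11/2018, 31/01/2028"]})

# One date string per row, without surrounding whitespace;
# the index is reset so the two parses below line up row by row
parts = (
    df["Event Date"]
    .str.split(",")
    .explode()
    .str.strip()
    .reset_index(drop=True)
)

# Try each expected format; values that do not match become NaT
iso = pd.to_datetime(parts, format="%Y-%m-%d %H:%M:%S", errors="coerce")
dmy = pd.to_datetime(parts, format="%d/%m/%Y", errors="coerce")

# Take the ISO parse where it worked, otherwise the day/month/year parse
dates = iso.fillna(dmy)
print(dates)
```

Any value that matches neither format stays NaT, which makes unexpected input easy to spot.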
Here is a suggestion using pandas.Series.str.split and pandas.Series.explode:
s_dates = (
df["Event Date"]
.str.split(",")
.explode(ignore_index=True)
.apply(pd.to_datetime, dayfirst=True)
)
Output :
0 2020-07-16
1 2022-03-31
2 2018-11-26
3 2028-01-31
Name: Event Date, dtype: datetime64[ns]
Your example table shows mixed date formats in each row. The idea is to try a date parsing technique and then try another if it fails. Using loops and having such wide variations of data types are red flags with a script design. I recommend using datetime and dateutil to handle the dates.
from datetime import datetime
from dateutil import parser

# Get these from your table.
date_strings = ["2020-07-16 00:00:00", "31/03/2022, 26/11/2018, 31/01/2028"]
parsed_dates = []
for date_string in date_strings:
    try:
        # First try the strict format with strptime
        date_object = datetime.strptime(date_string, "%Y-%m-%d %H:%M:%S")
        parsed_dates.append(date_object)
    except ValueError:
        # Otherwise split on commas and let dateutil parse each piece
        # (iterating directly avoids overwriting the outer date_strings list)
        for date_str in date_string.split(","):
            date_object = parser.parse(date_str.strip(), dayfirst=True)
            parsed_dates.append(date_object)
print(parsed_dates)
Try the code on Trinket: https://trinket.io/python3/95c0d14271
I have the following data (I purposely created a DateTime column from the string column of dates because that's how I am receiving the data):
import numpy as np
import pandas as pd
data = pd.DataFrame({"String_Date" : ['10/12/2021', '9/21/2021', '2/12/2010', '3/25/2009']})
#Create DateTime columns
data['Date'] = pd.to_datetime(data["String_Date"])
data
String_Date Date
0 10/12/2021 2021-10-12
1 9/21/2021 2021-09-21
2 2/12/2010 2010-02-12
3 3/25/2009 2009-03-25
I want to add the following "Month & Year Date" column with entries that are comparable (i.e. can determine whether Oct-12 < Sept-21):
String_Date Date Month & Year Date
0 10/12/2021 2021-10-12 Oct-12
1 9/21/2021 2021-09-21 Sept-21
2 2/12/2010 2010-02-12 Feb-12
3 3/25/2009 2009-03-25 Mar-25
The "Month & Year Date" column doesn't have to be in the exact format I show above (although bonus points if it does), just as long as it shows both the month (abbreviated name, full name, or month number) and the year in the same column. Most importantly, I want to be able to groupby the entries in the "Month & Year Date" column so that I can aggregate data in my original data set across every month.
You can do:
data["Month & Year Date"] = (
data["Date"].dt.month_name() + "-" + data["Date"].dt.year.astype(str)
)
print(data)
Prints:
String_Date Date Month & Year Date
0 10/12/2021 2021-10-12 October-2021
1 9/21/2021 2021-09-21 September-2021
2 2/12/2010 2010-02-12 February-2010
3 3/25/2009 2009-03-25 March-2009
But if you want to group by month/year it's preferable to use:
data.groupby([data["Date"].dt.month, data["Date"].dt.year])
data['Month & Year Date'] = data['Date'].dt.strftime('%b') + '-' + data['Date'].dt.strftime('%y')
print(data)
Outputs:
String_Date Date Month & Year Date
0 10/12/2021 2021-10-12 Oct-21
1 9/21/2021 2021-09-21 Sep-21
2 2/12/2010 2010-02-12 Feb-10
3 3/25/2009 2009-03-25 Mar-09
You can use the .dt accessor to format your date field however you like. For your example, it'd look like this:
data['Month & Year Date'] = data['Date'].dt.strftime('%b-%y')
Although honestly I don't think that's the best representation for the purpose of sorting or evaluating greater than or less than. If what you want is essentially a truncated date, you could do this instead:
As a string -
data['Month & Year Date'] = data['Date'].dt.strftime('%Y-%m-01')
As a datetime object -
data['Month & Year Date'] = data['Date'].dt.to_period('M').dt.to_timestamp()
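The truncated-date idea can also be expressed with monthly periods, which compare and sort chronologically and work as groupby keys. A small sketch follows; the Value column and the fourth date are made up so that there is something to aggregate.

```python
import pandas as pd

# Hypothetical data: the "Value" column is invented for this example
data = pd.DataFrame({
    "String_Date": ["10/12/2021", "9/21/2021", "2/12/2010", "10/30/2021"],
    "Value": [1, 2, 3, 4],
})
data["Date"] = pd.to_datetime(data["String_Date"], format="%m/%d/%Y")

# Monthly periods keep year and month together and are comparable
data["Month"] = data["Date"].dt.to_period("M")

# Aggregate across every month; the result is sorted chronologically
monthly = data.groupby("Month")["Value"].sum()
print(monthly)
```

If a display label such as Oct-21 is still needed, it can be derived afterwards from the period or from the Date column with strftime('%b-%y').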
You can use strftime. You can find the formats here
data['Month Day'] = data['Date'].apply(lambda x:x.strftime('%b-%d'))
Date
-1.476329
-2.754683
-0.763295
-3.113292
-1.353446
When I try to convert these negative float values into dd-mm-yyyy, I get a year of 1969 or so, with almost the same date in every row. But the years should be close to 2018-2020.
Computers store time as an offset from 01 Jan 1970. Since you didn't give any insight into your algorithm, I can only guess that the conversion treats your float values as offsets from this default epoch.
Maybe the question "Datetime defaulting to 1970 in pandas" will help?
As your dates should have years close to 2018-2020, your Date column probably contains a number of years relative to now (or to another base date). If so, you can do the following:
Find out what base date the dates are relative to. For demo purpose, I set it to today's date:
base_date = pd.to_datetime('now').normalize()
Then derive the calendar dates from your Date column by multiplying it by a one-year duration, np.timedelta64(1, 'Y'), and adding the base date:
import numpy as np
df['Date_Derived'] = base_date + df['Date'] * np.timedelta64(1, 'Y')
Result:
print(df)
Date Date_Derived
0 -1.476329 2020-01-05 18:45:56.610792
1 -2.754683 2018-09-25 20:56:40.793784
2 -0.763295 2020-09-22 05:05:36.323160
3 -3.113292 2018-05-17 21:26:33.794016
4 -1.353446 2020-02-19 15:56:09.543408
You can further truncate the time values by:
df['Date_Derived'] = df['Date_Derived'].dt.normalize()
Result:
print(df)
Date Date_Derived
0 -1.476329 2020-01-05
1 -2.754683 2018-09-25
2 -0.763295 2020-09-22
3 -3.113292 2018-05-17
4 -1.353446 2020-02-19
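Recent pandas versions may reject the ambiguous 'Y' timedelta unit, so here is a sketch of the same idea with an explicit year length in days. It assumes a mean year of 365.2425 days (to my knowledge the same average NumPy uses for 'Y') and a fixed base date of 2021-06-28, chosen only to make the example reproducible.

```python
import pandas as pd

df = pd.DataFrame({"Date": [-1.476329, -2.754683, -0.763295, -3.113292, -1.353446]})

# Assumption: a fixed base date; replace it with the real reference date
base_date = pd.Timestamp("2021-06-28")

# Assumption: mean Gregorian year length in days
DAYS_PER_YEAR = 365.2425

# Convert fractional years to timedeltas in days, add them to the base date,
# then drop the time-of-day part
offsets = pd.to_timedelta(df["Date"] * DAYS_PER_YEAR, unit="D")
df["Date_Derived"] = (base_date + offsets).dt.normalize()
print(df)
```

The approach stands or falls with the base date, so it is worth confirming what the floats are measured relative to before relying on the derived dates.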
I am trying to convert a day/month/Year Hours:Minutes column into just day and month. When I run my code, the conversion switches the months into days and the days into months.
You can find a copy of my dataframe with the one column I want to switch to Day/Month here
https://file.io/JkWl7fsBN0vl
Below is the code I am using to convert:
df =pd.read_csv('Example.csv')
df['DateTime'] = pd.to_datetime(df['DateTime'])
df.to_csv("output.csv", index=False)
Without knowing the exact DateTime format you are using (the link to the dataframe is broken), I'm going to use an example of
day/month/Year Hours:Minutes
05/09/2014 12:30
You can determine the exact format date code using this site
Essentially, to_datetime() has a format argument where you can pass in the specific format when it cannot be inferred reliably. This lets you state explicitly which field is the day and which is the month, so they are no longer swapped.
>>> df = pd.DataFrame(['05/09/2014 12:30'],columns=['DateTime'])
DateTime
0 05/09/2014 12:30
>>> df['DateTime'] = pd.to_datetime(df['DateTime'], format='%d/%m/%Y %H:%M')
DateTime
0 2014-09-05 12:30:00
>>> df['day'] = df['DateTime'].dt.day
>>> df['month'] = df['DateTime'].dt.month
DateTime day month
0 2014-09-05 12:30:00 5 9
>>> df['DD/MM'] = df['DateTime'].dt.strftime('%d/%m')
DateTime day month DD/MM
0 2014-09-05 12:30:00 5 9 05/09
I'm unsure about the exact format you want the day and month available in (separate columns, combined), but I provided a few examples, so you can remove the DateTime column when you're done with it and use the one you need.
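To make the day/month swap visible, the sketch below parses the same ambiguous strings with and without an explicit format. The second sample value and the variable names are invented for illustration.

```python
import pandas as pd

# Both values are ambiguous: day and month are each 12 or less
raw = pd.Series(["05/09/2014 12:30", "06/09/2014 08:15"])

# Without a format, pandas reads the first field as the month
default = pd.to_datetime(raw)

# With an explicit format, the first field is read as the day
explicit = pd.to_datetime(raw, format="%d/%m/%Y %H:%M")

print(default.dt.month.tolist())   # months under the default parse
print(explicit.dt.month.tolist())  # months under the explicit format
print(explicit.dt.strftime("%d/%m").tolist())
```

Formatting with strftime('%d/%m') before calling to_csv keeps only the day and month in the written file.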
I have a dataframe with two columns; Sales and Date.
dataset.head(10)
Date Sales
0 2015-01-02 34988.0
1 2015-01-03 32809.0
2 2015-01-05 9802.0
3 2015-01-06 15124.0
4 2015-01-07 13553.0
5 2015-01-08 14574.0
6 2015-01-09 20836.0
7 2015-01-10 28825.0
8 2015-01-12 6938.0
9 2015-01-13 11790.0
I want to convert the Date column from yyyy-mm-dd (e.g. 2015-06-01) to yyyy-ww (e.g. 2015-23), so I run the following piece of code:
dataset["Date"] = pd.to_datetime(dataset["Date"]).dt.strftime('%Y-%V')
Then I group by my Sales based on weeks, i.e.
data = dataset.groupby(['Date'])["Sales"].sum().reset_index()
data.head(10)
Date Sales
0 2015-01 67797.0
1 2015-02 102714.0
2 2015-03 107011.0
3 2015-04 121480.0
4 2015-05 148098.0
5 2015-06 132152.0
6 2015-07 133914.0
7 2015-08 136160.0
8 2015-09 185471.0
9 2015-10 190793.0
Now I want to create a date range based on the Date column, since I'm predicting sales based on weeks:
ds = data.Date.values
ds_pred = pd.date_range(start=ds.min(), periods=len(ds) + num_pred_weeks,
freq="W")
However, I'm getting the following error: could not convert string to Timestamp, which I'm not sure how to fix. If I use 2015-01-01 as the starting date of my date range I get no error, which makes me realize that I'm using the functions wrong. However, I'm not sure how.
I would like to basically have a date range that spans weekly from the current week and then 52 weeks into the future.
I think the problem is that you take the minimum of the dataset["Date"] column, which is filled with strings in the format YYYY-VV. But date_range needs a string in the format YYYY-MM-DD or a datetime object.
I found this:
Several additional directives not required by the C89 standard are included for convenience. These parameters all correspond to ISO 8601 date values. These may not be available on all platforms when used with the strftime() method. The ISO 8601 year and ISO 8601 week directives are not interchangeable with the year and week number directives above. Calling strptime() with incomplete or ambiguous ISO 8601 directives will raise a ValueError.
%V ISO 8601 week as a decimal number with Monday as the first day of the week. Week 01 is the week containing Jan 4.
Pandas 0.24.2 bug with YYYY-VV format:
dataset = pd.DataFrame({'Date':['2015-06-01','2015-06-02']})
dataset["Date"] = pd.to_datetime(dataset["Date"]).dt.strftime('%Y-%V')
print (dataset)
Date
0 2015-23
1 2015-23
ds = pd.to_datetime(dataset['Date'], format='%Y-%V')
print (ds)
ValueError: 'V' is a bad directive in format '%Y-%V'
A possible solution is to use %U or %W instead:
%U Week number of the year (Sunday as the first day of the week) as a zero padded decimal number. All days in a new year preceding the first Sunday are considered to be in week 0.
%W Week number of the year (Monday as the first day of the week) as a decimal number. All days in a new year preceding the first Monday are considered to be in week 0.
dataset = pd.DataFrame({'Date':['2015-06-01','2015-06-02']})
dataset["Date"] = pd.to_datetime(dataset["Date"]).dt.strftime('%Y-%U')
print (dataset)
Date
0 2015-22
1 2015-22
ds = pd.to_datetime(dataset['Date'] + '-1', format='%Y-%U-%w')
print (ds)
0 2015-06-01
1 2015-06-01
Name: Date, dtype: datetime64[ns]
Or use the datetimes from the original DataFrame:
dataset = pd.DataFrame({'Date':['2015-06-01','2015-06-02'],
'Sales':[10,20]})
dataset["Date"] = pd.to_datetime(dataset["Date"])
print (dataset)
Date Sales
0 2015-06-01 10
1 2015-06-02 20
data = dataset.groupby(dataset['Date'].dt.strftime('%Y-%V'))["Sales"].sum().reset_index()
print (data)
Date Sales
0 2015-23 30
num_pred_weeks = 5
ds = data.Date.values
ds_pred = pd.date_range(start=dataset["Date"].min(), periods=len(ds) + num_pred_weeks, freq="W")
print (ds_pred)
DatetimeIndex(['2015-06-07', '2015-06-14', '2015-06-21',
'2015-06-28',
'2015-07-05', '2015-07-12'],
dtype='datetime64[ns]', freq='W-SUN')
If ds contains dates as strings formatted like '2015-01', i.e. '%Y-%W' (or '%G-%V' in the datetime library), you have to add a day number to obtain a full date. Here, assuming that you want the Monday, you should do:
ds_pred = pd.date_range(start=pd.to_datetime(ds.min() + '-1', format='%Y-%W-%w'),
                        periods=len(ds) + num_pred_weeks, freq="W")
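For week labels that were produced with the ISO directive %V, the standard library's datetime.strptime can parse them back when combined with %G (ISO year) and %u (ISO weekday), available from Python 3.6. The sketch below uses made-up week labels and assumes that the %Y year in the label equals the ISO year, which can fail for dates around New Year.

```python
from datetime import datetime

import pandas as pd

# Hypothetical week labels, as produced by strftime('%Y-%V')
weeks = ["2015-23", "2015-24"]
num_pred_weeks = 2

# Append weekday 1 (Monday) so each label identifies a single day
# %G = ISO year, %V = ISO week number, %u = ISO weekday
mondays = [datetime.strptime(w + "-1", "%G-%V-%u") for w in weeks]

# Weekly range anchored on Mondays, starting at the earliest week
ds_pred = pd.date_range(
    start=min(mondays),
    periods=len(weeks) + num_pred_weeks,
    freq="W-MON",
)
print(ds_pred)
```

Using freq="W-MON" keeps the generated dates on the same weekday as the parsed start; plain "W" is anchored on Sundays, as the DatetimeIndex output earlier in this thread shows.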