I am looking to add a new column - "date" to my Pandas dataframe. Below are the first 5 rows of my dataframe:
[Image: first 5 rows of the dataframe]
As seen from the image, the first column is year, second month, and third day. Below is what I have tried to do:
df['Year'] = pd.to_datetime(df[['Year','Month','Day']])
But, I keep getting the error as below:
ValueError: cannot assemble the datetimes: time data '610101' does not match format '%Y%m%d' (match)
It would be great if I can get any help for the same.
Following up on my comment, I was able to reproduce the error and solve it by adding 1900 to the year:
df = pd.DataFrame({"year": [61,99], "month": [1, 2], "day": [3, 12]})
df["year"] = df["year"] + 1900
df['full_date'] = pd.to_datetime(df[['year','month','day']])
Output:
   year  month  day  full_date
0  1961      1    3 1961-01-03
1  1999      2   12 1999-02-12
There is a format parameter to the to_datetime method (see the pandas documentation), but for some reason I wasn't able to make it work:
df['full_date'] = pd.to_datetime(df[['year','month','day']], format="%y%m%d", infer_datetime_format=False)
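This still throws the same error. Although I am using %y, which should match a 2-digit year, the error message still says the data does not match the format '%Y%m%d', which suggests the format argument is not applied when the date is assembled from separate columns.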
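For what it's worth, here is a minimal sketch of a slightly more general version of the "+1900" fix, assuming every value in the year column is a two-digit year. The helper name assemble_two_digit_dates and the pivot of 30 are my own assumptions, not something from the question; pick whatever cutoff fits your data.

```python
import numpy as np
import pandas as pd


def assemble_two_digit_dates(frame, pivot=30):
    """Hypothetical helper: expand 2-digit years, then assemble datetimes.

    Assumption: years below `pivot` belong to 20xx, all others to 19xx.
    """
    parts = frame[["year", "month", "day"]].copy()
    # Add the century before handing the columns to to_datetime
    parts["year"] = parts["year"] + np.where(parts["year"] < pivot, 2000, 1900)
    return pd.to_datetime(parts)


df = pd.DataFrame({"year": [61, 99, 5], "month": [1, 2, 7], "day": [3, 12, 30]})
df["full_date"] = assemble_two_digit_dates(df)
print(df)
```

The original year column is left untouched here because the helper works on a copy, which may or may not be what you want.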
Try this, which builds a "year month day" string for each row:
df.apply(lambda x:'%s %s %s' % (x['year'],x['month'], x['day']),axis=1)
What about printing what your selection actually contains first?
print(df[['Year','Month','Day']])
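If the data is indeed "610101", the year only has two digits, so you would likely need to expand it to a four-digit year first (assuming numeric columns and 19xx dates):
pd.to_datetime(df[['Year','Month','Day']].assign(Year=df['Year'] + 1900))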
I have a time column with the format XXXHHMMSS where XXX is the Day of Year. I also have a year column. I want to merge both these columns into one date time object.
Before I had detached XXX into a new column but this was making it more complicated.
I've converted the two columns to strings
points['UTC_TIME'] = points['UTC_TIME'].astype(str)
points['YEAR_'] = points['YEAR_'].astype(str)
Then I have the following line:
points['Time'] = pd.to_datetime(points['YEAR_'] * 1000 + points['UTC_TIME'], format='%Y%j%H%M%S')
I'm getting the following error: ValueError: time data '137084552' does not match format '%Y%j%H%M%S' (match)
Here is a photo of my columns and a link to the data
This works fine for me if you combine both columns as strings, for example:
import pandas as pd
df = pd.DataFrame({'YEAR_': [2002, 2002, 2002],
                   'UTC_TIME': [99082552, 135082552, 146221012]})
pd.to_datetime(df['YEAR_'].astype(str) + df['UTC_TIME'].astype(str).str.zfill(9),
               format="%Y%j%H%M%S")
# 0 2002-04-09 08:25:52
# 1 2002-05-15 08:25:52
# 2 2002-05-26 22:10:12
# dtype: datetime64[ns]
Note, since %j expects zero-padded day of year, you might need to zero-fill, see first row in the example above.
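If you would rather avoid string padding altogether, the same result can be reached with integer arithmetic. This is only a sketch, and it assumes UTC_TIME is still stored as an integer in DDDHHMMSS form (i.e. before any astype(str) conversion); the intermediate names are mine.

```python
import pandas as pd

# Same sample values as above: YEAR_ plus UTC_TIME encoded as DDDHHMMSS
df = pd.DataFrame({"YEAR_": [2002, 2002, 2002],
                   "UTC_TIME": [99082552, 135082552, 146221012]})

t = df["UTC_TIME"]
day_of_year = t // 1_000_000  # DDD
seconds = (t // 10_000 % 100) * 3600 + (t // 100 % 100) * 60 + t % 100  # HHMMSS

# Start at 1 January of each year, then add the day and time-of-day offsets
start = pd.to_datetime(df["YEAR_"].astype(str), format="%Y")
df["Time"] = (start
              + pd.to_timedelta(day_of_year - 1, unit="D")
              + pd.to_timedelta(seconds, unit="s"))
print(df["Time"])
```

Because nothing is parsed from a string, a lost leading zero in the day of year cannot shift the fields.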
I have a column of dates in the following format:
Jan-85
Apr-99
Nov-01
Feb-65
Apr-57
Dec-19
I want to convert this to a pandas datetime object.
The following syntax works to convert them:
pd.to_datetime(temp, format='%b-%y')
where temp is the pd.Series object of dates. The glaring issue here, of course, is that dates prior to 1969 are being wrongly converted to 20xx, because %y maps the two-digit years 00-68 to 2000-2068 and 69-99 to 1969-1999.
I tried updating the function call with the following parameter:
pd.to_datetime(temp, format='%b-%y', origin='1950-01-01')
However, I am getting the error:
Name: temp, Length: 42537, dtype: object' is not compatible with origin='1950-01-01'; it must be numeric with a unit specified
I tried specifying a unit as it said, but I got a different error citing that the unit cannot be specified alongside a format.
Any ideas how to fix this?
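To make the behaviour concrete, here is a small sketch of where the %y cutoff falls. The sample values are made up for illustration: two-digit years up to 68 land in the 2000s, and 69 onwards land in the 1900s.

```python
import pandas as pd

# Made-up sample values on both sides of the %y pivot
sample = pd.Series(["Jan-68", "Jan-69", "Apr-57", "Jan-85"])
parsed = pd.to_datetime(sample, format="%b-%y")

years = parsed.dt.year.tolist()
print(years)  # [2068, 1969, 2057, 1985]
```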
Just @DudeWah's logic, but improving upon the code:
import pandas as pd
def days_of_future_past(date, chk_y=pd.Timestamp.today().year):
    # a year beyond the current one means %y picked the wrong century
    return date.replace(year=date.year - 100) if date.year > chk_y else date
temp = pd.to_datetime(temp, format='%b-%y').map(days_of_future_past)
Output:
>>> temp
0 1985-01-01
1 1999-04-01
2 2001-11-01
3 1965-02-01
4 1957-04-01
5 2019-12-01
6 1965-05-01
Name: date, dtype: datetime64[ns]
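The same idea can also be written without a per-element function. The following is a sketch rather than a drop-in replacement: it assumes that no date in the column is legitimately in the future, and the variable names (parsed, cutoff, fixed) are mine.

```python
import pandas as pd

temp = pd.Series(["Jan-85", "Feb-65", "Dec-19"])
parsed = pd.to_datetime(temp, format="%b-%y")

# Anything later than today is assumed to have been put in the wrong century
cutoff = pd.Timestamp.today().normalize()
fixed = parsed.where(parsed <= cutoff, parsed - pd.DateOffset(years=100))
print(fixed)
```

Series.where keeps the values where the condition holds and takes the shifted values everywhere else, so only the future dates move back by 100 years.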
Gonna go ahead and answer my own question so others can use this solution if they come across this same issue. Not the greatest, but it gets the job done. It should work until 2069, so hopefully pandas will have a better solution to this by then lol
Perhaps someone else will post a better solution.
def wrong_date_preprocess(data):
    """Correct pre-1969 dates that the mon-yy format placed in the 2000s."""
    df1 = data.copy()
    dates = df1['date_column_of_interest']
    # parse the mon-yy strings, e.g. Jan-91
    dates = pd.to_datetime(dates, format='%b-%y')
    # dates that landed in the future were parsed into the wrong century; get their indices
    date_dummy = dates[dates > pd.Timestamp.today().floor('D')]
    idx = list(date_dummy.index)
    # fix the wrong dates by shifting them back 100 years
    dummy2 = date_dummy.apply(lambda x: x.replace(year=x.year - 100)).to_list()
    dates.loc[idx] = dummy2
    df1['date_column_of_interest'] = dates
    return df1
Trying to convert the date (type=datetime) of a complete column into a date to use in a condition later on. The following error keeps showing up:
KeyError: Timestamp('2010-05-04 10:15:55')
Tried multiple things but I'm currently stuck with the code below.
for d in df.column:
    pd.to_datetime(df.column[d]).apply(lambda x: x.date())
Also, how do I format the column so I can use it in a statement as follows:
df = df[df.column > 2015-05-28]
Just adding an answer in case anyone else ends up here :
Firstly, let's create a dataframe with some dates, change the dtype to string, and convert it back. With errors='ignore', to_datetime does not raise on values it cannot parse: if the column contained John Smith in row x, the input would be returned unchanged, as strings. In the same vein, errors='coerce' would turn John Smith into NaT (not a time). Note that errors='ignore' is deprecated in recent pandas releases (2.2 and later, as far as I know), so errors='coerce' is the safer choice there.
# Create date range with frequency of a day
rng = pd.date_range(start='01/01/18', end ='01/01/19',freq='D')
#pass this into a dataframe
df = pd.DataFrame({'Date' : rng})
print(df.dtypes)
Date datetime64[ns]
# okay, let's cast this into a str so we can convert it back
df['Date'] = df['Date'].astype(str)
print(df.dtypes)
Date object
# now lets convert it back #
df['Date'] = pd.to_datetime(df.Date,errors='ignore')
print(df.dtypes)
Date datetime64[ns]
# Okay, let's slice the data frame for your desired date
print(df.loc[df.Date > '2018-12-29'])
Date
363 2018-12-30
364 2018-12-31
365 2019-01-01
The answer as provided by @Datanovice:
df['your column'] = pd.to_datetime(df['your column'], errors='ignore')
Then inspect the dtype; it should be datetime64[ns]. If so, just do:
df.loc[df['your column'] > 'your-date']
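Putting the pieces together for the original question, a minimal sketch could look like the one below. The sample rows are invented, the column name "column" is taken from the question, and errors='coerce' is my choice so that unparseable values become NaT instead of leaving the column as strings.

```python
import pandas as pd

# Invented sample data; the last row is deliberately not a date
df = pd.DataFrame({"column": ["2010-05-04 10:15:55",
                              "2016-01-02 08:00:00",
                              "not a date"]})

# Convert the whole column at once; no loop is needed
df["column"] = pd.to_datetime(df["column"], errors="coerce")

# Compare against a Timestamp (a quoted date string also works)
recent = df[df["column"] > pd.Timestamp("2015-05-28")]

# Plain datetime.date objects, if only the date part is wanted
dates_only = df["column"].dt.date
print(recent)
```

Writing 2015-05-28 without quotes, as in the question, is integer arithmetic in Python (2015 minus 5 minus 28), which is why the comparison has to use a string or a Timestamp.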
I have a csv with a date column with dates listed as MM/DD/YY but I want to change the years from 00,02,03 to 1900, 1902, 1903 so that they are instead listed as MM/DD/YYYY
This is what works for me:
df2['Date'] = df2['Date'].str.replace(r'00', '1900')
but I'd have to do this for every year up until 68 (aka repeat this 68 times). I'm not sure how to create a loop to do the code above for every year in that range. I tried this:
ogyear=00
newyear=1900
while ogyear <= 68:
    df2['date']=df2['Date'].str.replace(r'ogyear','newyear')
    ogyear += 1
    newyear += 1
but this returns an empty data set. Is there another way to do this?
I can't use datetime because it assumes that 02 refers to 2002 instead of 1902, and when I try to edit that as a date I get an error message from Python saying that dates are immutable and that they must be changed in the original data set. For this reason I need to keep the dates as strings. I also attached the csv here in case that's helpful.
I would do it like this:
# create a data frame
d = pd.DataFrame({'date': ['20/01/00','20/01/20','20/01/50']})
# create year column
d['year'] = d['date'].str.split('/').str[2].astype(int) + 1900
# add new year into old date by replacing old year
d['new_data'] = d['date'].str.replace(r'\d+$', '', regex=True) + d['year'].astype(str)
       date  year    new_data
0  20/01/00  1900  20/01/1900
1  20/01/20  1920  20/01/1920
2  20/01/50  1950  20/01/1950
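If the dates have to stay as strings, a single regular-expression replace may be enough and avoids the loop entirely. This is a sketch under the assumption that every value ends in a two-digit year that belongs to the 1900s; the frame and sample values are invented.

```python
import pandas as pd

# Invented MM/DD/YY sample values
df2 = pd.DataFrame({"Date": ["01/20/00", "07/04/02", "12/31/68"]})

# Capture the trailing two digits and put "19" in front of them
df2["Date"] = df2["Date"].str.replace(r"(\d{2})$", r"19\g<1>", regex=True)
print(df2)
```

The $ anchor matters: it restricts the replacement to the year at the end of the string, so a month or day such as 02 is left alone.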
I'd do it the following way:
from datetime import datetime
# create a data frame with dates in format month/day/shortened year
d = pd.DataFrame({'dates': ['2/01/10','5/01/20','6/01/30']})
#loop through the dates in the dates column and add them
#to list in desired form using datetime library,
#then substitute the dataframe dates column with the new ordered list
new_dates = []
for date in list(d['dates']):
    # note: %y maps 00-68 to 2000-2068 and 69-99 to 1969-1999
    dat = datetime.date(datetime.strptime(date, '%m/%d/%y'))
    dat = dat.strftime("%m/%d/%Y")
    new_dates.append(dat)
new_dates
d['dates'] = pd.Series(new_dates)
d
I have a data sheet in which issue_d is a date column having values stored in a format - 11-Dec. On clicking any cell of the column, date is coming as 12/11/2018.
But while reading the csv file, issue_d is getting imported as 11-Dec. Year is not getting imported.
How do I get the issue_d column in format- d/m/y?
Code i tried -
import pandas
data=pandas.read_csv('Project_data.csv')
print(data)
checking issue_d column: data['issue_d']
result :
0 11-Dec
1 11-Dec
2 11-Dec
expected:
0 11-Dec-2018
1 11-Dec-2018
2 11-Dec-201
You can use to_datetime after appending the year to the column:
df['issue_d'] = pd.to_datetime(df['issue_d'] + '-2018')
print (df)
issue_d
0 2018-12-11
1 2018-12-11
2 2018-12-11
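To get the d/m/y text asked for in the question, the parsed column can be formatted back to strings. A small sketch, assuming (as above) that the missing year is 2018 and that "11-Dec" means day 11 of December; passing an explicit format is my addition so that pandas does not have to guess.

```python
import pandas as pd

df = pd.DataFrame({"issue_d": ["11-Dec", "11-Dec", "11-Dec"]})

# Append the assumed year, parse with an explicit format, then format as d/m/y
parsed = pd.to_datetime(df["issue_d"] + "-2018", format="%d-%b-%Y")
df["issue_d"] = parsed.dt.strftime("%d/%m/%Y")
print(df)
```

After strftime the column holds strings again, so keep the parsed version around if you still need to sort or filter by date.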
A more 'controllable' way of getting the data is to first get the datetime from the data frame as normal, and then convert it:
dt = dt.strftime('%Y-%m-%d')
In this case, you'd put %d in front, e.g. '%d/%m/%Y'. strftime is useful because it gives you the most control over how a datetime value is formatted.
After you do this, you can split out the individual month, day, and year, and then use
strftime("%B")
to get the string-name of the month (e.g. "February").
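As a quick illustration with a made-up date (the variable names are mine), using the standard library's datetime:

```python
from datetime import datetime

dt = datetime(2018, 2, 11)  # made-up example date

formatted = dt.strftime("%d/%m/%Y")       # day first, as in d/m/y
month_name = dt.strftime("%B")            # full month name in the current locale
day, month, year = dt.day, dt.month, dt.year

print(formatted, month_name, day, month, year)
```

The individual parts are available as attributes, so there is usually no need to cut them out of the formatted string.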
Good Luck!