I am new to pandas and am trying to convert a column of strings holding dates in the format '%d %B' (01 January, 02 January, ...) to datetime objects. The type of the column is <class 'pandas.core.series.Series'>.
If I pass this Series to the to_datetime method, like
print(pd.to_datetime(data_file['Date'], format='%d %B', errors="coerce"))
it returns NaT for every entry, whereas it should return datetime objects.
I checked the documentation and it says that to_datetime accepts a Series object.
Is there any way to fix this?
Edit 1:
Here is the head of the data I am using:
Date Daily Confirmed
0 30 January 1
1 31 January 0
2 01 February 0
3 02 February 1
4 03 February 1
Edit 2: Here is the output of info() for the data:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 179 entries, 0 to 178
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 179 non-null object
1 Daily Confirmed 179 non-null int64
dtypes: int64(1), object(1)
memory usage: 2.2+ KB
If I understand correctly, you may be facing this issue because there are spaces around the dates in this column. To solve it, use strip before to_datetime. Here's a piece of code that does that:
df = pd.DataFrame({'Date':
['30 January ', '31 January ', ' 01 February ', '02 February',
'03 February'], 'Daily Confirmed': [1, 0, 0, 1, 1]})
pd.to_datetime(df.Date.str.strip(), format = "%d %B")
The output is:
0 1900-01-30
1 1900-01-31
2 1900-02-01
...
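To check whether stray whitespace really is the culprit before parsing, a quick inspection along these lines may help. This is only a sketch: the sample values are made up for illustration, and the column name 'Date' is taken from the question.

```python
import pandas as pd

# Sample data with deliberate padding (hypothetical values)
df = pd.DataFrame({"Date": ["30 January ", " 01 February ", "02 February"]})

# True where stripping changes the string, i.e. where padding exists
has_padding = df["Date"] != df["Date"].str.strip()

# repr() makes the otherwise invisible spaces visible
print(df.loc[has_padding, "Date"].map(repr))
```

If no rows are flagged, the NaT values probably have a different cause, such as a locale mismatch in the month names.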
import pandas as pd
dic = {"Date": ["30 January", "31 January", "01 February", ] , "Daily Confirmed":[0,1,0]}
df =pd.DataFrame(dic)
df['date1'] = pd.to_datetime(df['Date'].astype(str), format='%d %B')
df
The parsed dates default to the year 1900 because the strings in your DataFrame do not include a year.
Output:
Date Daily Confirmed date1
0 30 January 0 1900-01-30
1 31 January 1 1900-01-31
2 01 February 0 1900-02-01
If you don't want the year in the date, add the code below (note that date2 is a column of strings, not datetimes):
df['date2']=df['date1'].dt.strftime('%d-%m')
df
Date Daily Confirmed date1 date2
0 30 January 0 1900-01-30 30-01
1 31 January 1 1900-01-31 31-01
2 01 February 0 1900-02-01 01-02
Thanks
You may try this:
from datetime import datetime
df['datetime'] = df['Date'].apply(lambda x: datetime.strptime(x, "%d %B"))
apply() lets you call a plain Python function on every element of a Series. You may have to specify the year here, otherwise the default year (1900) will be used.
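One way to supply the year is to append it to each string before parsing. The sketch below assumes every row belongs to the same year; ASSUMED_YEAR is a hypothetical value chosen for illustration, not something stated in the question.

```python
import pandas as pd

df = pd.DataFrame({"Date": ["30 January", "31 January", "01 February"]})

ASSUMED_YEAR = 2020  # hypothetical: replace with the year your data covers

# Strip padding, append the year, then parse with a format that includes %Y
with_year = df["Date"].str.strip() + f" {ASSUMED_YEAR}"
df["datetime"] = pd.to_datetime(with_year, format="%d %B %Y")
print(df)
```

Choosing the real year also matters for leap days: 1900 was not a leap year, so a value such as "29 February" may fail to parse when the year is left at its default.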
Good luck
The time in my CSV file is divided into 4 columns (year, Julian day, hour/minute (UTC) and second), and I want to combine them into a single column so that it looks like this: 14/11/2017 00:16:00.
Is there an easy way to do this?
A sample of the code is
cols = [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
D14 = pd.read_csv(r'C:\Users\William Jacondino\Desktop\DadosTimeSeries\PIRATA-PROFILE\Dados FLUXO\Dados_brutos_copy-20220804T151840Z-002\Dados_brutos_copy\tm_data_2017_11_14_0016.dat', header=None, usecols=cols, names=["Year","Julian day", "Hour/minut (UTC)", "Second", "Bateria (V)", "PTemp (°C)", "Latitude", "Longitude", "Magnectic_Variation (arb)", "Altitude (m)", "Course (º)", "WS", "Nmbr_of_Satellites (arb)", "RAD", "Tar", "UR", "slp",], sep=',')
D14 = D14.loc[:, ["Year","Julian day", "Hour/minut (UTC)", "Second", "Latitude", "Longitude","WS", "RAD", "Tar", "UR", "slp"]]
My array looks like that:
The file: csv file sample
In the "Hour/minut (UTC)" column, the first two digits are the hour and the last two digits are the minute.
The column starts at 016, which means hour 0 UTC and minute 16,
and goes up to hour 12 UTC and minute 03.
I want to unify everything into a single datetime column, from the beginning to the end of the array:
1 - 2017
1412 - 14/11/2017 12:03:30
but the column "Hour/minut (UTC)" from hour 0 to hour 9 only has one value like this:
9
instead of 09
How do I create the array with the correct datetime?
You can create a new column which also adds the data from other columns.
For example, if you have a dataframe like so:
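data = {"year": [2010, 2010, 2020], "month": ["jan", "feb", "mar"], "day": [1, 2, 3], "a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]}
df = pd.DataFrame(data)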
# Print df:
year month day a b c
0 2010 jan 1 1 4 7
1 2010 feb 2 2 5 8
2 2020 mar 3 3 6 9
You can add a new column to the DataFrame, with the values built from the year, month and day columns.
df['newColumn'] = df.year.astype(str) + '-' + df.month + '-' + df.day.astype(str)
Edit: In your situation, use df['Julian day'] instead of df.month, since the column name is different. Because that name contains a space, it has to be accessed with bracket notation rather than as an attribute.
The new column holds strings in whatever format you choose. You can substitute the dash '-' with a slash '/' or any other separator. You just need to convert the integer columns to strings with .astype(str).
Output:
year month day a b c newColumn
0 2010 jan 1 1 4 7 2010-jan-1
1 2010 feb 2 2 5 8 2010-feb-2
2 2020 mar 3 3 6 9 2020-mar-3
After that you can do anything as you would on a dataframe object.
If you only need it for data analysis, you can use the .groupby() method, which groups the rows by one or more columns so you can aggregate each group.
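As a rough illustration of the grouping approach, the sketch below sums column a per year on a cut-down version of the example frame. Which column to aggregate and which function to apply are assumptions made for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "year": [2010, 2010, 2020],
    "month": ["jan", "feb", "mar"],
    "day": [1, 2, 3],
    "a": [1, 2, 3],
})

# Group rows by year and add up column "a" within each group
totals = df.groupby("year")["a"].sum()
print(totals)
```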
If your dataframe looks like
import pandas as pd
df = pd.DataFrame({
"year": [2017, 2017], "julian day": [318, 318], "hour/minut(utc)": [16, 16],
"second": [0, 30],
})
year julian day hour/minut(utc) second
0 2017 318 16 0
1 2017 318 16 30
then you could use pd.to_datetime() and pd.to_timedelta() to do
df["datetime"] = (
pd.to_datetime(df["year"].astype("str"), format="%Y")
+ pd.to_timedelta(df["julian day"] - 1, unit="days")
+ pd.to_timedelta(df["hour/minut(utc)"], unit="minutes")
+ pd.to_timedelta(df["second"], unit="seconds")
).dt.strftime("%d/%m/%Y %H:%M:%S")
and get
year julian day hour/minut(utc) second datetime
0 2017 318 16 0 14/11/2017 00:16:00
1 2017 318 16 30 14/11/2017 00:16:30
The column datetime now contains strings. Remove the .dt.strftime("%d/%m/%Y %H:%M:%S") part at the end, if you want datetimes instead.
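If only the calendar date is needed, the year and the day of the year can also be parsed in one step with the %j format code. This is an alternative sketch, not part of the approach above; it assumes the day of the year is zero-padded to three digits so that the combined string is unambiguous.

```python
import pandas as pd

df = pd.DataFrame({"year": [2017, 2018], "julian day": [318, 10]})

# Build strings such as "2017318" and "2018010" (year + zero-padded day of year)
key = df["year"].astype(str) + df["julian day"].astype(str).str.zfill(3)

# %Y is the four-digit year, %j the day of the year (001-366)
df["date"] = pd.to_datetime(key, format="%Y%j")
print(df)
```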
Regarding your comment: If I understand correctly, you could try the following:
df["hours_min"] = df["hour/minut(utc)"].astype("str").str.zfill(4)
df["hour"] = df["hours_min"].str[:2].astype("int")
df["minute"] = df["hours_min"].str[2:].astype("int")
df = df.drop(columns=["hours_min", "hour/minut(utc)"])
df["datetime"] = (
pd.to_datetime(df["year"].astype("str"), format="%Y")
+ pd.to_timedelta(df["julian day"] - 1, unit="days")
+ pd.to_timedelta(df["hour"], unit="hours")
+ pd.to_timedelta(df["minute"], unit="minutes")
+ pd.to_timedelta(df["second"], unit="seconds")
).dt.strftime("%d/%m/%Y %H:%M:%S")
Result for the sample dataframe df
df = pd.DataFrame({
"year": [2017, 2017, 2018, 2019], "julian day": [318, 318, 10, 50],
"hour/minut(utc)": [16, 16, 234, 1201], "second": [0, 30, 1, 2],
})
year julian day hour/minut(utc) second
0 2017 318 16 0
1 2017 318 16 30
2 2018 10 234 1
3 2019 50 1201 2
would be
year julian day second hour minute datetime
0 2017 318 0 0 16 14/11/2017 00:16:00
1 2017 318 30 0 16 14/11/2017 00:16:30
2 2018 10 1 2 34 10/01/2018 02:34:01
3 2019 50 2 12 1 19/02/2019 12:01:02
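The hour and minute can also be separated arithmetically instead of through zero-padded strings: integer division by 100 gives the hour and the remainder gives the minute. The following sketch assumes the column holds integers in HHMM form, as in the sample above.

```python
import pandas as pd

df = pd.DataFrame({
    "year": [2017, 2019],
    "julian day": [318, 50],
    "hour/minut(utc)": [16, 1201],
    "second": [30, 2],
})

hhmm = df["hour/minut(utc)"]

df["datetime"] = (
    pd.to_datetime(df["year"].astype(str), format="%Y")   # 1 January of each year
    + pd.to_timedelta(df["julian day"] - 1, unit="days")  # day of year is 1-based
    + pd.to_timedelta(hhmm // 100, unit="hours")          # 1201 -> 12
    + pd.to_timedelta(hhmm % 100, unit="minutes")         # 1201 -> 1
    + pd.to_timedelta(df["second"], unit="seconds")
)
print(df["datetime"])
```

This leaves the column as datetimes; append .dt.strftime("%d/%m/%Y %H:%M:%S") if formatted strings are needed.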
So I have a data frame in python. I want to make a new column that has solely the year from the column found here.
The column is not in datetime format or anything due to the country listed at the tail end and I've tried using split() like so:
df['new_column'] = df['column_name'].astype(str).split(",", 3)[2]
but apparently, that doesn't work on objects.
Again, the columns are listed like so:
October 1, 2020 (United States)
April 27, 2019 (Cameroon)
but are type object and not string.
It is primarily the differing lengths in the countries at the end that has kept me from pulling from index like so:
df['new_column'] = df['column_name'].astype(str).str[x:x]
Thank You!
You can convert a column to datetime with pandas.to_datetime(). You can either:
pass the format as a strftime-style string (see the pandas.to_datetime documentation), or,
if you do not know the format, use infer_datetime_format=True. Be careful with this parameter, because it may interpret the day and month in the wrong order. It is also deprecated as of pandas 2.0, where format inference is the default behaviour.
After that year can be extracted as follows:
# Create sample df:
df = pd.DataFrame({
'id': [1, 2],
'date': [
'April 27, 2019 (Cameroon)',
'October 1, 2020 (United States)'
]
})
# Remove country names
df['new_date'] = df['date'].apply(lambda x: str(x).split(' (')[0])
print(df)
Output:
id date new_date
0 1 April 27, 2019 (Cameroon) April 27, 2019
1 2 October 1, 2020 (United States) October 1, 2020
Then new_date can be converted to datetime:
df['new_date'] = pd.to_datetime(df['new_date'], infer_datetime_format=True)
df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 2 non-null int64
1 date 2 non-null object
2 new_date 2 non-null datetime64[ns]
Then we can extract year from new_date:
df['year'] = df['new_date'].apply(lambda x: x.year)
Here is the final df:
id date new_date year
0 1 April 27, 2019 (Cameroon) 2019-04-27 2019
1 2 October 1, 2020 (United States) 2020-10-01 2020
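If the year is all you need, it can also be pulled straight out of the text with a regular expression, skipping the datetime conversion. This sketch assumes each value contains exactly one four-digit number, which holds for the sample rows but may not hold for every dataset.

```python
import pandas as pd

df = pd.DataFrame({
    "date": [
        "April 27, 2019 (Cameroon)",
        "October 1, 2020 (United States)",
    ]
})

# Capture the first run of four digits in each string and convert it to int
df["year"] = df["date"].str.extract(r"(\d{4})", expand=False).astype(int)
print(df)
```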
I would like to create two columns "Year" and "Month" from a Date column that contains different year and month arrangements. Some are YY-Mmm and the others are Mmm-YY.
import pandas as pd
dataSet = {
"Date": ["18-Jan", "18-Jan", "18-Feb", "18-Feb", "Oct-17", "Oct-17"],
"Quantity": [3476, 20, 789, 409, 81, 640],
}
df = pd.DataFrame(dataSet, columns=["Date", "Quantity"])
My attempt is as follows:
Date1 = []
Date2 = []
for dt in df.Date:
    Date1.append(dt.split("-")[0])
    Date2.append(dt.split("-")[1])
Year = []
try:
    for yr in Date1:
        Year.append(int(yr.Date1))
except:
    for yr in Date2:
        Year.append(int(yr.Date2))
You can use the Series.str.extract string method to split the date strings up. Since the year can precede or follow the month, we can get a bit creative and have Year1 and Year2 columns, one for each position. Then use np.where to create a single Year column that pulls from whichever of these two columns is populated.
For example:
import numpy as np
split_dates = df["Date"].str.extract(r"(?P<Year1>\d+)?-?(?P<Month>\w+)-?(?P<Year2>\d+)?")
split_dates["Year"] = np.where(
split_dates["Year1"].notna(),
split_dates["Year1"],
split_dates["Year2"],
)
split_dates = split_dates[["Year", "Month"]]
With result for split_dates:
Year Month
0 18 Jan
1 18 Jan
2 18 Feb
3 18 Feb
4 17 Oct
5 17 Oct
Then you can merge back with your original dataframe with pd.merge, like so:
pd.merge(df, split_dates, how="inner", left_index=True, right_index=True)
Which yields:
Date Quantity Year Month
0 18-Jan 3476 18 Jan
1 18-Jan 20 18 Jan
2 18-Feb 789 18 Feb
3 18-Feb 409 18 Feb
4 Oct-17 81 17 Oct
5 Oct-17 640 17 Oct
Thank you for your help. I managed to get it working with what I've learned so far, i.e. for loop, if-else and split() and with the help of another expert.
# Split the Date column and store it in an array
dA = []
for dP in df.Date:
    dA.append(dP.split("-"))
# Append month and year to respective lists based on if conditions
Month = []
Year = []
for moYr in dA:
    if len(moYr[0]) == 2:
        Month.append(moYr[1])
        Year.append(moYr[0])
    else:
        Month.append(moYr[0])
        Year.append(moYr[1])
This took me hours!
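Try using Python's datetime.strptime(<date>, "%y-%b") on the Date column to convert it to a Python datetime, falling back to "%b-%y" when the first format does not match.
from datetime import datetime
def parse_dt(x):
    # strptime raises ValueError when the string does not match the format
    try:
        return datetime.strptime(x, "%y-%b")
    except ValueError:
        return datetime.strptime(x, "%b-%y")
df['timestamp'] = df['Date'].apply(parse_dt)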
df
Date Quantity timestamp
0 18-Jan 3476 2018-01-01
1 18-Jan 20 2018-01-01
2 18-Feb 789 2018-02-01
3 18-Feb 409 2018-02-01
4 Oct-17 81 2017-10-01
5 Oct-17 640 2017-10-01
Then you can just use .month and .year attributes, or if you prefer the month as its abbreviated form, use Python datetime.strftime('%b').
df['year'] = df.timestamp.apply(lambda x: x.year)
df['month'] = df.timestamp.apply(lambda x: x.strftime('%b'))
df
Date Quantity timestamp year month
0 18-Jan 3476 2018-01-01 2018 Jan
1 18-Jan 20 2018-01-01 2018 Jan
2 18-Feb 789 2018-02-01 2018 Feb
3 18-Feb 409 2018-02-01 2018 Feb
4 Oct-17 81 2017-10-01 2017 Oct
5 Oct-17 640 2017-10-01 2017 Oct
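A vectorised variant of the same idea is to parse the whole column once per format with errors="coerce" and then fill the gaps left by the first pass with the results of the second. This is a sketch assuming only the two formats from the question occur in the column.

```python
import pandas as pd

df = pd.DataFrame({"Date": ["18-Jan", "18-Feb", "Oct-17"]})

# Rows that do not match a format become NaT instead of raising
yy_first = pd.to_datetime(df["Date"], format="%y-%b", errors="coerce")
mon_first = pd.to_datetime(df["Date"], format="%b-%y", errors="coerce")

# Prefer the YY-Mmm result and fall back to Mmm-YY where it is missing
df["timestamp"] = yy_first.fillna(mon_first)
df["year"] = df["timestamp"].dt.year
df["month"] = df["timestamp"].dt.strftime("%b")
print(df)
```

Any value that matches neither format stays NaT, which makes unexpected entries easy to find with df["timestamp"].isna().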
With Pandas I am using this answer to clean up dates with a variety of formats. This works perfectly if I filter out the dates that are prior to 1677. However my dates are historic and many date before 1677 so I get an OutOfBoundsDatetime error.
My data contains dates like:
27 Feb 1928,
1920,
October 2000,
1500,
1625,
Mar 1723
I can see a reference here to using pd.Period but I don't know how to apply it to my case as the dates need to be cleaned first before I can adapt this sample
My code to clean the dates is:
df['clean_date'] = df.dates.apply(
lambda x: pd.to_datetime(x).strftime('%m/%d/%Y'))
df
I would like help to convert and clean my dates including the historic dates. Grateful for assistance with this.
As stated in the pandas documentation, the datetime64[ns] dtype cannot hold values outside the range ['1677-09-21 00:12:43.145225', '2262-04-11 23:47:16.854775807'].
But you can have such dates as Period dtype.
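The limits can be inspected directly, and a single out-of-range date can be represented as a Period. The snippet below is a small sketch; the date 1500-01-01 is just an example value.

```python
import pandas as pd

# Bounds of the nanosecond-resolution Timestamp type
print(pd.Timestamp.min)
print(pd.Timestamp.max)

# A daily Period can represent a date well before the lower bound
p = pd.Period("1500-01-01", freq="D")
print(p, p.year)
```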
Sample input dataset:
In [156]: df
Out[156]:
Date
0 27 Feb 1928
1 1920
2 October 2000
3 1500
4 1625
5 Mar 1723
In [157]: df.dtypes
Out[157]:
Date object
dtype: object
Solution:
In [158]: df["new"] = pd.PeriodIndex([pd.Period(d, freq="D") for d in df.Date])
Result:
In [159]: df
Out[159]:
Date new
0 27 Feb 1928 1928-02-27
1 1920 1920-01-01
2 October 2000 2000-10-01
3 1500 1500-01-01
4 1625 1625-01-01
5 Mar 1723 1723-03-01
In [160]: df.dtypes
Out[160]:
Date object
new period[D]
dtype: object
In [161]: df["new"].dt.year
Out[161]:
0 1928
1 1920
2 2000
3 1500
4 1625
5 1723
Name: new, dtype: int64
My source data has a column including the date information but it is a string type.
Typical lines are like this:
04 13, 2013
07 1, 2012
I am trying to convert it to a date format, so I used pandas' to_datetime function:
df['ReviewDate_formated'] = pd.to_datetime(df['ReviewDate'],format='%mm%d, %yyyy')
But I got this error message:
ValueError: time data '04 13, 2013' does not match format '%mm%d, %yyyy' (match)
My questions are:
How do I convert this to a date format?
I also want to extract Month, Year and Day columns because I need to do some month-over-month comparison. The problem is that the length of the string varies.
Your format string is incorrect; you want '%m %d, %Y'. The strftime/strptime format reference lists the valid format codes:
In [30]:
import io
import pandas as pd
t="""ReviewDate
04 13, 2013
07 1, 2012"""
df = pd.read_csv(io.StringIO(t), sep=';')
df
Out[30]:
ReviewDate
0 04 13, 2013
1 07 1, 2012
In [31]:
pd.to_datetime(df['ReviewDate'], format='%m %d, %Y')
Out[31]:
0 2013-04-13
1 2012-07-01
Name: ReviewDate, dtype: datetime64[ns]
To answer the second part, once the dtype is a datetime64 then you can call the vectorised dt accessor methods to get just the day, month, and year portions:
In [33]:
df['Date'] = pd.to_datetime(df['ReviewDate'], format='%m %d, %Y')
df['day'],df['month'],df['year'] = df['Date'].dt.day, df['Date'].dt.month, df['Date'].dt.year
df
Out[33]:
ReviewDate Date day month year
0 04 13, 2013 2013-04-13 13 4 2013
1 07 1, 2012 2012-07-01 1 7 2012
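For the month-over-month comparison mentioned in the question, one option is to group on a monthly period derived from the parsed dates. This is a sketch: the third sample row is invented so that one month has more than one review, and counting rows is only one possible aggregation.

```python
import pandas as pd

df = pd.DataFrame({"ReviewDate": ["04 13, 2013", "07 1, 2012", "04 2, 2013"]})
df["Date"] = pd.to_datetime(df["ReviewDate"], format="%m %d, %Y")

# Collapse each date to its year-month and count the rows per month
per_month = df.groupby(df["Date"].dt.to_period("M")).size()
print(per_month)
```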