How Can I Extract The Year From Object Column?


I have a pandas data frame in Python and want to make a new column that holds only the year from an existing column.
The column is not in datetime format because of the country listed at the tail end of each value, and I've tried using split() like so:
df['new_column'] = df['column_name'].astype(str).split(",", 3)[2]
but apparently, that doesn't work on objects.
Again, the columns are listed like so:
October 1, 2020 (United States)
April 27, 2019 (Cameroon)
but are type object and not string.
It is primarily the differing lengths in the countries at the end that has kept me from pulling from index like so:
df['new_column'] = df['column_name'].astype(str).str[x:x]
Thank You!

You can convert a column to datetime with pandas.to_datetime(). You can either:
pass the format explicitly using strftime-style codes (see the pandas.to_datetime documentation), or
if you do not know the format, use infer_datetime_format=True. Be careful with this parameter, because it may read day and month in the wrong order. It is also deprecated since pandas 2.0, where format inference is the default behaviour.
After that, the year can be extracted as follows:
import pandas as pd
# Create sample df:
df = pd.DataFrame({
'id': [1, 2],
'date': [
'April 27, 2019 (Cameroon)',
'October 1, 2020 (United States)'
]
})
# Remove country names
df['new_date'] = df['date'].apply(lambda x: str(x).split(' (')[0])
print(df)
Output:
id date new_date
0 1 April 27, 2019 (Cameroon) April 27, 2019
1 2 October 1, 2020 (United States) October 1, 2020
Then new_date can be converted to datetime:
df['new_date'] = pd.to_datetime(df['new_date'], infer_datetime_format=True)
df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 2 non-null int64
1 date 2 non-null object
2 new_date 2 non-null datetime64[ns]
Then we can extract year from new_date:
df['year'] = df['new_date'].dt.year
Here is the final df:
id date new_date year
0 1 April 27, 2019 (Cameroon) 2019-04-27 2019
1 2 October 1, 2020 (United States) 2020-10-01 2020
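If only the year is needed, the intermediate columns can be skipped. The following is a minimal sketch, assuming every value contains exactly one four-digit year and follows the "Month day, year (Country)" pattern from the question. The column names year_direct and year_parsed are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'date': [
        'April 27, 2019 (Cameroon)',
        'October 1, 2020 (United States)'
    ]
})

# Option 1: pull the first run of four digits out of each string.
df['year_direct'] = df['date'].str.extract(r'(\d{4})', expand=False).astype(int)

# Option 2: strip the trailing " (Country)" part, parse with an explicit
# format, then read the year from the resulting datetime values.
without_country = df['date'].str.replace(r'\s*\(.*\)$', '', regex=True)
parsed = pd.to_datetime(without_country, format='%B %d, %Y')
df['year_parsed'] = parsed.dt.year

print(df)
```

Option 1 is shorter but trusts the regular expression. Option 2 fails loudly if a row does not match the expected date format, which may be preferable for messy data.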

Related

Converting different columns to a datetime

The time in my CSV file is split across four columns (year, Julian day, hour/minute (UTC) and second), and I want to combine them into a single column so that it looks like this: 14/11/2017 00:16:00.
Is there an easy way to do this?
A sample of the code is
cols = [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
D14 = pd.read_csv(r'C:\Users\William Jacondino\Desktop\DadosTimeSeries\PIRATA-PROFILE\Dados FLUXO\Dados_brutos_copy-20220804T151840Z-002\Dados_brutos_copy\tm_data_2017_11_14_0016.dat', header=None, usecols=cols, names=["Year","Julian day", "Hour/minut (UTC)", "Second", "Bateria (V)", "PTemp (°C)", "Latitude", "Longitude", "Magnectic_Variation (arb)", "Altitude (m)", "Course (º)", "WS", "Nmbr_of_Satellites (arb)", "RAD", "Tar", "UR", "slp",], sep=',')
D14 = D14.loc[:, ["Year","Julian day", "Hour/minut (UTC)", "Second", "Latitude", "Longitude","WS", "RAD", "Tar", "UR", "slp"]]
In the "Hour/minut (UTC)" column, the leading digits give the hour and the last two digits give the minute.
The column starts at 016, which means hour 0 UTC and minute 16,
and goes up to hour 12 UTC and minute 03.
I want to unify everything into a single datetime column, from the beginning to the end of the array:
1 - 2017
1412 - 14/11/2017 12:03:30
However, for hours 0 to 9 the "Hour/minut (UTC)" column holds only a single hour digit, like this:
9
instead of 09
How do I create the array with the correct datetime?

You can create a new column which also adds the data from other columns.
For example, if you have a dataframe like so:
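import pandas as pd
data = {'year': [2010, 2010, 2020], 'month': ['jan', 'feb', 'mar'], 'day': [1, 2, 3], 'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]}
df = pd.DataFrame(data)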
# Print df:
year month day a b c
0 2010 jan 1 1 4 7
1 2010 feb 2 2 5 8
2 2020 mar 3 3 6 9
You can add a new column to the DataFrame, with the values built from the year, month and day columns.
df['newColumn'] = df.year.astype(str) + '-' + df.month + '-' + df.day.astype(str)
Edit: In your situation, use df['Julian day'] instead of df.month. The column name is different and contains a space, so attribute access does not work for it.
The new column holds strings in whatever format you build. You can replace the dash '-' with a slash '/' or any other separator. You just need to convert the integers to strings with .astype(str) first.
Output:
year month day a b c newColumn
0 2010 jan 1 1 4 7 2010-jan-1
1 2010 feb 2 2 5 8 2010-feb-2
2 2020 mar 3 3 6 9 2020-mar-3
After that, you can work with it like any other DataFrame column.
If you only need it for data analysis, you can use .groupby(), which groups the rows and lets you aggregate each group.

If your dataframe looks like
import pandas as pd
df = pd.DataFrame({
"year": [2017, 2017], "julian day": [318, 318], "hour/minut(utc)": [16, 16],
"second": [0, 30],
})
year julian day hour/minut(utc) second
0 2017 318 16 0
1 2017 318 16 30
then you could use pd.to_datetime() and pd.to_timedelta() to do
df["datetime"] = (
pd.to_datetime(df["year"].astype("str"), format="%Y")
+ pd.to_timedelta(df["julian day"] - 1, unit="days")
+ pd.to_timedelta(df["hour/minut(utc)"], unit="minutes")
+ pd.to_timedelta(df["second"], unit="seconds")
).dt.strftime("%d/%m/%Y %H:%M:%S")
and get
year julian day hour/minut(utc) second datetime
0 2017 318 16 0 14/11/2017 00:16:00
1 2017 318 16 30 14/11/2017 00:16:30
The column datetime now contains strings. Remove the .dt.strftime("%d/%m/%Y %H:%M:%S") part at the end, if you want datetimes instead.
Regarding your comment: If I understand correctly, you could try the following:
df["hours_min"] = df["hour/minut(utc)"].astype("str").str.zfill(4)
df["hour"] = df["hours_min"].str[:2].astype("int")
df["minute"] = df["hours_min"].str[2:].astype("int")
df = df.drop(columns=["hours_min", "hour/minut(utc)"])
df["datetime"] = (
pd.to_datetime(df["year"].astype("str"), format="%Y")
+ pd.to_timedelta(df["julian day"] - 1, unit="days")
+ pd.to_timedelta(df["hour"], unit="hours")
+ pd.to_timedelta(df["minute"], unit="minutes")
+ pd.to_timedelta(df["second"], unit="seconds")
).dt.strftime("%d/%m/%Y %H:%M:%S")
Result for the sample dataframe df
df = pd.DataFrame({
"year": [2017, 2017, 2018, 2019], "julian day": [318, 318, 10, 50],
"hour/minut(utc)": [16, 16, 234, 1201], "second": [0, 30, 1, 2],
})
year julian day hour/minut(utc) second
0 2017 318 16 0
1 2017 318 16 30
2 2018 10 234 1
3 2019 50 1201 2
would be
year julian day second hour minute datetime
0 2017 318 0 0 16 14/11/2017 00:16:00
1 2017 318 30 0 16 14/11/2017 00:16:30
2 2018 10 1 2 34 10/01/2018 02:34:01
3 2019 50 2 12 1 19/02/2019 12:01:02
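The same result can be reached without the string round trip by splitting the HHMM value arithmetically. This is a sketch under the assumption that the column always holds an integer of the form hour * 100 + minute. The function name build_datetime is hypothetical.

```python
import pandas as pd

def build_datetime(df):
    """Combine year, day-of-year, HHMM and second columns into one datetime Series."""
    hhmm = df["hour/minut(utc)"]
    return (
        pd.to_datetime(df["year"].astype(str), format="%Y")  # 1 January of each year
        + pd.to_timedelta(df["julian day"] - 1, unit="D")    # day of year is 1-based
        + pd.to_timedelta(hhmm // 100, unit="h")             # leading digits: hour
        + pd.to_timedelta(hhmm % 100, unit="min")            # last two digits: minute
        + pd.to_timedelta(df["second"], unit="s")
    )

df = pd.DataFrame({
    "year": [2017, 2017, 2018, 2019], "julian day": [318, 318, 10, 50],
    "hour/minut(utc)": [16, 16, 234, 1201], "second": [0, 30, 1, 2],
})
df["datetime"] = build_datetime(df)
formatted = df["datetime"].dt.strftime("%d/%m/%Y %H:%M:%S").tolist()
print(formatted)
```

Integer division and modulo make the missing leading zero irrelevant, since 9 and 09 are the same number.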

Sorting the date column on a pandas dataframe that is not in date time format and has format mmm dd, yyyy

Hi all, I am working with a pandas dataframe that contains a date column. I would like to sort this column by date in ascending order, so that the most recent date is at the bottom of the dataframe. The problem I am running into is that the date column displays the dates in the following format:
"Nov 3, 2020"
How can I sort these dates? The advice I have found online is to convert the column to datetime, sort, and then change it back to this format. Is there a simpler way to do this? I have tried this
new_df.sort_values(by=["Date"],ascending=True)
where new_df is the dataframe, but this does not seem to work.
Any ideas on how I can do this? I essentially want the output to look something like
Date
----
Oct 31, 2020
Nov 1, 2020
Nov 12,2020
.
.
.

I would reformat the date column first, then convert to datetime, and then sort:
dates = ['Nov 1, 2020','Nov 12,2020','Oct 31, 2020']
df = pd.DataFrame({'Date':dates, 'Col1':[2,3,1]})
# Date Col1
# 0 Nov 1, 2020 2
# 1 Nov 12,2020 3
# 2 Oct 31, 2020 1
df['Date'] = pd.to_datetime(df['Date'].apply(lambda x: "-".join(x.replace(',',' ').split())))
df = df.sort_values('Date')
# Date Col1
# 2 2020-10-31 1
# 0 2020-11-01 2
# 1 2020-11-12 3
# and if you want to get the dates back in their original format
df['Date'] = df['Date'].apply(lambda x: "{} {}, {}".format(x.month_name()[:3],x.day,x.year))
# Date Col1
# 2 Oct 31, 2020 1
# 0 Nov 1, 2020 2
# 1 Nov 12, 2020 3

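Alternatively, pass a key function so that the rows are sorted by the parsed dates while the column keeps its original strings (sort_values returns a new DataFrame):
df.sort_values(by = "Date", key = pd.to_datetime)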
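A small runnable sketch of sorting by a key follows. It assumes all dates share the "Mon day, year" layout with a space after the comma, so the sample value 'Nov 12,2020' is written here as 'Nov 12, 2020'. The name sorted_df is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Date': ['Nov 1, 2020', 'Nov 12, 2020', 'Oct 31, 2020'],
    'Col1': [2, 3, 1],
})

# The key function receives the whole column as a Series and must return
# a Series of the same length; the rows are ordered by the returned values.
sorted_df = df.sort_values(
    by='Date',
    key=lambda s: pd.to_datetime(s, format='%b %d, %Y'),
)
print(sorted_df)
```

The Date column stays as strings, so there is no need to convert it back after sorting.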

Pandas "to_datetime" not accepting series

I am new to pandas and am trying to convert a column of strings with dates in the format '%d %B' (01 January, 02 January, ...) to datetime objects. The type of the column is <class 'pandas.core.series.Series'>.
If I pass this series to the to_datetime method, like
print(pd.to_datetime(data_file['Date'], format='%d %B', errors="coerce"))
it returns NaT for all the entries, whereas it should return datetime objects.
I checked the documentation and it says that it accepts a Series object.
Is there any way to fix this?
Edit 1:
Here is the head of the data I am using:
Date Daily Confirmed
0 30 January 1
1 31 January 0
2 01 February 0
3 02 February 1
4 03 February 1
edit 2: here is the information of the data
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 179 entries, 0 to 178
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 179 non-null object
1 Daily Confirmed 179 non-null int64
dtypes: int64(1), object(1)
memory usage: 2.2+ KB

If I understand correctly, you may be facing this issue because there are spaces around the dates in this column. To solve it, use strip before to_datetime. Here's a piece of code that does that:
df = pd.DataFrame({'Date':
['30 January ', '31 January ', ' 01 February ', '02 February',
'03 February'], 'Daily Confirmed': [1, 0, 0, 1, 1]})
pd.to_datetime(df.Date.str.strip(), format = "%d %B")
The output is:
0 1900-01-30
1 1900-01-31
2 1900-02-01
...

import pandas as pd
dic = {"Date": ["30 January", "31 January", "01 February", ] , "Daily Confirmed":[0,1,0]}
df =pd.DataFrame(dic)
df['date1'] = pd.to_datetime(df['Date'].astype(str), format='%d %B')
df
The year defaults to 1900 because the strings in your DataFrame do not contain a year.
Output:
Date Daily Confirmed date1
0 30 January 0 1900-01-30
1 31 January 1 1900-01-31
2 01 February 0 1900-02-01
If you don't want the year in the date, add the code below:
df['date2']=df['date1'].dt.strftime('%d-%m')
df
Date Daily Confirmed date1 date2
0 30 January 0 1900-01-30 30-01
1 31 January 1 1900-01-31 31-01
2 01 February 0 1900-02-01 01-02
Thanks

You may try this:
from datetime import datetime
df['datetime'] = df['Date'].apply(lambda x: datetime.strptime(x, "%d %B"))
apply() lets you run a Python function on each element of a Series. Here you may need to add the year to the string yourself, otherwise the default year (1900) is used.
Good luck
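To avoid the 1900 default, the year can be appended to each string before parsing. The sketch below assumes all rows belong to a single year. The value 2020 and the names assumed_year, cleaned and date_with_year are assumptions for illustration, since the question does not state the year.

```python
import pandas as pd

df = pd.DataFrame({
    'Date': ['30 January ', ' 31 January', '01 February'],
    'Daily Confirmed': [1, 0, 0],
})

assumed_year = 2020  # assumption: the real year is not given in the question

# Strip stray whitespace, append the year, then parse with a matching format.
cleaned = df['Date'].str.strip() + f' {assumed_year}'
df['date_with_year'] = pd.to_datetime(cleaned, format='%d %B %Y')
print(df)
```

Supplying a real year also matters for 29 February: 1900 was not a leap year, so that date cannot be parsed against the default year.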

Merge Separate columns of MM, DD, YYYY to a single column of YYYY-MM-DD using Python3.7

I have separate columns for DD, MM and YYYY.
The data is in a dataframe df, with separate Day, Month and Year columns in int64 format.
How do I merge them to create a column in YYYY-MM-DD format in Python?

You can use the to_datetime method
date_data_set = [{"day":1, "month":1, "year":2020}, {"day":2, "month":3, "year":2019}]
date_data_set
Out[40]: [{'day': 1, 'month': 1, 'year': 2020}, {'day': 2, 'month': 3, 'year': 2019}]
df = pd.DataFrame(date_data_set)
df
Out[42]:
day month year
0 1 1 2020
1 2 3 2019
df['date_data'] = pd.to_datetime(df['day'].astype("str")+"/"+df['month'].astype("str")+"/"+df["year"].astype("str"), format = "%d/%m/%Y")
df
Out[44]:
day month year date_data
0 1 1 2020 2020-01-01
1 2 3 2019 2019-03-02
df.dtypes
Out[52]:
day int64
month int64
year int64
date_data datetime64[ns]
dtype: object

Given test_df as below, you can pass the values of each column as arguments to dt.datetime or dt.date, depending on the data type you are looking for:
import pandas as pd
import datetime as dt
test_df = pd.DataFrame(data={'years':[2019, 2018, 2018],
'months':[10, 9, 10],
'day': [20, 20, 20]})
test_df['full_date']=[dt.datetime(year, month, day) for year, month,
day in zip(test_df['years'], test_df['months'], test_df['day'])]

By pure string manipulation given that you want the final result to be a string:
# Sample data.
df = pd.DataFrame({'Year': [2018, 2019], 'Month': [12, 1], 'Day': [25, 10]})
# Solution: assign the result back so that df keeps the new 'date' column.
df = df.assign(
    date=df.Year.astype(str)
    + '-' + df.Month.astype(str).str.zfill(2)
    + '-' + df.Day.astype(str).str.zfill(2)
)
print(df)
Year Month Day date
0 2018 12 25 2018-12-25
1 2019 1 10 2019-01-10
If you prefer Timestamps instead of strings, then you can easily convert them via:
df['date'] = pd.to_datetime(df['date'])

Use to_datetime with format parameter:
Using @emiljoj's setup,
test_df = pd.DataFrame(data={'years':[2019, 2018, 2018],
'months':[10, 9, 10],
'day': [20, 20, 20]})
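# Zero-pad month and day so that e.g. 2018-1-11 and 2018-11-1 cannot be confused
test_df['date'] = pd.to_datetime(test_df['years'].astype('str')+
                                 test_df['months'].astype('str').str.zfill(2)+
                                 test_df['day'].astype('str').str.zfill(2),
                                 format='%Y%m%d')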
Output:
years months day date
0 2019 10 20 2019-10-20
1 2018 9 20 2018-09-20
2 2018 10 20 2018-10-20
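pandas.to_datetime() can also assemble dates directly from a DataFrame whose columns are named year, month and day, which avoids building strings by hand. A minimal sketch, assuming the capitalised column names from the question (renamed to lowercase here) and made-up output column names:

```python
import pandas as pd

df = pd.DataFrame({'Year': [2018, 2019], 'Month': [12, 1], 'Day': [25, 10]})

# to_datetime assembles a datetime from columns named year, month and day.
parts = df[['Year', 'Month', 'Day']].rename(columns=str.lower)
df['date'] = pd.to_datetime(parts)

# Optional: a string column in YYYY-MM-DD format.
df['date_str'] = df['date'].dt.strftime('%Y-%m-%d')
print(df)
```

Keep the datetime column if you plan to sort or filter by date, and use the string column only for display or export.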

how to convert a string type to date format

My source data has a column including the date information but it is a string type.
Typical lines are like this:
04 13, 2013
07 1, 2012
I am trying to convert it to a date format, so I used pandas' to_datetime function:
df['ReviewDate_formated'] = pd.to_datetime(df['ReviewDate'],format='%mm%d, %yyyy')
But I got this error message:
ValueError: time data '04 13, 2013' does not match format '%mm%d, %yyyy' (match)
My questions are:
How do I convert the column to a date format?
How do I extract separate Month, Year and Day columns? I need them for a month-over-month comparison, but the length of the string varies.

Your format string is incorrect; you want '%m %d, %Y'. The strftime/strptime reference in the Python documentation lists the valid format codes:
In [30]:
import io
import pandas as pd
t="""ReviewDate
04 13, 2013
07 1, 2012"""
df = pd.read_csv(io.StringIO(t), sep=';')
df
Out[30]:
ReviewDate
0 04 13, 2013
1 07 1, 2012
In [31]:
pd.to_datetime(df['ReviewDate'], format='%m %d, %Y')
Out[31]:
0 2013-04-13
1 2012-07-01
Name: ReviewDate, dtype: datetime64[ns]
To answer the second part, once the dtype is a datetime64 then you can call the vectorised dt accessor methods to get just the day, month, and year portions:
In [33]:
df['Date'] = pd.to_datetime(df['ReviewDate'], format='%m %d, %Y')
df['day'],df['month'],df['year'] = df['Date'].dt.day, df['Date'].dt.month, df['Date'].dt.year
df
Out[33]:
ReviewDate Date day month year
0 04 13, 2013 2013-04-13 13 4 2013
1 07 1, 2012 2012-07-01 1 7 2012
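For the month-over-month comparison mentioned in the question, one possible approach is to group by a year-month period instead of separate columns. This is only a sketch: counting reviews per month is an assumed metric, and the names year_month and reviews_per_month are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'ReviewDate': ['04 13, 2013', '07 1, 2012']})
df['Date'] = pd.to_datetime(df['ReviewDate'], format='%m %d, %Y')

# Collapse each date to its calendar month, e.g. 2013-04.
df['year_month'] = df['Date'].dt.to_period('M')

# Count rows per month, ordered chronologically.
reviews_per_month = df.groupby('year_month').size().sort_index()
print(reviews_per_month)
```

A period column keeps year and month together, so December of one year and January of the next sort in the right order.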
