Remove decimal from year value in a data frame - python

I currently have a 'Year Built' column in a df describing the year each building was built. When I imported the csv file, the years all had decimal places after them: 1920.0, 1985.0. How do I go about changing them into a datetime format, or just removing the decimal place?
df1['Year Built'].head()
0 1920.0
1 1985.0
2 NaN
3 1930.0
4 1985.0
Name: Year Built, dtype: float64
When I tried to use datetime...
df1['Year Built'] = pd.to_datetime(df1['Year Built'])
# check
df1['Year Built'].unique()
array(['1970-01-01T00:00:00.000001920', '1970-01-01T00:00:00.000001985',
'NaT', '1970-01-01T00:00:00.000001930',
'1970-01-01T00:00:00.000001986', '1970-01-01T00:00:00.000001987',
'1970-01-01T00:00:00.000001988', '1970-01-01T00:00:00.000001990',

Add the parameter format='%Y' to match YYYY, and errors='coerce' to convert values that do not match to the missing value NaT:
df1['Year Built'] = pd.to_datetime(df1['Year Built'], format='%Y', errors='coerce')
print (df1)
Year Built
0 1920-01-01
1 1985-01-01
2 NaT
3 1930-01-01
4 1985-01-01
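Depending on the pandas version, a float such as 1920.0 may be turned into the text '1920.0' before it is matched against %Y, in which case every row would be coerced to NaT. A minimal sketch that sidesteps this by casting to the nullable integer dtype first (the sample frame is made up to mirror the question's data):

```python
import numpy as np
import pandas as pd

# Hypothetical sample mirroring the question's float column with one missing value
df1 = pd.DataFrame({"Year Built": [1920.0, 1985.0, np.nan, 1930.0, 1985.0]})

# float -> nullable integer -> text ('1920', ..., '<NA>'), so no '.0' is left to confuse %Y
year_text = df1["Year Built"].astype("Int64").astype(str)

# Anything that is not a four-digit year (here the '<NA>' text) becomes NaT
years = pd.to_datetime(year_text, format="%Y", errors="coerce")
print(years)
```

Each valid year becomes 1 January of that year, and the missing row stays missing.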

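You can simply change them from float to int (and later to string if you won't be processing them as numbers). Because the column contains NaN, a plain astype(int) raises an error; the nullable 'Int64' dtype keeps the missing value as <NA>:
df1['Year Built'] = df1['Year Built'].astype('Int64')
Here is the link for more details: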
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.astype.html
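As a rough illustration of the float-to-integer route on a column that contains NaN (the sample data is invented to match the question's head() output):

```python
import numpy as np
import pandas as pd

# Hypothetical sample matching the question's head() output
df1 = pd.DataFrame({"Year Built": [1920.0, 1985.0, np.nan, 1930.0, 1985.0]})

# A plain integer cast cannot represent NaN, so it raises
try:
    df1["Year Built"].astype(int)
    plain_int_failed = False
except ValueError:
    plain_int_failed = True

# The nullable integer dtype keeps the gap as <NA>
years = df1["Year Built"].astype("Int64")

# Optional: text form if the years will not be used as numbers
as_text = years.astype("string")
print(years)
```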

Similar to Lion Nadej Ahmed's answer, you can use the dtype parameter of read_csv when you read in the data to prevent the year from becoming a float. Because the column has missing values, specify the nullable 'Int64' rather than int.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
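A small sketch of that idea, assuming the CSV stores the years as whole numbers with an empty cell for the missing one (the file contents and the id column are made up for illustration):

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for the real file
csv_text = "id,Year Built\n0,1920\n1,1985\n2,\n3,1930\n"

# Ask for the nullable integer dtype up front so the column never becomes float
df1 = pd.read_csv(io.StringIO(csv_text), dtype={"Year Built": "Int64"})
print(df1["Year Built"])
```

With a real file, pass its path in place of the StringIO object.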

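We can simply use pd.to_datetime, but convert the float years to whole-number strings and pass format='%Y' first; without a format, the numbers are read as nanoseconds since 1970-01-01, which is the result shown in the question.
import pandas as pd
df1['year_built'] = pd.to_datetime(df1['year_built'].astype('Int64').astype(str), format='%Y', errors='coerce')
print(df1)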

Related

how to enter manually a Python dataframe with daily dates in a correct format

I would like to (manually) create in Python a dataframe with daily dates (in column 'date') as per below code.
But the code does not produce the daily dates in the correct format; the dates are lost (the desired format is shown below).
Could you please advise how I can correct the code so that the 'date' column is entered in a desired format?
Thanks in advance!
------------------------------------------------------
desired format for date column
2021-03-22 3
2021-04-07 3
2021-04-18 3
2021-05-12 0
------------------------------------------------------
df1 = pd.DataFrame({"date": [2021-3-22, 2021-4-7, 2021-4-18, 2021-5-12],
"x": [3, 3, 3, 0 ]})
df1
date x
0 1996 3
1 2010 3
2 1999 3
3 2004 0
Python wants to interpret the numbers in the sequence 2021-3-22 as a series of mathematical operations 2021 minus 3 minus 22.
If you want that item to be stored as a string that resembles a date you will need to mark them as string literal datatype (str), as shown below by encapsulating them with quotes.
import pandas as pd
df1 = pd.DataFrame({"date": ['2021-3-22', '2021-4-7', '2021-4-18', '2021-5-12'],
"x": [3, 3, 3, 0 ]})
The results for the date column, as shown here indicate that the date column contains elements of the object datatype which encompasses str in pandas. Notice that the strings were created exactly as shown (2021-3-22 instead of 2021-03-22).
0 2021-3-22
1 2021-4-7
2 2021-4-18
3 2021-5-12
Name: date, dtype: object
If, however, you actually want them stored as datetime objects so that you can do datetime manipulations on them (e.g. determine the number of days between two dates, or filter by a specific month or year), then you need to convert the values to datetime objects.
This technique will do that:
df1['date'] = pd.to_datetime(df1['date'])
The results of this conversion are Pandas datetime objects which enable nanosecond precision (I differentiate this from Python datetime objects which are limited to microsecond precision).
0 2021-03-22
1 2021-04-07
2 2021-04-18
3 2021-05-12
Name: date, dtype: datetime64[ns]
Notice the displayed results are now formatted just as you would expect of datetimes (2021-03-22 instead of 2021-3-22).
Alternatively, create the column as datetime from the start by passing the date strings through pd.to_datetime when building the frame (more info in the pandas.to_datetime documentation):
df1 = pd.DataFrame({"date": pd.to_datetime(["2021-3-22", "2021-4-7", "2021-4-18", "2021-5-12"]),
"x": [3, 3, 3, 0 ]})
FWIW, I often use pd.read_csv(io.StringIO(text)) to copy/paste tabular-looking data into a DataFrame (for example, from SO questions).
Example:
import io
import re
import pandas as pd
def df_read(txt, **kwargs):
    txt = '\n'.join([s.strip() for s in txt.splitlines()])
    return pd.read_csv(io.StringIO(re.sub(r' +', '\t', txt)), sep='\t', **kwargs)
txt = """
date value
2021-03-22 3
2021-04-07 3
2021-04-18 3
2021-05-12 0
"""
df = df_read(txt, parse_dates=['date'])
>>> df
date value
0 2021-03-22 3
1 2021-04-07 3
2 2021-04-18 3
3 2021-05-12 0
>>> df.dtypes
date datetime64[ns]
value int64
dtype: object

How to count the occurrences of a string starts with a specific substring from comma separated values in a pandas data frame?

I am new to Python. I am working with a dataframe (360000 rows and 2 columns) that looks something like this:
business_id date
P01 2019-07-6 , 2018-06-05, 2019-07-06...
P02 2016-03-6 , 2019-04-10
P03 2019-01-02
The date column has dates separated by comma and dates from year 2010-2019. I am trying to count only the dates for each month that are in year 2019 for each business id. Specifically, I am looking for the output:
Can anyone please help me? Thanks.
You can do as follows
first use str.split to separate the dates in each cell to a list,
then explode to flatten the lists
convert to datetime with pd.to_datetime and extract the month
finally use pd.crosstab to pivot/count the months and join.
Altogether:
s = pd.to_datetime(df['date'].str.split(r'\s*,\s*').explode()).dt.to_period('M')
out = pd.crosstab(s.index, s)
# this gives the expected output
df.join(out)
Output (out):
date 2016-03 2018-06 2019-01 2019-04 2019-07
row_0
0 0 1 0 0 2
1 1 0 0 1 0
2 0 0 1 0 0
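The output above counts every year, while the question asks only for 2019. A minimal sketch of one way to restrict the count, assuming zero-padded dates (the question's sample has a few unpadded days such as 2019-07-6) and using made-up rows modeled on it:

```python
import pandas as pd

# Hypothetical rows modeled on the question, with zero-padded days
df = pd.DataFrame({
    "business_id": ["P01", "P02", "P03"],
    "date": ["2019-07-06 , 2018-06-05, 2019-07-06",
             "2016-03-06 , 2019-04-10",
             "2019-01-02"],
})

# One row per date, with a fresh unique index
exploded = (df.assign(date=df["date"].str.split(","))
              .explode("date")
              .reset_index(drop=True))
exploded["date"] = pd.to_datetime(exploded["date"].str.strip())

# Keep only 2019 and count dates per business and month
in_2019 = exploded[exploded["date"].dt.year == 2019].copy()
in_2019["month"] = in_2019["date"].dt.month
counts = pd.crosstab(in_2019["business_id"], in_2019["month"])
print(counts)
```

A business with no 2019 dates at all would be absent from counts; reindexing on the full list of business ids with fill_value=0 would restore it.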
If they are not datetime objects yet, you may want to start by converting the column (Series) to datetime with:
pd.to_datetime()
Note the format parameter.
Then you can access the datetime attributes through .dt, e.g.
df[df.COLUMN_NAME.dt.month == 5]

Pandas: how to change only one column which is in a series contain same column name [duplicate]

I have a Dataframe, df, with the following column:
df['ArrivalDate'] =
...
936 2012-12-31
938 2012-12-29
965 2012-12-31
966 2012-12-31
967 2012-12-31
968 2012-12-31
969 2012-12-31
970 2012-12-29
971 2012-12-31
972 2012-12-29
973 2012-12-29
...
The elements of the column are pandas.tslib.Timestamp.
I want to just include the year and month. I thought there would be simple way to do it, but I can't figure it out.
Here's what I've tried:
df['ArrivalDate'].resample('M', how = 'mean')
I got the following error:
Only valid with DatetimeIndex or PeriodIndex
Then I tried:
df['ArrivalDate'].apply(lambda(x):x[:-2])
I got the following error:
'Timestamp' object has no attribute '__getitem__'
Any suggestions?
Edit: I sort of figured it out.
df.index = df['ArrivalDate']
Then, I can resample another column using the index.
But I'd still like a method for reconfiguring the entire column. Any ideas?
If you want new columns showing year and month separately you can do this:
df['year'] = pd.DatetimeIndex(df['ArrivalDate']).year
df['month'] = pd.DatetimeIndex(df['ArrivalDate']).month
or...
df['year'] = df['ArrivalDate'].dt.year
df['month'] = df['ArrivalDate'].dt.month
Then you can combine them or work with them just as they are.
The df['date_column'] has to be in date time format.
df['month_year'] = df['date_column'].dt.to_period('M')
You can also use D for day, 2M for two months, and so on, for different sampling intervals. For time series data with timestamps, finer intervals such as 45Min (45 minutes) or 15Min (15 minutes) are possible.
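A short sketch of to_period on a small made-up frame; the column names other than date_column are only for illustration:

```python
import pandas as pd

# Hypothetical sample dates
df = pd.DataFrame({"date_column": pd.to_datetime(["2012-12-31", "2012-12-29", "2013-01-02"])})

# Monthly and quarterly periods derived from the same timestamps
df["month_year"] = df["date_column"].dt.to_period("M")
df["quarter"] = df["date_column"].dt.to_period("Q")

# Plain 'YYYY-MM' text, if a string column is preferred over a period dtype
df["month_year_str"] = df["month_year"].astype(str)
print(df)
```

Period columns still sort and group chronologically, which is the main reason to prefer them over strings.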
You can directly access the year and month attributes, or request a datetime.datetime:
In [15]: t = pandas.Timestamp.now()  # pandas.tslib is no longer available in current pandas
In [16]: t
Out[16]: Timestamp('2014-08-05 14:49:39.643701', tz=None)
In [17]: t.to_pydatetime() #datetime method is deprecated
Out[17]: datetime.datetime(2014, 8, 5, 14, 49, 39, 643701)
In [18]: t.day
Out[18]: 5
In [19]: t.month
Out[19]: 8
In [20]: t.year
Out[20]: 2014
One way to combine year and month is to make an integer encoding them, such as: 201408 for August, 2014. Along a whole column, you could do this as:
df['YearMonth'] = df['ArrivalDate'].map(lambda x: 100*x.year + x.month)
or many variants thereof.
I'm not a big fan of doing this, though, since it makes date alignment and arithmetic painful later and especially painful for others who come upon your code or data without this same convention. A better way is to choose a day-of-month convention, such as final non-US-holiday weekday, or first day, etc., and leave the data in a date/time format with the chosen date convention.
The calendar module is useful for obtaining the number value of certain days such as the final weekday. Then you could do something like:
import calendar
import datetime
df['AdjustedDateToEndOfMonth'] = df['ArrivalDate'].map(
    lambda x: datetime.datetime(
        x.year,
        x.month,
        # last Monday-to-Friday day number in the final week row of the month
        max(calendar.monthcalendar(x.year, x.month)[-1][:5])
    )
)
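For the simpler "first day of the month" convention mentioned above, a vectorized sketch that avoids map is possible; the MonthStart column name and the sample dates are made up:

```python
import pandas as pd

# Hypothetical sample dates
df = pd.DataFrame({"ArrivalDate": pd.to_datetime(["2012-12-31", "2012-12-29", "2013-01-15"])})

# Collapse each date to its month, then back to a timestamp at the start of that month
df["MonthStart"] = df["ArrivalDate"].dt.to_period("M").dt.to_timestamp()
print(df)
```

The result stays a datetime column, so date arithmetic and alignment keep working.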
If you happen to be looking for a way to solve the simpler problem of just formatting the datetime column into some stringified representation, for that you can just make use of the strftime function from the datetime.datetime class, like this:
In [5]: df
Out[5]:
date_time
0 2014-10-17 22:00:03
In [6]: df.date_time
Out[6]:
0 2014-10-17 22:00:03
Name: date_time, dtype: datetime64[ns]
In [7]: df.date_time.map(lambda x: x.strftime('%Y-%m-%d'))
Out[7]:
0 2014-10-17
Name: date_time, dtype: object
If you want the unique month-year pair, using apply is pretty sleek.
df['mnth_yr'] = df['date_column'].apply(lambda x: x.strftime('%B-%Y'))
This outputs month-year in one column.
Don't forget to convert the column to datetime first; I generally forget.
df['date_column'] = pd.to_datetime(df['date_column'])
Single line: adding a column with 'year-month' pairs
(pd.to_datetime first changes the column dtype to datetime before the operation):
df['yyyy-mm'] = pd.to_datetime(df['ArrivalDate']).dt.strftime('%Y-%m')
Accordingly for an extra 'year' or 'month' column:
df['yyyy'] = pd.to_datetime(df['ArrivalDate']).dt.strftime('%Y')
df['mm'] = pd.to_datetime(df['ArrivalDate']).dt.strftime('%m')
Extracting the year from dates such as '2018-03-04':
df['Year'] = pd.DatetimeIndex(df['date']).year
Assigning to df['Year'] creates a new column. To extract the month instead, use .month.
You can first convert your date strings with pandas.to_datetime, which gives you access to all of the numpy datetime and timedelta facilities. For example:
df['ArrivalDate'] = pandas.to_datetime(df['ArrivalDate'])
df['Month'] = df['ArrivalDate'].values.astype('datetime64[M]')
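@KieranPC's solution is the correct approach for Pandas, but is not easily extendible to arbitrary attributes. For this, you can use getattr within a generator expression and combine using pd.concat. (Note that .dt.weekofyear was removed in pandas 2.0; on current versions, drop it from the list or use .dt.isocalendar().week.)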
# input data
list_of_dates = ['2012-12-31', '2012-12-29', '2012-12-30']
df = pd.DataFrame({'ArrivalDate': pd.to_datetime(list_of_dates)})
# define list of attributes required
L = ['year', 'month', 'day', 'dayofweek', 'dayofyear', 'weekofyear', 'quarter']
# define generator expression of series, one for each attribute
date_gen = (getattr(df['ArrivalDate'].dt, i).rename(i) for i in L)
# concatenate results and join to original dataframe
df = df.join(pd.concat(date_gen, axis=1))
print(df)
ArrivalDate year month day dayofweek dayofyear weekofyear quarter
0 2012-12-31 2012 12 31 0 366 1 4
1 2012-12-29 2012 12 29 5 364 52 4
2 2012-12-30 2012 12 30 6 365 52 4
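On pandas versions where .dt.weekofyear is no longer available, a possible variant of the same idea takes the week number from .dt.isocalendar() instead. This is a sketch, and the attrs and parts names are only for illustration:

```python
import pandas as pd

list_of_dates = ["2012-12-31", "2012-12-29", "2012-12-30"]
df = pd.DataFrame({"ArrivalDate": pd.to_datetime(list_of_dates)})

# Attributes that still exist directly on the .dt accessor
attrs = ["year", "month", "day", "dayofweek", "dayofyear", "quarter"]
parts = pd.concat(
    [getattr(df["ArrivalDate"].dt, a).rename(a) for a in attrs],
    axis=1,
)

# ISO week number comes from isocalendar(), which returns year/week/day columns
parts["weekofyear"] = df["ArrivalDate"].dt.isocalendar().week

df = df.join(parts)
print(df)
```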
Thanks to jaknap32, I wanted to aggregate the results according to Year and Month, so this worked:
df_join['YearMonth'] = df_join['timestamp'].apply(lambda x:x.strftime('%Y%m'))
Output was neat:
0 201108
1 201108
2 201108
There are two steps to extract the year for the whole dataframe without using apply.
Step 1
Convert the column to datetime:
df['ArrivalDate']=pd.to_datetime(df['ArrivalDate'], format='%Y-%m-%d')
Step 2
Extract the year or the month using pd.DatetimeIndex:
pd.DatetimeIndex(df['ArrivalDate']).year
df['Month_Year'] = df['Date'].dt.to_period('M')
Result :
Date Month_Year
0 2020-01-01 2020-01
1 2020-01-02 2020-01
2 2020-01-03 2020-01
3 2020-01-04 2020-01
4 2020-01-05 2020-01
df['year_month']=df.datetime_column.apply(lambda x: str(x)[:7])
This worked fine for me. I didn't think pandas would interpret the resulting date string as a date, but when I made the plot, the year_month strings were ordered properly (strings of the form YYYY-MM sort chronologically)... gotta love pandas!
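The question's second attempt was:
df['ArrivalDate'].apply(lambda(x):x[:-2])
I think the proper input here should be a string, since a Timestamp cannot be sliced. In Python 3 syntax, and dropping the last three characters (such as '-31') so that 'YYYY-MM' remains:
df['ArrivalDate'].astype(str).apply(lambda x: x[:-3])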

Convert content of Object datatype to Date datatype in Python

I am using Jupyter Notebook, Pandas framework and Python as the programming language.
I have a dataframe which is of the following shape (10500, 4). So it has 4 columns and 10500 records.
Initial_Date is one out of the 4 columns which is an Object datatype. This is the type of information it contains:
Initial_Date
1971
11969
102006
03051992
00131954
27001973
45061987
1996
It is easy to make out the format of the column as DDMMYYYY (03051992 is 3rd May 1992)
Note: As you can see there are invalid MM (00 and 13) and invalid DD (00 and 45).
I would like to use regex to extract whatever is available in the field. I don't know how to read YYYY separately to MM or DD so please enlighten me here. After the extraction occurs, I would like to test whether the YYYY, DD and MM are valid. If either of them are not valid then assign NaT else DD-MM-YYYY or DD/MM/YYYY (not fussy with the end format).
For example: 051992 is considered as invalid since this becomes DD/05/1992
A field that has full 8 digits for example 10081996 is considered valid 10/08/1996
PS. I am starting out with Pandas and Jupyter Notebook, and slowly reviving my Python skills. FYI, if you think there is a better way to convert each field to a valid Date datatype, then please do enlighten me.
You can do it this way:
result = pd.to_datetime(d.Initial_Date.astype(str), dayfirst=True, errors='coerce')
result.loc[result.isnull()] = pd.to_datetime(d.Initial_Date.astype(str), format='%d%m%Y', dayfirst=True, errors='coerce')
# .ix was removed from pandas, so .loc is used; the second pass retries the failed rows with the explicit format %d%m%Y
result:
In [88]: result
Out[88]:
0 1971-01-01
1 NaT
2 2006-10-20
3 1992-03-05
4 1954-01-03
5 NaT
6 NaT
7 1996-01-01
Name: Initial_Date, dtype: datetime64[ns]
original DF
In [89]: d
Out[89]:
Initial_Date
0 1971
1 11969
2 102006
3 3051992
4 131954
5 27001973
6 45061987
7 1996
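If only complete eight-digit DDMMYYYY values should count as valid (my reading of the question; year-only entries such as 1971 then become NaT too), a stricter sketch is to filter with a regex first and parse with an explicit format. The values are kept as strings here so that leading zeros survive:

```python
import pandas as pd

# Sample values from the question, kept as text to preserve leading zeros
d = pd.DataFrame({"Initial_Date": ["1971", "11969", "102006", "03051992",
                                   "00131954", "27001973", "45061987", "1996"]})

s = d["Initial_Date"].astype(str)

# Keep only strings made of exactly eight digits; everything else becomes NaN
eight_digits = s.where(s.str.fullmatch(r"\d{8}"))

# Parse as day, month, year; impossible days or months are coerced to NaT
result = pd.to_datetime(eight_digits, format="%d%m%Y", errors="coerce")
print(result)
```

With this sample only 03051992 survives, as 1992-05-03. For a DD/MM/YYYY text form, result.dt.strftime('%d/%m/%Y') can be applied afterwards.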
