Incoherent handling of datetime objects by the DataFrame.to_markdown method - python

Main objective
I want to transform a pandas.DataFrame with a single column containing datetime.datetime objects into its markdown representation, using the pandas.DataFrame.to_markdown method.
Issue
The markdown table does not display the dates as desired: it shows a raw timestamp instead of the expected YYYY-MM-DD HH:mm:SS. How can I make it display the dates in the usual format?
Code
from datetime import datetime
import pandas as pd
df: pd.DataFrame = pd.DataFrame({
    "date": [
        datetime(year=2022, month=1, day=1, hour=1, minute=1, second=1),
        datetime(year=2022, month=6, day=2, hour=2, minute=2, second=2),
        datetime(year=2022, month=10, day=3, hour=3, minute=3, second=3)
    ]
})
print(df.to_markdown())
Displays
|    |        date |
|---:|------------:|
|  0 | 1.641e+18   |
|  1 | 1.65414e+18 |
|  2 | 1.66477e+18 |
Why is this "incoherent"?
When I first had to display this DataFrame, I had to add a column to it, in which I inserted the year of the corresponding Timestamp object. I thus had 2 columns, containing pandas._libs.tslibs.timestamps.Timestamp and numpy.int64 objects respectively.
When converted to markdown, it produced the desired effect by formatting the Timestamp as expected.
Code
# To add after the previous code
df["year"] = df.date.apply(lambda x: x.year)
print(df.to_markdown())
Displays
|    | date                |   year |
|---:|:--------------------|-------:|
|  0 | 2022-01-01 01:01:01 |   2022 |
|  1 | 2022-06-02 02:02:02 |   2022 |
|  2 | 2022-10-03 03:03:03 |   2022 |
Lead
By checking the types with the pandas.DataFrame.info method, and by calling type on the content of individual cells, I observed that the reported types are not always the same. Is that normal?
For instance, calling type on the content of a year cell shows that these cells contain numpy.int64 objects, while the info method displays the column as int64.
Additionally, the date column is shown as datetime64[ns] by the info method, while type says the cells are pandas._libs.tslibs.timestamps.Timestamp objects.
Could it have any influence whatsoever?

Related

Trouble selecting records based on date in python/pandas

I have a column in my dataframe df_events called 'Program Date Time'. I've successfully created separate EventDate and EventTime columns from it using df_events.ProgramDateTime.dt.date and df_events.ProgramDateTime.dt.time.
My problem occurs when I try to select records between two dates. I seem to run into all kinds of type errors whatever I try.
I'm a relatively new Python/pandas user, only recently familiar with dataframes. I'm using Python 3.7.
I have tried using strptime and even just trying to select records based on the original column ProgramDateTime.
I'm also writing this code in Sublime Text.
import pandas as pd, numpy as np
from datetime import datetime
File_Path = 'path'
Event_csv = 'file.csv'
df_event = pd.read_csv(File_Path+Event_csv)
# Indicate analysis period
StartDate = datetime.strptime('2018-08-08', '%Y-%m-%d')
EndDate = datetime.strptime('2019-07-01', '%Y-%m-%d')
# Change appropriate column in Events dataframe to make sure in Datetime format.
df_event['ProgramDateTime'] = pd.to_datetime(df_event['ProgramDateTime'])
#Create separate columns for Event Date and Time in dataframe
df_event['EventDate'], df_event['EventTime'] = df_event.ProgramDateTime.dt.date, df_event.ProgramDateTime.dt.time
# Create dataframe of programs occurring only during analysis period
df_event_ap = df_event[df_event['EventDate']>=StartDate and df_event['EventDate']<=EndDate]
print(df_event_ap.dtypes)
print(df_event_ap.head(11))
I expect to see a new dataframe, df_events_ap, containing only those records that are between StartDate and EndDate.
Instead, the problem happens just as Python's supposed to select the records (the code underneath the last comment (#) line.)
I get this error:
TypeError: can't compare datetime.datetime to datetime.date
The first thing that I can spot is your df_event['EventDate']. This needs to be once again converted into datetime format:
df_event['EventDate'] = pd.to_datetime(df_event['EventDate'])
Then do:
from datetime import datetime
StartDate = datetime.strptime('2018-08-08', '%Y-%m-%d')
EndDate = datetime.strptime('2019-07-01', '%Y-%m-%d')
Now that StartDate, EndDate and df_event['EventDate'] are all in the same format, you have to do:
df_event_ap = df_event[(df_event['EventDate']>=StartDate) & (df_event['EventDate']<=EndDate)]
You will now get both outputs:
Output for first print statement:
print(df_event_ap.dtypes)
ProgramDateTime datetime64[ns]
EventDate datetime64[ns]
EventTime object
dtype: object
Output for second print statement:
print(df_event_ap.head(11))
ProgramDateTime EventDate EventTime
0 2018-12-20 12:46:52 2018-12-20 12:46:52
2 2018-12-25 12:46:52 2018-12-25 12:46:52
4 2018-11-20 12:46:52 2018-11-20 12:46:52
5 2018-12-10 12:46:52 2018-12-10 12:46:52
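An equivalent filter, for what it's worth, is Series.between, which applies both bounds in one call (inclusive on both ends by default) and avoids the parenthesised `&` expression entirely. A minimal sketch with hypothetical data:

```python
import pandas as pd

df_event = pd.DataFrame({
    "ProgramDateTime": pd.to_datetime([
        "2018-12-20 12:46:52",  # inside the analysis period
        "2019-08-01 09:00:00",  # outside it
    ])
})

# between() compares against both bounds at once; the date strings
# are coerced to timestamps for the comparison
mask = df_event["ProgramDateTime"].between("2018-08-08", "2019-07-01")
df_event_ap = df_event[mask]
print(df_event_ap)
```

This keeps only the first row, since 2019-08-01 falls after EndDate.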

KeyError: Timestamp when converting date in column to date

I'm trying to convert the dates (type datetime) of a complete column into date objects, to use in a condition later on. The following error keeps showing up:
KeyError: Timestamp('2010-05-04 10:15:55')
Tried multiple things but I'm currently stuck with the code below.
for d in df.column:
    pd.to_datetime(df.column[d]).apply(lambda x: x.date())
Also, how do I format the column so I can use it in a statement as follows:
df = df[df.column > 2015-05-28]
Just adding an answer in case anyone else ends up here:
firstly, let's create a dataframe with some dates, change the dtype into a string, and convert it back. The errors='ignore' argument will ignore any non-datetime values in your column, so if you had 'John Smith' in row x it would remain; in the same vein, errors='coerce' would change 'John Smith' into NaT (not a time value).
# Create date range with frequency of a day
rng = pd.date_range(start='01/01/18', end ='01/01/19',freq='D')
#pass this into a dataframe
df = pd.DataFrame({'Date' : rng})
print(df.dtypes)
Date datetime64[ns]
# okay, let's cast this into a str so we can convert it back
df['Date'] = df['Date'].astype(str)
print(df.dtypes)
Date object
# now lets convert it back #
df['Date'] = pd.to_datetime(df.Date,errors='ignore')
print(df.dtypes)
Date datetime64[ns]
# Okay lets slice the data frame for your desired date ##
print(df.loc[df.Date > '2018-12-29'])
Date
363 2018-12-30
364 2018-12-31
365 2019-01-01
The answer as provided by @Datanovice:
pd.to_datetime(df['your column'],errors='ignore')
then inspect the dtype; it should be a datetime. If so, just do:
df.loc[df['your column'] > 'your-date']

How to add last day of the month for each month in python

I have my data in the following format:
final.head(5)
(Head of the data, displaying sales for each month from May 2015)
I want to add the last day of the month for each record and want an output like this
transactionDate sale_price_after_promo
05/30/2015 30393.8
06/31/2015 24345.68
07/30/2015 26688.91
08/31/2015 46626.1
09/30/2015 27933.84
10/31/2015 76087.55
I tried this
pd.Series(pd.DatetimeIndex(start=final.start_time, end=final.end_time, freq='M')).to_frame('transactionDate')
But getting an error
'DataFrame' object has no attribute 'start_time'
Create a PeriodIndex and then convert it with to_timestamp:
df = pd.DataFrame({'transactionDate':['2015-05','2015-06','2015-07']})
df['date'] = pd.PeriodIndex(df['transactionDate'], freq='M').to_timestamp(how='end')
print (df)
transactionDate date
0 2015-05 2015-05-31
1 2015-06 2015-06-30
2 2015-07 2015-07-31
I am attempting to dynamically convert all date columns to YYYY-MM-DD format in a dataframe that comes from read_csv. The columns are below.
input
empno,ename,hiredate,report_date,end_date
1,sreenu,17-Jun-2021,18/06/2021,May-22
output
empno,ename,hiredate,report_date,end_date
1,sreenu,2021-06-17,2021-06-18,2022-05-31
The rules are:
1. if the date is MMM-YY or MM-YYYY (May-22 or 05-2022), then use the last day of the month (YYYY-MM-DD format: 2022-05-31)
2. anything other than point 1 should be parsed as a full date and formatted YYYY-MM-DD
Now I want to create a method/function to identify all date columns in the dataframe and convert them to YYYY-MM-DD (or a user-expected) format.
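A sketch of such a function under the stated rules (`to_iso` and the accepted formats are illustrative assumptions, not an existing API). The month-only patterns are tried explicitly first, because a generic parser could misread "May-22" as the 22nd of May; only then does it fall back to ordinary inference:

```python
import pandas as pd

def to_iso(val):
    s = str(val).strip()
    # Rule 1: month-only patterns ("May-22", "05-2022") -> last day of month
    for fmt in ("%b-%y", "%m-%Y"):
        try:
            month = pd.Period(pd.to_datetime(s, format=fmt), freq="M")
            return month.to_timestamp(how="end").strftime("%Y-%m-%d")
        except ValueError:
            pass
    # Rule 2: anything else is parsed as a full date
    try:
        return pd.to_datetime(s, dayfirst=True).strftime("%Y-%m-%d")
    except ValueError:
        return val  # leave non-dates (e.g. names) untouched

print(to_iso("May-22"))       # 2022-05-31
print(to_iso("17-Jun-2021"))  # 2021-06-17
print(to_iso("18/06/2021"))   # 2021-06-18
```

Detecting which columns are dates could then amount to applying to_iso to each object column and checking how many values change, though that heuristic is an assumption about the data.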

I get an error when I calculate time intervals using diff(periods=1) in pandas

I have a pandas dataframe that records times of events occurring between today's 08:00 AM and tomorrow's 07:00 AM, each day (therefore I don't want to store date values, to save storage and keep it simple to maintain). It looks like this:
>>> df.Time[63010:]
63010 23:59:59.431256 # HH:MM:SS.ffffff
63011 23:59:59.431256
63012 23:59:59.431256
63013 23:59:59.431256
63014 23:59:59.431256
63015 23:59:59.618764
63016 23:59:59.821756
63017 23:59:59.821756
63018 23:59:59.821756
63019 23:59:59.821756
63020 00:00:00.025058 # date changes here
63021 00:00:00.025058
63022 00:00:00.025058
63023 00:00:00.228202
63024 00:00:00.228202
63025 00:00:00.228202
63026 00:00:00.228202
.....
I want to make a new dataframe that records time intervals between each event, so I tried:
>>> TimeDiff = df.Time.diff(periods=1)
But it produces a value that I don't intend to get, which is:
63018 00:00:00
63019 00:00:00
63020 -1 days +00:00:00.203302 <-- -1 days?
63021 00:00:00
63022 00:00:00
I know that it happens because I don't have date values. How can I fix this problem without adding dates?
If you know that your error is due to missing date values, then you should try pandas' built-in function to_datetime:
Example: df['date_col'] = pd.to_datetime(df['date_col'])
you can also adjust the format of the date by adding a format argument like so:
Example: df['date_col'] = pd.to_datetime(df['date_col'], format="%m/%d/%Y")
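Since the stated constraint is to keep date-free times, another sketch (not from the answer above, and assuming consecutive events are never more than 24 hours apart): a negative diff can only mean the sequence crossed midnight, so add one day back to just those entries:

```python
import pandas as pd

times = pd.Series(pd.to_timedelta([
    "23:59:59.821756",
    "00:00:00.025058",  # date changes here
    "00:00:00.228202",
]))

diff = times.diff()
# Only the midnight-crossing diff is negative; shift it forward by one day
diff = diff.mask(diff < pd.Timedelta(0), diff + pd.Timedelta(days=1))
print(diff)
```

The first entry stays NaT, and the -1 days entry becomes 00:00:00.203302 as expected.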

How do I change the Date but not the Time of a Timestamp within a dataframe column?

Python 3.6.0
I am importing a file with Unix timestamps.
I’m converting them to Pandas datetime and rounding to 10 minutes (12:00, 12:10, 12:20,…)
The data is collected from within a specified time period, but from different dates.
For our analysis, we want to change all dates to the same dates before doing a resampling.
At present we have a reduce_to_date that is the target for all dates.
current_date = pd.to_datetime('2017-04-05') #This will later be dynamic
reduce_to_date = current_date - pd.DateOffset(days=7)
I’ve tried to find an easy way to change the date in a series without changing the time.
I was trying to avoid lengthy conversions with .strftime().
One method that I've almost settled on is to add the difference between reduce_to_date and df['Timestamp'] to df['Timestamp']. However, I was trying to use the .date() function, and that only works on a single element, not on the series.
GOOD!
passed_df['Timestamp'][0] = passed_df['Timestamp'][0] + (reduce_to_date.date() - passed_df['Timestamp'][0].date())
NOT GOOD
passed_df['Timestamp'][:] = passed_df['Timestamp'][:] + (reduce_to_date.date() - passed_df['Timestamp'][:].date())
AttributeError: 'Series' object has no attribute 'date'
I can use a loop:
x = 1
for line in passed_df['Timestamp']:
    passed_df['Timestamp'][x] = line + (reduce_to_date.date() - line.date())
    x += 1
But this throws a warning:
C:\Users\elx65i5\Documents\Lightweight Logging\newmain.py:60: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
The goal is to have all dates the same, but leave the original time.
If we can simply specify the replacement date, that’s great.
If we can use mathematics and change each date according to a time delta, equally as great.
Can we accomplish this in a vectorized fashion without using .strftime() or a lengthy procedure?
If I understand correctly, you can simply subtract an offset
passed_df['Timestamp'] -= pd.offsets.Day(7)
demo
passed_df = pd.DataFrame(dict(
    Timestamp=pd.to_datetime(['2017-04-05 15:21:03', '2017-04-05 19:10:52'])
))
# Make sure your `Timestamp` column is datetime.
# Mine is because I constructed it that way.
# Use
# passed_df['Timestamp'] = pd.to_datetime(passed_df['Timestamp'])
passed_df['Timestamp'] -= pd.offsets.Day(7)
print(passed_df)
Timestamp
0 2017-03-29 15:21:03
1 2017-03-29 19:10:52
using strftime
Though this is not ideal, I wanted to make the point that you absolutely can use strftime. When your column is datetime, you can use strftime via the .dt accessor with dt.strftime. You can create a dynamic column where you specify the target date like this:
pd.to_datetime(passed_df.Timestamp.dt.strftime('{} %H:%M:%S'.format('2017-03-29')))
0 2017-03-29 15:21:03
1 2017-03-29 19:10:52
Name: Timestamp, dtype: datetime64[ns]
I think you need to convert df['Timestamp'].dt.date with to_datetime, because the output of date is a python date object, not a pandas datetime object:
df=pd.DataFrame({'Timestamp':pd.to_datetime(['2017-04-05 15:21:03','2017-04-05 19:10:52'])})
print (df)
Timestamp
0 2017-04-05 15:21:03
1 2017-04-05 19:10:52
current_date = pd.to_datetime('2017-04-05')
reduce_to_date = current_date - pd.DateOffset(days=7)
df['Timestamp'] = df['Timestamp'] - pd.to_datetime(df['Timestamp'].dt.date) + reduce_to_date
print (df)
            Timestamp
0 2017-03-29 15:21:03
1 2017-03-29 19:10:52
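Another vectorised spelling of the same idea, for comparison: dt.normalize() truncates each timestamp to its own midnight, so the subtraction leaves a pure time-of-day that can be re-attached to the target date (a sketch, reusing the data from the question):

```python
import pandas as pd

df = pd.DataFrame({
    'Timestamp': pd.to_datetime(['2017-04-05 15:21:03', '2017-04-05 19:10:52'])
})
current_date = pd.to_datetime('2017-04-05')
reduce_to_date = current_date - pd.DateOffset(days=7)

# time-of-day = timestamp minus its own midnight; re-attach the target date
df['Timestamp'] = reduce_to_date + (df['Timestamp'] - df['Timestamp'].dt.normalize())
print(df)
```

This avoids the intermediate python date objects entirely, since dt.normalize() stays within datetime64.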
