Pandas Python: KeyError Date - python

I am importing a CSV file into Python and would like a datetime object to be created automatically: the first column should be a datetime object in Python. The data looks like:
Date,cost
41330.66667,100
41331.66667,101
41332.66667,102
41333.66667,103
Current code looks like:
from datetime import datetime
import pandas as pd
data = pd.read_csv(r"F:\Sam\PJ\CSV2.csv")
data['Date'].apply(lambda x: datetime.strptime(x, '%d/%m/%Y'))
print(data)

This looks like an Excel datetime format, known as a serial date. To convert from that serial date you can do this:
data['Date'].apply(lambda x: datetime.fromtimestamp( (x - 25569) *86400.0))
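Which outputs (the hour shown depends on your local time zone, because datetime.fromtimestamp returns local time):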
>>> data['Date'].apply(lambda x: datetime.fromtimestamp( (x - 25569) *86400.0))
0 2013-02-25 10:00:00.288
1 2013-02-26 10:00:00.288
2 2013-02-27 10:00:00.288
3 2013-02-28 10:00:00.288
To assign it to data['Date'] you just do:
data['Date'] = data['Date'].apply(lambda x: datetime.fromtimestamp( (x - 25569) *86400.0))
#df
Date cost
0 2013-02-25 16:00:00.288 100
1 2013-02-26 16:00:00.288 101
2 2013-02-27 16:00:00.288 102
3 2013-02-28 16:00:00.288 103
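If the time-zone dependence of datetime.fromtimestamp is a concern, the same conversion can be sketched with plain date arithmetic. This assumes the serial values count days from 1899-12-30; the helper name excel_serial_to_datetime is made up for illustration and is not part of any library.

```python
from datetime import datetime, timedelta

# Assumption: serial numbers count days from 1899-12-30 (Excel's 1900 date system).
EXCEL_EPOCH = datetime(1899, 12, 30)


def excel_serial_to_datetime(serial):
    """Convert an Excel serial date (days plus fractional day) to a naive datetime."""
    return EXCEL_EPOCH + timedelta(days=serial)


# First value from the question's sample data.
converted = excel_serial_to_datetime(41330.66667)
print(converted)
```

With pandas, the helper could be applied as data['Date'].apply(excel_serial_to_datetime), and the result no longer depends on the machine's time zone.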

Unfortunately, read_csv does not cope with date columns given as numbers.
But the good news is that Pandas does have a suitable function to do it.
After read_csv call:
df.Date = pd.to_datetime(df.Date - 25569, unit='D').dt.round('ms')
As I understand it, your Date is actually the number of days since 30.12.1899
(plus the fractional part of the day).
The above "correction factor" (25569) is the number of days between that date
and 1970-01-01, which is where to_datetime starts counting. For Date == 0 it gives
exactly the start of the Excel epoch.
Rounding to milliseconds (or maybe even seconds) is advisable.
Otherwise you will get odd results caused by the inexact floating-point
representation of the fractional part of the day.
E.g. 0.33333333, corresponding to 8 hours, is computed as
07:59:59.999712.
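As a minimal sketch of the approach above, the snippet below runs the conversion on two rows from the question and compares it with an alternative that passes the Excel epoch through the origin parameter of pd.to_datetime instead of subtracting 25569 by hand. Rounding to seconds is a choice made here for readability.

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({"Date": [41330.66667, 41331.66667], "cost": [100, 101]})

# Variant 1: shift to the Unix epoch by hand, as in the answer above.
by_offset = pd.to_datetime(df["Date"] - 25569, unit="D").dt.round("s")

# Variant 2: let pandas count the days from the Excel epoch directly.
by_origin = pd.to_datetime(df["Date"], unit="D", origin="1899-12-30").dt.round("s")

print(by_origin)
```

Both variants should agree after rounding; the origin form just keeps the magic number out of the code.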

Well, you have two problems here.
1. We don't know what data and columns the CSV has, but for pandas to pick up the date as a column, it must be a column in that CSV file.
2. apply doesn't work in place. You have to assign the result of apply back to the column:
data['Date'] = data['Date'].apply(lambda x: datetime.strptime(x, '%d/%m/%Y'))
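The point about apply not working in place can be checked with a small sketch. The data here is hypothetical (day/month/year strings, which is what the strptime format in the question expects), and pd.to_datetime is used for the final assignment as a vectorized stand-in for the apply call.

```python
import pandas as pd

# Hypothetical data: dates stored as day/month/year strings.
data = pd.DataFrame({"Date": ["25/02/2013", "26/02/2013"], "cost": [100, 101]})

# The result is computed and then thrown away; `data` itself is unchanged.
data["Date"].apply(lambda x: pd.to_datetime(x, format="%d/%m/%Y"))
still_text = not pd.api.types.is_datetime64_any_dtype(data["Date"])

# Assigning the result back is what actually converts the column.
data["Date"] = pd.to_datetime(data["Date"], format="%d/%m/%Y")
now_datetime = pd.api.types.is_datetime64_any_dtype(data["Date"])

print(still_text, now_datetime)
```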

Related

python pandas converting UTC integer to datetime

I am calling some financial data from an API which stores the time values as (I think) UTC timestamps, for example 1645804609719.
I cannot seem to convert the entire column into a usable date. I can do it for a single value using the following code, so I know this works, but I have thousands of rows with this problem and thought pandas would offer an easier way to update all the values.
from datetime import datetime
tx = int('1645804609719')/1000
print(datetime.utcfromtimestamp(tx).strftime('%Y-%m-%d %H:%M:%S'))
Any help would be greatly appreciated.
Simply use pandas.Series.apply:
df['date'] = df.date.apply(lambda x: datetime.utcfromtimestamp(int(x)/1000).strftime('%Y-%m-%d %H:%M:%S'))
Another way to do it is by using pd.to_datetime as recommended by Panagiotos in the comments:
df['date'] = pd.to_datetime(df['date'],unit='ms')
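A minimal end-to-end sketch of the pd.to_datetime route, assuming the column arrives as strings of epoch milliseconds (the sample values are ones that appear in this thread). The strings are converted with pd.to_numeric first, since unit is meant for numeric input.

```python
import pandas as pd

# Assumption: the API delivers epoch milliseconds as strings.
df = pd.DataFrame({"date": ["1584199972000", "1645804609719"]})

# Strings must become numbers before `unit` can be applied.
df["date"] = pd.to_datetime(pd.to_numeric(df["date"]), unit="ms")

# Optional: render as text in the format used in the question.
formatted = df["date"].dt.strftime("%Y-%m-%d %H:%M:%S")
print(formatted.tolist())
```

Keeping the column as datetime64 (rather than formatted strings) is usually the more useful choice, since it allows sorting, resampling and date arithmetic.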
You can use "to_numeric" to convert the column to numbers, "div" to divide it by 1000, and finally a loop over the dataframe column with datetime to get the format you want.
import pandas as pd
from datetime import datetime
df = pd.DataFrame({'date': ['1584199972000', '1645804609719'], 'values': [30,40]})
# strings -> numbers -> seconds; object dtype lets the column hold the formatted strings later
df['date'] = pd.to_numeric(df['date']).div(1000).astype(object)
for i in range(len(df)):
    # format each UTC timestamp as 'YYYY-MM-DD HH:MM:SS'
    df.iloc[i,0] = datetime.utcfromtimestamp(df.iloc[i,0]).strftime('%Y-%m-%d %H:%M:%S')
print(df)
Output:
date values
0 2020-03-14 15:32:52 30
1 2022-02-25 15:56:49 40

Modifying format of rows values in Pandas Data-frame

I have a dataset of 70000+ data points (see picture).
As you can see, in the 'date' column half of the values are in a different (messier) format than the other half (cleaner). How can I convert the whole column to the format used in the second half of my data frame?
I know how to do it manually, but it will take ages!
Thanks in advance!
EDIT
df['date'] = df['date'].apply(lambda x: dt.datetime.fromtimestamp(int(str(x)) / 1000).strftime('%Y-%m-%d %H:%M:%S') if str(x).isdigit() else x)
Date is in a strange format
EDIT 2
two data formats:
2012-01-01 00:00:00
2020-07-21T22:45:00+00:00
I've tried the below and it works. Note that it relies on two key assumptions:
1- Your date format follows one and ONLY ONE of the TWO formats in your example!
2- The final output is a string!
If so, this should do the trick; otherwise, it's a starting point that can be altered to look the way you want:
import pandas as pd
import datetime
#data sample
d = {'date':['20090602123000', '20090602124500', '2020-07-22 18:45:00+00:00', '2020-07-22 19:00:00+00:00']}
#create dataframe
df = pd.DataFrame(data = d)
print(df)
date
0 20090602123000
1 20090602124500
2 2020-07-22 18:45:00+00:00
3 2020-07-22 19:00:00+00:00
#loop over records
for i, row in df.iterrows():
    #get date
    dateString = df.at[i,'date']
    #check if it's the undesired format or the desired format
    #NOTE: I'm using the '+' substring to identify that; this relies on my first assumption above that you only have two formats
    if '+' not in dateString:
        #reformat datetime
        #NOTE: this relies on my second assumption: the result is built as a string so that '+00:00' can be appended
        #df.at assigns directly; the chained df['date'].loc[...] = ... may write to a temporary copy
        df.at[i,'date'] = str(datetime.datetime.strptime(dateString, '%Y%m%d%H%M%S')) + '+00:00'
    else:
        continue
print(df)
date
0 2009-06-02 12:30:00+00:00
1 2009-06-02 12:45:00+00:00
2 2020-07-22 18:45:00+00:00
3 2020-07-22 19:00:00+00:00
You can format the first part of your dataframe:
import datetime as dt
df['date'] = df['date'].apply(lambda x: dt.datetime.fromtimestamp(int(str(x)) / 1000).strftime('%Y-%m-%d %H:%M:%S') if str(x).isdigit() else x)
This checks whether all characters of the value are digits and, if so, formats the date like the second part.
EDIT
The timestamps seem to be in milliseconds, while fromtimestamp expects seconds, hence the / 1000.
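For 70000+ rows a vectorized variant may be preferable to a row-by-row loop. The sketch below uses the sample data from the first answer and assumes that the digit-only rows are exactly 14 digits (YYYYMMDDHHMMSS) and represent UTC; if the digit-only rows are epoch milliseconds instead, the first pd.to_datetime call would need unit='ms' on numeric input rather than a format string.

```python
import pandas as pd

df = pd.DataFrame({"date": ["20090602123000", "20090602124500",
                            "2020-07-22 18:45:00+00:00", "2020-07-22 19:00:00+00:00"]})

s = df["date"].astype(str)
# Assumption: "compact" rows are exactly 14 digits (YYYYMMDDHHMMSS) and are in UTC.
is_compact = s.str.fullmatch(r"\d{14}")

# Parse each group with the parser that suits it.
compact = pd.to_datetime(s[is_compact], format="%Y%m%d%H%M%S", utc=True)
iso = pd.to_datetime(s[~is_compact], utc=True)

# Both pieces keep their original index, so the assignment realigns the rows.
df["date"] = pd.concat([compact, iso])
print(df)
```

Unlike the string-based answers, this leaves the column as timezone-aware datetimes, which can still be formatted with .dt.strftime if text output is required.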

Select nearest date first day of month in a python dataframe

I have this kind of dataframe.
These data represent the value of a consumption index, generally recorded once a month (at the end of the month or at the beginning of the following one) but sometimes more often. The value can be reset to "0" if the counter fails and is replaced. Moreover, for some months no data is available.
I would like to select only one entry per month, but this entry has to be the nearest to the first day of the month AND before the 15th day of the month (because if the day is later it could be the end-of-month measurement). Another condition is that if the difference between two values is negative (the counter has been replaced), the value needs to be kept even if its date is not the nearest to the first day of the month.
For example, the output data needs to be
The purpose is to calculate only one consumption value per month.
One solution is to parse the dataframe (as an array) and apply some if statements. However, I wonder if there is a "simple" alternative to achieve that.
Thank you
You can normalize the month data with MonthEnd and then drop duplicates based on that column, keeping the last value of each month.
from pandas.tseries.offsets import MonthEnd
# assumes the dates are in a datetime column named 'Index'; new columns need bracket assignment
df['New'] = df['Index'] + MonthEnd(1)
df['Diff'] = abs((df['Index'] - df['New']).dt.days)
# sort_values takes a list of column names
df = df.sort_values(['New', 'Diff'])
df = df.drop_duplicates(subset='New', keep='first').drop(['New','Diff'], axis=1)
That should do the trick, but I was not able to test it, so please copy and paste the sample data into Stack Overflow if this isn't doing the job.
Defining dataframe, converting index to datetime, defining helper columns,
using them to run shift method to conditionally remove rows, and finally removing the helper columns:
from pandas.tseries.offsets import MonthEnd, MonthBegin
import pandas as pd
from datetime import datetime as dt
import numpy as np
df = pd.DataFrame([
[1254],
[1265],
[1277],
[1301],
[1345],
[1541]
], columns=["Value"]
, index=[dt.strptime("05-10-19", '%d-%m-%y'),
dt.strptime("29-10-19", '%d-%m-%y'),
dt.strptime("30-10-19", '%d-%m-%y'),
dt.strptime("04-11-19", '%d-%m-%y'),
dt.strptime("30-11-19", '%d-%m-%y'),
dt.strptime("03-02-20", '%d-%m-%y')
]
)
early_days = df.loc[df.index.day < 15]
early_month_end = early_days.index - MonthEnd(1)
early_day_diff = early_days.index - early_month_end
late_days = df.loc[df.index.day >= 15]
late_month_end = late_days.index + MonthBegin(1)
late_day_diff = late_month_end - late_days.index
df["day_offset"] = (early_day_diff.append(late_day_diff) / np.timedelta64(1, 'D')).astype(int)
df["start_of_month"] = df.index.day < 15
df["month"] = df.index.values.astype('M8[D]').astype(str)
df["month"] = df["month"].str[5:7].str.lstrip('0')
# df["month_diff"] = df["month"].astype(int).diff().fillna(0).astype(int)
df = df[df["month"].shift().ne(df["month"].shift(-1))]
df = df.drop(columns=["day_offset", "start_of_month", "month"])
print(df)
Returns:
Value
2019-10-05 1254
2019-10-30 1277
2019-11-04 1301
2019-11-30 1345
2020-02-03 1541
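For comparison, here is a minimal sketch of only the first rule from the question: one entry per month, taken before the 15th and nearest to the first day. It reuses the sample data from the answer above and deliberately ignores the counter-reset rule, so its result differs from the output shown there.

```python
import pandas as pd

df = pd.DataFrame(
    {"Value": [1254, 1265, 1277, 1301, 1345, 1541]},
    index=pd.to_datetime(["2019-10-05", "2019-10-29", "2019-10-30",
                          "2019-11-04", "2019-11-30", "2020-02-03"]),
).sort_index()

# Keep only readings taken before the 15th of their month.
early = df[df.index.day < 15]

# The index is sorted, so the first remaining row of each calendar month
# is the one nearest to the 1st.
monthly = early.groupby(early.index.to_period("M")).head(1)
print(monthly)
```

Handling the reset rule would need an extra step, for example flagging rows where Value.diff() is negative and adding them back before deduplicating.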

Convert from float to datetime in Python

I have a dataframe column whose dtype is float64 and I want to change it to datetime64. But every result comes out as the same day, 1970-01-01, no matter which method I use. Any help, please?
df.product_first_sold_date = [41245,0, 37659.0,40487.0,41701.0,40649.0]
dt.cv = pd.to_datetime(df.product_first_sold_date)
dt.cv
dt.cv2 = df.product_first_sold_date.apply(lambda x: datetime.fromtimestamp(x).strftime('%m-%d-%Y') if x==x else None)
dt.cv2
I believe you're dealing with the Excel date type, which is the number of days since 1900-01-01 or, as #Dishin pointed out, 1899-12-30.
# sample data:
df = pd.DataFrame({'date':[41245,37659,40487]})
# convert - use 1899-12-30 rather than 1900-01-01 as the starting day
df['date'] = pd.to_timedelta(df.date, unit='D') + pd.to_datetime('1899-12-30')
Output:
date
0 2012-12-02
1 2003-02-07
2 2010-11-05
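The 1970-01-01 results in the question can be reproduced with a short sketch: without a unit, pd.to_datetime reads plain numbers as nanoseconds since 1970-01-01, so values around 40000 land a few microseconds after the epoch. The sample values come from the question; the NaN row is added here to show that missing values pass through as NaT.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"product_first_sold_date": [41245.0, 37659.0, np.nan]})

# Without a unit, the floats are read as nanoseconds since 1970-01-01.
naive = pd.to_datetime(df["product_first_sold_date"])

# Treat the numbers as days since 1899-12-30 instead; NaN becomes NaT.
fixed = pd.to_timedelta(df["product_first_sold_date"], unit="D") + pd.Timestamp("1899-12-30")

print(naive)
print(fixed)
```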

Accessing the datetime.time (00:00 - 23:59) format in a numeric data type

I have 24-hour averaged data indexed from 00:00 to 23:59 at 1-minute intervals. This gives 1440 data points, one per minute. I want to map these timestamps to numerical indices ranging from 0 to 1439 (as there are 1440 minutes in a day).
For example, 00:00->0, 00:01->1, 00:02->2 ... 23:58->1438, 23:59->1439
time = 01:11 dtype:datetime.time
time.func()
71
I searched for such functionality in pandas for the datetime.time format, but couldn't find any.
If pandas has no built-in functionality for this, the other way might be to write a function that maps a given datetime.time to an index (0-1439).
pandas doesn't have a native time dtype, but it does have timedelta:
In [11]: t = dt.time(10, 15)
In [12]: t.hour * 60 + t.minute # total minutes (this may suffice!)
Out[12]: 615
In [13]: pd.to_timedelta((t.hour * 60 + t.minute), unit='m')
Out[13]: Timedelta('0 days 10:15:00')
Note: You may be able to work from timedelta from the start (either in parsing or by calculation):
In [14]: pd.to_timedelta('10:15:00')
Out[14]: Timedelta('0 days 10:15:00')
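Applied to a whole column of datetime.time objects, the hour * 60 + minute idea might look like the sketch below. The Series contents are made up for illustration, and the name minute_of_day is arbitrary.

```python
import datetime as dt

import pandas as pd

# Hypothetical column of datetime.time values.
times = pd.Series([dt.time(0, 0), dt.time(1, 11), dt.time(23, 59)])

# Map each time of day to its minute index within the day (0..1439).
minute_of_day = times.map(lambda t: t.hour * 60 + t.minute)
print(minute_of_day.tolist())
```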
Take a look and see if this is what you wanted:
import pandas as pd
df = pd.DataFrame(['23:57', '10:39', '4:03'], columns=['Time'])
This data frame looks like:
Time
0 23:57
1 10:39
2 4:03
Then we can apply this function on our column:
df['Time'].apply(lambda x: int(pd.to_timedelta(pd.to_datetime(x, format='%H:%M').strftime('%H:%M:00')).total_seconds()/60))
Of which the output is:
0 1437
1 639
2 243
Name: Time, dtype: int64
Here we use apply to apply the same function to all elements of the column:
1. Convert to datetime format (here I specified the format using '%H:%M' to make sure the time is explicitly parsed as hours and minutes).
2. Format the time with the additional seconds element by adding ':00' using strftime, because pd.to_timedelta wants the time in 'hh:mm:ss' format.
3. Get the total_seconds() of the timedelta and divide by 60 to get minutes.
4. Convert to an integer to get your final format.
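If the times are plain 'H:MM' strings as in this example, a simpler vectorized sketch is to split on the colon and do the arithmetic directly, without going through datetime or timedelta at all. The column name minute_of_day is an arbitrary choice.

```python
import pandas as pd

df = pd.DataFrame(['23:57', '10:39', '4:03'], columns=['Time'])

# Split "H:MM" into two integer columns: 0 holds hours, 1 holds minutes.
parts = df['Time'].str.split(':', expand=True).astype(int)

# Minutes since midnight.
df['minute_of_day'] = parts[0] * 60 + parts[1]
print(df)
```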
