I have a basic code snippet that I need to recreate in pandas:
import datetime as dt
date1 = dt.date(2016,10,10)
date2 = dt.date.today()
print('Week#', round((date2 - date1).days / 7 +.5))
# output: Week# 36
I have a datetime64[ns] datatype column and I cannot crack it. Using this basic example I'm stumped:
import pandas as pd
import numpy as np
import datetime as dt
dfp = pd.DataFrame({'A' : [dt.date(2016,10,6)]})
dfp['A'] = pd.to_datetime(dfp['A'])
def week(col):
    print((col.dt.date - dt.date(2015, 10, 6)))
week(dfp['A']) #output: 366 days
When I try re-creating the week number calculation I'm running into multiple errors: print((col.dt.date - dt.date(2015, 10, 6)).days) returns AttributeError: 'Series' object has no attribute 'days'
I'd like to try and solve this on my own so I can learn from the pain.
How do I return the pandas column values in terms of "number of days" or as an int like using the first calculation in the first code snippet? (ie, instead of 366 days, just 366)
If you're feeling adventurous, how could I get the Week# xxx output in a more efficient way?
I think you forgot .dt on the result of the subtraction (.days is not a Series attribute, but .dt.days is):
dfp = pd.DataFrame({'A' : [date2]})
dfp['A'] = pd.to_datetime(dfp['A'])
print (dfp)
print (((dfp['A'].dt.date - dt.date(2016, 10, 10)).dt.days / 7 + .5).round().astype(int))
0 36
Name: A, dtype: int32
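As a hedged alternative, the subtraction can stay in datetime64 the whole way by subtracting a pd.Timestamp instead of going through .dt.date. The result is then a timedelta64 Series, so .dt.days is available directly. A minimal sketch; the second date and the names start, days and weeks are made up for illustration:

```python
import pandas as pd

# Sample column; the second date is illustrative only
dfp = pd.DataFrame({'A': pd.to_datetime(['2016-10-06', '2017-06-14'])})

start = pd.Timestamp('2015-10-06')

# datetime64 Series minus Timestamp -> timedelta64 Series
days = (dfp['A'] - start).dt.days

# Same rounding rule as the original snippet: days / 7 + 0.5, rounded
weeks = (days / 7 + 0.5).round().astype(int)

print(days.tolist())   # whole days as plain integers
print(weeks.tolist())
```

Staying in datetime64 also avoids building a Python date object for every row.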
Ultimately I want to calculate the number of days to the last day of the month from every date in df['start'] and populate the 'count' column with the result.
As a first step towards that goal, the calendar.monthrange method takes (year, month) arguments and returns a (weekday of the first day, number of days) tuple.
I seem to be making a general mistake when applying functions to DataFrame or Series objects, and I would like to understand why this isn't working.
import numpy as np
import pandas as pd
import calendar
def last_day(row):
    return calendar.monthrange(row['start'].dt.year, row['start'].dt.month)
This line raises an AttributeError: "Timestamp object has no attribute 'dt'":
df['count'] = df.apply(last_day, axis=1)
This is what my dataframe looks like:
start count
0 2016-02-15 NaN
1 2016-02-20 NaN
2 2016-04-23 NaN
df.dtypes
start datetime64[ns]
count float64
dtype: object
Remove the .dt. The accessor is only needed on a whole Series (a vector of datetimes). An individual element is already a Timestamp, which exposes .year and .month directly:
Code:
def last_day(row):
    return calendar.monthrange(row['start'].year, row['start'].month)
Why:
With axis=1, apply calls last_day once per row and passes each row as a Series:
df['count'] = df.apply(last_day, axis=1)
In last_day you then select a single element of the series:
row['start'].year
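The two access patterns can be compared side by side. The sketch below also shows one possible vectorized route to the stated goal (days remaining until the end of the month) via the .dt.days_in_month accessor, which avoids apply altogether. The frame is rebuilt from the question's sample dates; the names years, first and n_days are illustrative:

```python
import calendar
import pandas as pd

df = pd.DataFrame(
    {'start': pd.to_datetime(['2016-02-15', '2016-02-20', '2016-04-23'])}
)

# Whole column: a Series, so datetime fields go through the .dt accessor
years = df['start'].dt.year

# Single element: a Timestamp, so the fields are plain attributes
first = df['start'].iloc[0]
weekday, n_days = calendar.monthrange(first.year, first.month)

# Vectorized: days left until the last day of each month
df['count'] = df['start'].dt.days_in_month - df['start'].dt.day

print(df)
```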
I would do it like this:
import numpy as np
import pandas as pd
from pandas.tseries.offsets import MonthEnd
## sample data
d = pd.DataFrame({'start':['2016-02-15','2016-02-20','2016-04-23']})
## solution
d['start'] = pd.to_datetime(d['start'])
d['end'] = d['start'] + MonthEnd(1)
d['count'] = (d['start'] - d['end']) / np.timedelta64(-1, 'D')
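One caveat worth checking with this approach: adding MonthEnd(1) to a date that already falls on a month end rolls it to the end of the next month, whereas MonthEnd(0) leaves such dates in place. A small sketch with made-up dates (the column names end_1 and end_0 are illustrative):

```python
import pandas as pd
from pandas.tseries.offsets import MonthEnd

# Second date is already the last day of February 2016
d = pd.DataFrame({'start': pd.to_datetime(['2016-02-15', '2016-02-29'])})

d['end_1'] = d['start'] + MonthEnd(1)  # month-end dates jump to next month
d['end_0'] = d['start'] + MonthEnd(0)  # month-end dates stay where they are

# Days remaining until the end of the same month
d['count'] = (d['end_0'] - d['start']).dt.days

print(d)
```

Which of the two is wanted depends on whether a start date on the last day of the month should count as zero days remaining.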
I have this kind of dataframe:
This data represents the value of a consumption index, generally recorded once a month (at the end of the month or at the beginning of the following one) but sometimes more often. The value can be reset to "0" if the counter fails and is replaced. Moreover, for some months no data is available.
I would like to select only one entry per month, but this entry has to be the nearest to the first day of the month AND before the 15th day of the month (because a later day could be the end-of-month measurement). Another condition is that if the difference between two values is negative (the counter has been replaced), the value needs to be kept even if its date is not the nearest to the first day of the month.
For example, the output data need to be
The purpose is to calculate only a consumption per month.
One solution is to iterate over the dataframe (as an array) and apply some if statements. However, I wonder if there is a "simple" alternative to achieve that.
Thank you
You can normalize the month data with MonthEnd and then drop duplicates based off that column and keep the last value.
from pandas.tseries.offsets import MonthEnd
# assumes the dates are in a datetime column named 'Index'
df['New'] = df['Index'] + MonthEnd(1)
df['Diff'] = (df['Index'] - df['New']).dt.days.abs()
df = df.sort_values(['New', 'Diff'])
df = df.drop_duplicates(subset='New', keep='first').drop(['New','Diff'], axis=1)
That should do the trick, but I was not able to test it, so please copy and paste the sample data into Stack Overflow if this isn't doing the job.
Define the dataframe, convert the index to datetime, add helper columns, use the shift method on them to conditionally remove rows, and finally drop the helper columns:
from pandas.tseries.offsets import MonthEnd, MonthBegin
import pandas as pd
from datetime import datetime as dt
import numpy as np
df = pd.DataFrame(
    [
        [1254],
        [1265],
        [1277],
        [1301],
        [1345],
        [1541],
    ],
    columns=["Value"],
    index=[
        dt.strptime("05-10-19", '%d-%m-%y'),
        dt.strptime("29-10-19", '%d-%m-%y'),
        dt.strptime("30-10-19", '%d-%m-%y'),
        dt.strptime("04-11-19", '%d-%m-%y'),
        dt.strptime("30-11-19", '%d-%m-%y'),
        dt.strptime("03-02-20", '%d-%m-%y'),
    ],
)
early_days = df.loc[df.index.day < 15]
early_month_end = early_days.index - MonthEnd(1)
early_day_diff = early_days.index - early_month_end
late_days = df.loc[df.index.day >= 15]
late_month_end = late_days.index + MonthBegin(1)
late_day_diff = late_month_end - late_days.index
df["day_offset"] = (early_day_diff.append(late_day_diff) / np.timedelta64(1, 'D')).astype(int)
df["start_of_month"] = df.index.day < 15
df["month"] = df.index.values.astype('M8[D]').astype(str)
df["month"] = df["month"].str[5:7].str.lstrip('0')
# df["month_diff"] = df["month"].astype(int).diff().fillna(0).astype(int)
df = df[df["month"].shift().ne(df["month"].shift(-1))]
df = df.drop(columns=["day_offset", "start_of_month", "month"])
print(df)
Returns:
Value
2019-10-05 1254
2019-10-30 1277
2019-11-04 1301
2019-11-30 1345
2020-02-03 1541
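For comparison, here is a minimal sketch of only the first rule from the question (one reading per month, the earliest one taken before the 15th). It does not handle the counter-reset rule, so its result differs from the output above. The sample values are copied from the answer; early, months and first_per_month are illustrative names:

```python
import pandas as pd

df = pd.DataFrame(
    {'Value': [1254, 1265, 1277, 1301, 1345, 1541]},
    index=pd.to_datetime(['2019-10-05', '2019-10-29', '2019-10-30',
                          '2019-11-04', '2019-11-30', '2020-02-03']),
).sort_index()

# Keep only readings taken before the 15th of the month
early = df[df.index.day < 15]

# Label each remaining row with its year-month period
months = early.index.to_period('M')

# Keep the first (earliest) reading within each period
first_per_month = early[~months.duplicated(keep='first')]

print(first_per_month)
```

Months with no reading before the 15th simply do not appear in the result.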
I have a dataframe with the date and month_diff variable. I would like to get a new date (name it as Target_Date) based on the following logic:
For example, the date is 2/13/2019, month_diff is 3, then the target date should be the month-end of the original date plus 3 months, which is 5/31/2019
I tried the following method to get the target date first:
df["Target_Date"] = df["Date"] + pd.DateOffset(months = df["month_diff"])
But it failed. As far as I know, the parameter in DateOffset has to be a single variable or a fixed number, not a column.
I also tried:
df["Target_Date"] = df["Date"] + relativedelta(months = df["month_diff"])
It fails too.
Can anyone help? Thank you.
Edit: this is a large dataset with millions of rows.
You could try this
import pandas as pd
from dateutil.relativedelta import relativedelta
# pd.datetime is not available in recent pandas versions; pd.Timestamp is used instead
df = pd.DataFrame({'Date': [pd.Timestamp(2019,1,1), pd.Timestamp(2019,2,1)], 'month_diff': [1,2]})
df.apply(lambda row: row.Date + relativedelta(months=row.month_diff), axis=1)
Or use a list comprehension:
[date + relativedelta(months=month_diff) for date, month_diff in df[['Date', 'month_diff']].values]
I would compute your target date as follows:
1. Apply the target month offset (in your case +3 months), using your pd.DateOffset.
2. Get the last day of that target month (for example with calendar.monthrange; see also "Get last day of the month"). This gives you the "flexible" part of the date offset.
3. Apply the flexible day offset, i.e. the difference between the results of step 1 and step 2. This could be a new pd.DateOffset.
A solution could look something like this:
import calendar
from dateutil.relativedelta import relativedelta
for ii in df.index:
    new_ = df.at[ii, 'start_date'] + relativedelta(months=df.at[ii, 'month_diff'])
    max_date = calendar.monthrange(new_.year, new_.month)[1]
    end_ = new_ + relativedelta(days=max_date - new_.day)
    print(end_)
Further "cleaning" into a function and / or list comprehension will probably make it much faster
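Since the question mentions millions of rows, one possible way to avoid a Python-level loop over every row is to group by the distinct month_diff values and apply a single vectorized offset per group. This is only a sketch under that assumption (it pays off when there are few distinct values); the sample dates and the name parts are illustrative:

```python
import pandas as pd
from pandas.tseries.offsets import MonthEnd

df = pd.DataFrame({
    'Date': pd.to_datetime(['2019-02-13', '2019-01-31', '2019-11-30']),
    'month_diff': [3, 1, 3],
})

# One vectorized computation per distinct month_diff value:
# shift by n months, then roll forward to the month end
# (MonthEnd(0) leaves dates that are already a month end unchanged)
parts = [
    group['Date'] + pd.DateOffset(months=int(n)) + MonthEnd(0)
    for n, group in df.groupby('month_diff')
]

# concat keeps the original row labels, so assignment aligns on the index
df['Target_Date'] = pd.concat(parts)

print(df)
```

For the example in the question, 2/13/2019 with month_diff 3 becomes 5/31/2019.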
import pandas as pd
from datetime import datetime
from datetime import timedelta
This is my approach in solving your issue.
However, the output is not what I expect (a logic error rather than an exception), even though I believe the approach is correct. Please correct me if you notice something wrong.
today = datetime.now()
today = today.strftime("%d/%m/%Y")
month_diff =[30,5,7]
n = 30
for i in month_diff:
    b = {'Date': today, 'month_diff':month_diff,"Target_Date": datetime.now()+timedelta(days=i*n)}
df = pd.DataFrame(data=b)
Output:
For some reason the i is not getting updated.
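A likely explanation is that each pass through the loop rebuilds the whole dictionary with a single scalar Target_Date, which pandas broadcasts to every row, so only the value from the last i survives. A hedged sketch that computes one target per row instead, keeping the answer's approximation of n = 30 days per month (the fixed date is made up so the result is reproducible):

```python
from datetime import datetime
import pandas as pd

today = datetime(2019, 2, 13)  # fixed date for reproducibility (illustrative)
month_diff = [30, 5, 7]
n = 30  # approximate days per month, as in the snippet above

# The scalar date is broadcast to every row; month_diff supplies the rows
df = pd.DataFrame({'Date': today, 'month_diff': month_diff})

# Element-wise: each row gets its own offset of month_diff * n days
df['Target_Date'] = df['Date'] + pd.to_timedelta(df['month_diff'] * n, unit='D')

print(df)
```

Note that a fixed 30-day month is only an approximation and will not land on calendar month ends.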
I was looking for a solution I can write in one line only and apply does the job. However, by default apply function performs action on each column, so you have to remember to specify correct axis: axis=1.
from datetime import datetime
from dateutil.relativedelta import relativedelta
# Create a new column with date adjusted by number of months from 'month_diff' column and later adjust to the last day of month
df['Target_Date'] = df.apply(lambda row: row.Date # to current date
+ relativedelta(months=row.month_diff) # add month_diff
+ relativedelta(day=+31) # and adjust to the last day of month
, axis=1) # 1 or ‘columns’: apply function to each row.
My dataframe has two columns. When I subtract them to get the number of months in between, I get some weird numbers. Here is an example:
test = pd.DataFrame({'reg_date': [datetime(2017,3,1), datetime(2016,9,1)],
'leave_date':[datetime(2017,7,1), datetime(2017,6,1)]})
test['diff_month'] = test.leave_date.dt.month - test.reg_date.dt.month
test
The output:
If a user's registration date is in the previous year, I get a negative number, which is incorrect.
What operations should I perform to get the correct time difference in month between two datetime column?
Update: I changed the example a bit so it reflects more about the issue I am facing. Don't down vote so fast guys.
A hack I did to fix this is:
test['real_diff'] = test.diff_month.apply(lambda x: x if x > 0 else 12+x)
I don't like the hack so I am curious if there is any other way of doing it.
IIUC you can call apply and use relativedelta as @zipa suggested:
In[29]:
from dateutil import relativedelta
test['real_diff'] = test.apply(lambda row: relativedelta.relativedelta(row['leave_date'], row['reg_date']).months, axis=1)
test
Out[29]:
leave_date reg_date real_diff
0 2017-07-01 2017-03-01 4
1 2017-06-01 2016-09-01 9
To get your result you can use relativedelta from dateutil:
import datetime
from dateutil import relativedelta
a = datetime.datetime(2016, 12, 1)
b = datetime.datetime(2017, 5, 1)
relativedelta.relativedelta(b, a).months
#5
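Note that the .months attribute of a relativedelta holds only the months component; for spans of a year or more the remainder is carried in .years. If whole calendar months are all that is needed, one option is plain year and month arithmetic on the columns, which is also vectorized. A sketch that ignores the day of the month; the third row is made up to show a span longer than a year:

```python
from datetime import datetime
import pandas as pd

test = pd.DataFrame({
    'reg_date': [datetime(2017, 3, 1), datetime(2016, 9, 1), datetime(2015, 1, 1)],
    'leave_date': [datetime(2017, 7, 1), datetime(2017, 6, 1), datetime(2017, 6, 1)],
})

# 12 months per year of difference, plus the difference in month numbers
test['diff_month'] = (
    (test['leave_date'].dt.year - test['reg_date'].dt.year) * 12
    + (test['leave_date'].dt.month - test['reg_date'].dt.month)
)

print(test)
```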
I need to calculate the difference in hours between two dates (format: year-month-dayTHH:MM:SS; I could also transform the data to year-month-day HH:MM:SS) from a huge Excel file. What is the most efficient way to do it in Python? I have tried a datetime/time object (TypeError: expected string or buffer), Timestamp (ValueError) and DataFrame (does not give the result in hours).
Excel File:
Order_Date Received_Customer Column3
2000-10-06T13:00:58 2000-11-06T13:00:58 1
2000-10-21T15:40:15 2000-12-27T10:09:29 2
2000-10-23T10:09:29 2000-10-26T10:09:29 3
..... ....
datetime/time object code (TypeError: expected string or buffer):
import pandas as pd
import time as t
data=pd.read_excel('/path/file.xlsx')
s1 = (data,['Order_Date'])
s2 = (data,['Received_Customer'])
s1Time = t.strptime(s1, "%Y:%m:%d:%H:%M:%S")
s2Time = t.strptime(s2, "%Y:%m:%d:%H:%M:%S")
deltaInHours = (t.mktime(s2Time) - t.mktime(s1Time))
print deltaInHours, "hours"
Timestamp (ValueError) code:
import pandas as pd
import datetime as dt
data=pd.read_excel('/path/file.xlsx')
df = pd.DataFrame(data,columns=['Order_Date','Received_Customer'])
df.to = [pd.Timestamp('Order_Date')]
df.fr = [pd.Timestamp('Received_Customer')]
(df.fr-df.to).astype('timedelta64[h]')
DataFrame (does not return the desired result)
import pandas as pd
data=pd.read_excel('/path/file.xlsx')
df = pd.DataFrame(data,columns=['Order_Date','Received_Customer'])
df['Order_Date'] = pd.to_datetime(df['Order_Date'])
df['Received_Customer'] = pd.to_datetime(df['Received_Customer'])
answer = df.dropna()['Order_Date'] - df.dropna()['Received_Customer']
answer.astype('timedelta64[h]')
print(answer)
Output:
0 24 days 16:38:07
1 0 days 00:00:00
2 20 days 12:39:52
dtype: timedelta64[ns]
Should be something like this:
0 592 hour
1 0 hour
2 492 hour
Is there another way to convert timedelta64[ns] into hours than answer.astype('timedelta64[h]')?
In each of your attempts you mixed up data types and methods. I don't have the time to explain each mistake explicitly, but I want to help by providing a (probably non-optimal) solution.
I built the solution out of your previous tries and I combined it with knowledge from other questions such as:
Convert a timedelta to days, hours and minutes
Get total number of hours from a Pandas Timedelta?
Note that I used Python 3. I hope that my solution guides your way. My solution is this one:
import pandas as pd
from datetime import datetime
import numpy as np
data = pd.read_excel('C:\\Users\\nrieble\\Desktop\\check.xlsx',header=0)
# skip empty or too-short cells, convert the rest to Timestamps
start = [pd.to_datetime(e) for e in data['Order_Date'] if len(str(e))>4]
end = [pd.to_datetime(e) for e in data['Received_Customer'] if len(str(e))>4]
# element-wise difference: received minus ordered
delta = np.asarray(end)-np.asarray(start)
deltainhours = [e/np.timedelta64(1, 'h') for e in delta]
print (deltainhours, "hours")
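The same conversion can also be done directly on the columns without Python lists. A minimal sketch, assuming the columns parse cleanly with pd.to_datetime; the frame below is built from the three sample rows in the question rather than read from the Excel file, and the hours column name is illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    'Order_Date': ['2000-10-06T13:00:58', '2000-10-21T15:40:15',
                   '2000-10-23T10:09:29'],
    'Received_Customer': ['2000-11-06T13:00:58', '2000-12-27T10:09:29',
                          '2000-10-26T10:09:29'],
})

df['Order_Date'] = pd.to_datetime(df['Order_Date'])
df['Received_Customer'] = pd.to_datetime(df['Received_Customer'])

# timedelta64 Series: received minus ordered
delta = df['Received_Customer'] - df['Order_Date']

# Total duration expressed in (fractional) hours
df['hours'] = delta.dt.total_seconds() / 3600

print(df)
```

Dividing the total seconds by 3600 sidesteps astype('timedelta64[h]'), whose behaviour has varied between pandas versions.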