Is there a better way than bdate_range() to measure business days between two columns of dates via pandas?
import pandas as pd
df = pd.DataFrame({'A': ['1/1/2013', '2/2/2013', '3/3/2013'],
                   'B': ['1/12/2013', '4/4/2013', '3/3/2013']})
print(df)
df['A'] = pd.to_datetime(df['A'])
df['B'] = pd.to_datetime(df['B'])
# Count the business days in the inclusive range A..B for each row
f = lambda x: len(pd.bdate_range(x['A'], x['B']))
df['DIFF'] = df.apply(f, axis=1)
print(df)
With output of:
A B
0 1/1/2013 1/12/2013
1 2/2/2013 4/4/2013
2 3/3/2013 3/3/2013
A B DIFF
0 2013-01-01 00:00:00 2013-01-12 00:00:00 9
1 2013-02-02 00:00:00 2013-04-04 00:00:00 44
2 2013-03-03 00:00:00 2013-03-03 00:00:00 0
Thanks!
brian_the_bungler was onto the most efficient way of doing this using numpy's busday_count:
import numpy as np
A = [d.date() for d in df['A']]
B = [d.date() for d in df['B']]
df['DIFF'] = np.busday_count(A, B)
print(df)
On my machine this is 300x faster on your test case, and thousands of times faster on much larger arrays of dates.
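One detail worth checking when switching methods: np.busday_count excludes the end date, while bdate_range includes both endpoints, so the two counts can differ by one. The sketch below rebuilds the question's frame, avoids the Python-level list comprehension by casting to datetime64[D], and shows both conventions. The explicit format string and the DIFF_INCL column name are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": ["1/1/2013", "2/2/2013", "3/3/2013"],
                   "B": ["1/12/2013", "4/4/2013", "3/3/2013"]})
# Assumption: the strings are month/day/year, as in the question's output
df["A"] = pd.to_datetime(df["A"], format="%m/%d/%Y")
df["B"] = pd.to_datetime(df["B"], format="%m/%d/%Y")

# busday_count needs day-resolution dates, so cast the underlying arrays
a = df["A"].values.astype("datetime64[D]")
b = df["B"].values.astype("datetime64[D]")

# End date excluded (numpy's convention)
df["DIFF"] = np.busday_count(a, b)
# End date included, matching len(pd.bdate_range(A, B))
df["DIFF_INCL"] = np.busday_count(a, b + np.timedelta64(1, "D"))
print(df)
```

For the second row the exclusive count is 43 and the inclusive count is 44, because 4 April 2013 is itself a business day.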
You can use pandas' BDay offset to shift a date by a number of business days, like this:
new_column = some_date - pd.tseries.offsets.BDay(15)
Read more in this conversation: https://stackoverflow.com/a/44288696
It also works if some_date is a single date value, not a series.
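A small sketch of that offset on both a single date and a Series. The dates here are made up for illustration; note that BDay only skips weekends and knows nothing about public holidays.

```python
import pandas as pd
from pandas.tseries.offsets import BDay

# Single value: Tuesday 2013-01-22 minus 15 business days (three working weeks)
some_date = pd.Timestamp("2013-01-22")
shifted = some_date - BDay(15)

# The same offset applied to a whole Series of dates
dates = pd.Series(pd.to_datetime(["2013-01-22", "2013-01-25"]))
shifted_series = dates - BDay(15)

print(shifted)
print(shifted_series)
```

This shifts dates rather than counting days, so it answers "which date is N business days away" rather than "how many business days lie between two dates".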
Related
I have DataFrame like below:
df = pd.DataFrame({"ID" : ["1", "2", "3"],
"Date" : ["12/11/2020", "12/10/2020", "05/04/2020"]})
And I need to calculate the number of months from the Date column until today. The result I need is shown below:
You can adapt this solution to subtract from a scalar d:
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
d = pd.to_datetime('now')
df['Amount'] = 12 * (d.year - df['Date'].dt.year) + d.month - df['Date'].dt.month
print (df)
ID Date Amount
0 1 2020-11-12 1
1 2 2020-10-12 2
2 3 2020-04-05 8
Try this code, which subtracts the 'Date' column from the current time. I also use np.ceil to round the result up:
df['Date'] = pd.to_datetime(df['Date'])
df['Amount'] = ((pd.to_datetime('now') - df['Date']) / np.timedelta64(1, 'M')).apply(np.ceil)
print(df)
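Month-unit timedeltas such as np.timedelta64(1, 'M') are not accepted by every pandas version, so here is a hedged variant of the same idea that divides the elapsed days by an average month length instead. The fixed `now` value, the 365.25 / 12 constant, and the explicit day-first format are assumptions chosen to make the example reproducible; they are not part of the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"ID": ["1", "2", "3"],
                   "Date": ["12/11/2020", "12/10/2020", "05/04/2020"]})
# Assumption: the strings are day/month/year
df["Date"] = pd.to_datetime(df["Date"], format="%d/%m/%Y")

# Assumption: a fixed reference date instead of "now", for repeatable output
now = pd.Timestamp("2020-12-15")

# Average month length in days; an approximation, not a calendar-exact value
AVG_DAYS_PER_MONTH = 365.25 / 12

elapsed_days = (now - df["Date"]).dt.days
# Round partial months up, as np.ceil does in the answer above
df["Amount"] = np.ceil(elapsed_days / AVG_DAYS_PER_MONTH).astype(int)
print(df)
```

Because partial months are rounded up, this gives 2, 3 and 9 here, one more than the calendar-month arithmetic in the first answer.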
from datetime import datetime
import pandas as pd
import numpy as np
df = pd.DataFrame({"ID" : ["1", "2", "3"],
"Date" : ["12/11/2020", "12/10/2020", "05/04/2020"]})
df['Month_diff'] = round(((datetime.now() - pd.to_datetime(df.Date,infer_datetime_format=True,dayfirst=True))/np.timedelta64(1, 'M'))-0.5)
This is a one-liner that converts the Date column to datetime and then performs the calculation. Output:
ID Date Month_diff
0 1 12/11/2020 1.0
1 2 12/10/2020 2.0
2 3 05/04/2020 8.0
My data frame has 6 columns of dates which I want to combine into 1 column.
DATA FRAME IMAGE HERE
The code I used to make the new column is below:
df['Mega'] = df['Mega'].append(df['RsWeeks','RsMonths','RsDays','PsWeeks','PsMonths','PsDays'])
I am new to Python and pandas and would like to learn more, so please point me to sources too. I am really bad at debugging, as I have no programming background.
The pandas documentation is a great source of good examples and visuals.
For your particular case:
We construct a sample DataFrame:
import pandas as pd
df = pd.DataFrame([
{"RsWeeks": "2015-11-10", "RsMonths": "2016-08-01"},
{"RsWeeks": "2015-11-11", "RsMonths": "2015-12-30"}
])
print("DataFrame preview:")
print(df)
Output:
DataFrame preview:
RsWeeks RsMonths
0 2015-11-10 2016-08-01
1 2015-11-11 2015-12-30
We concatenate the columns RsWeeks and RsMonths to create a Series:
my_series = pd.concat([df["RsWeeks"], df["RsMonths"]], ignore_index=True)
print("\nSeries preview:")
print(my_series)
Output:
Series preview:
0 2015-11-10
1 2015-11-11
2 2016-08-01
3 2015-12-30
Edit
If you really need to add the new Series as a column to your DataFrame, you can do the following:
df2 = pd.DataFrame({"Mega": my_series})
df = pd.concat([df, df2], axis=1)
print("\nDataFrame preview:")
print(df)
Output:
DataFrame preview:
RsWeeks RsMonths Mega
0 2015-11-10 2016-08-01 2015-11-10
1 2015-11-11 2015-12-30 2015-11-11
2 NaN NaN 2016-08-01
3 NaN NaN 2015-12-30
Data:
df = pd.DataFrame({"name" : 'Dav Las Oms'.split(),
'age' : [25, 50, 70]})
df['Name'] = list(['a', 'M', 'm'])
df:
name age Name
0 Dav 25 a
1 Las 50 M
2 Oms 70 m
df = pd.DataFrame(df.astype(str).apply('|'.join, axis=1))
df:
0
0 Dav|25|a
1 Las|50|M
2 Oms|70|m
You can use pd.melt(), which reshapes your DataFrame from wide to long:
df_reshaped = pd.melt(df, id_vars = ['id_1','id_2','id_3'], var_name = 'new_name', value_name = 'Mega')
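As a concrete sketch of the melt approach, here it is applied to the two-column sample frame from the earlier answer. The `source` column name is an assumption for illustration, and `id_vars` is omitted because the sample has no identifier columns.

```python
import pandas as pd

df = pd.DataFrame({"RsWeeks": ["2015-11-10", "2015-11-11"],
                   "RsMonths": ["2016-08-01", "2015-12-30"]})

# Stack the listed columns into a single "Mega" column;
# "source" records which original column each value came from
long_df = pd.melt(df,
                  value_vars=["RsWeeks", "RsMonths"],
                  var_name="source",
                  value_name="Mega")
print(long_df)
```

With all six date columns, the same call would list each of them in `value_vars`.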
There is a huge dataframe containing multiple data types in different columns. I want to find rows that contain date values in different columns.
Here a test dataframe:
dt = pd.Series(['abc', datetime.now(), 12, '', None, np.nan, '2020-05-05'])
dt1 = pd.Series([3, datetime.now(), 'sam', '', np.nan, 'abc-123', '2020-05-25'])
dt3 = pd.Series([1,2,3,4,5,6,7])
df = pd.DataFrame({"A":dt.values, "B":dt1.values, "C":dt3.values})
Now, I want to create a new dataframe that contains only the rows with dates in both columns A and B, here the 2nd and last rows.
Expected output:
A B C
1 2020-06-01 16:58:17.274311 2020-06-01 17:13:20.391394 2
6 2020-05-05 2020-05-25 7
What is the best way to do that? Thanks.
P.S.> Dates can be in any standard format.
Use:
m = df[['A', 'B']].transform(pd.to_datetime, errors='coerce').isna().any(axis=1)
df = df[~m]
Result:
# print(df)
A B C
1 2020-06-01 17:54:16.377722 2020-06-01 17:54:16.378432 2
6 2020-05-05 2020-05-25 7
To test only columns A and B, use boolean indexing with DataFrame.notna and DataFrame.all to keep the rows where both values parse as datetimes:
df = df[df[['A','B']].apply(pd.to_datetime, errors='coerce').notna().all(axis=1)]
print (df)
A B C
1 2020-06-01 16:14:35.020855 2020-06-01 16:14:35.021855 2
6 2020-05-05 2020-05-25 7
import numpy as np
import pandas as pd
from datetime import datetime
dt = pd.Series(['abc', datetime.now(), 12, '', None, np.nan, '2020-05-05'])
dt1 = pd.Series([3, datetime.now(), 'sam', '', np.nan, 'abc-123', '2020-05-25'])
dt3 = pd.Series([1,2,3,4,5,6,7])
df = pd.DataFrame({"A":dt.values, "B":dt1.values, "C":dt3.values})
# Flag rows where either column fails to parse as a date
m = pd.concat([pd.to_datetime(df['A'], errors='coerce'),
               pd.to_datetime(df['B'], errors='coerce')], axis=1).isna().any(axis=1)
print(df[~m])
Prints:
A B C
1 2020-06-01 12:17:51.320286 2020-06-01 12:17:51.320826 2
6 2020-05-05 2020-05-25 7
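A stricter variant, offered as a sketch: pd.to_datetime can also interpret plain numbers as timestamps, so if values like 12 or 3 should never count as dates, each cell can be checked by type first. The helper name `is_date_like` and the fixed timestamps are hypothetical, introduced only for this example.

```python
from datetime import datetime

import numpy as np
import pandas as pd

def is_date_like(value):
    """Hypothetical helper: True for datetime objects and parseable date strings."""
    if isinstance(value, str):
        # Empty strings are rejected; other strings must parse as a date
        return value.strip() != "" and bool(pd.notna(pd.to_datetime(value, errors="coerce")))
    # Numbers, None and NaN are rejected; only real datetime objects pass
    return isinstance(value, (datetime, pd.Timestamp)) and bool(pd.notna(value))

# Same shape as the question's test data, with fixed timestamps instead of now()
a = ["abc", datetime(2020, 6, 1, 16, 58), 12, "", None, np.nan, "2020-05-05"]
b = [3, datetime(2020, 6, 1, 17, 13), "sam", "", np.nan, "abc-123", "2020-05-25"]
df = pd.DataFrame({"A": a, "B": b, "C": range(1, 8)})

# Keep rows where both A and B hold a date-like value
mask = df[["A", "B"]].apply(lambda col: col.map(is_date_like)).all(axis=1)
result = df[mask]
print(result)
```

This checks cells one at a time, so it will be slower than the vectorized answers on a huge frame; it trades speed for explicit control over what counts as a date.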
I have a dataframe with let's say 2 columns: dates and doubles
2017-05-01 2.5
2017-05-02 3.5
... ...
2017-05-17 0.2
2017-05-18 2.5
Now I would like to do a groupby and sum over every x rows. For example, with 6 rows it would return:
2017-05-06 15.6
2017-05-12 13.4
2017-05-18 18.0
Is there a clean solution to do this without running it through a for-loop with something like this:
temp = pd.DataFrame()
j = 0
for i in range(0,len(df.index),6):
temp[df.ix[i]['date']] = df.ix[i:i+6]['value'].sum()
I guess you are looking for resample. Consider this dataframe:
rng = pd.date_range('2017-05-01', periods=18, freq='D')
num = np.random.randint(5,size = 18)
df = pd.DataFrame({'date': rng, 'val': num})
df.resample('6D', on = 'date').sum().reset_index()
will return
date val
0 2017-05-01 14
1 2017-05-07 11
2 2017-05-13 16
This is an alternative solution that groups by consecutive blocks of rows, using integer division of the row positions.
For two columns, use agg with one aggregation per column:
df.groupby(np.arange(len(df)) // 6).agg({'date': 'first', 'value': 'sum'})
For multiple columns, you can use first (or last) for date and sum for the other columns:
group = df.groupby(np.arange(len(df))//6)
pd.concat((group['date'].first(),
group[[c for c in df.columns if c != 'date']].sum()), axis=1)
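The question's expected output labels each block with its last date rather than its first. A minimal sketch of that variant, using made-up values (0 to 17) so the sums are easy to verify:

```python
import numpy as np
import pandas as pd

rng = pd.date_range("2017-05-01", periods=18, freq="D")
# Assumption: simple increasing values instead of the question's real data
df = pd.DataFrame({"date": rng, "value": np.arange(18)})

# Rows 0-5 get block 0, rows 6-11 block 1, and so on
block_id = np.arange(len(df)) // 6

# Label each block with its last date and sum its values
out = df.groupby(block_id).agg(date=("date", "last"),
                               value=("value", "sum"))
print(out)
```

Unlike resample, this groups by row position, so it still produces blocks of exactly six rows when some calendar days are missing.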
I have a data frame with a DatetimeIndex. I would like to prompt the user for a date, and then have Python look up the previous month.
Here's an example: df is the name of the dataframe
date = input('Enter a date in YYYY-MM-DD format: ')
Enter a date in YYYY-MM-DD format: 2017-01-31
I would like Python to do something like df[date-1] and then print the result, so that I get:
2016-12-31 8.257478e+04
This is possible if the input date is already in the index, but I'm looking for a way to do it when it is not.
Any ideas? Thanks in advance.
It seems you need get_loc to get the position of the value in the index, and then iloc to select by position:
pos = df.index.get_loc(d)
print (df.iloc[[pos - 1]])
Sample:
start = pd.to_datetime('2016-11-30')
rng = pd.date_range(start, periods=10, freq='M')
df = pd.DataFrame({'a': range(10)}, index=rng)
print (df)
a
2016-11-30 0
2016-12-31 1
2017-01-31 2
2017-02-28 3
2017-03-31 4
2017-04-30 5
2017-05-31 6
2017-06-30 7
2017-07-31 8
2017-08-31 9
d = '2017-01-31'
pos = df.index.get_loc(d)
print (df.iloc[[pos - 1]])
a
2016-12-31 1
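If the date is not in the index, look up the nearest position instead. Older pandas versions accepted get_loc(d, method='nearest'); that argument was removed in pandas 2.0, so get_indexer is used here:
d = '2017-01-20'
pos = df.index.get_indexer([pd.Timestamp(d)], method='nearest')[0]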
print (df.iloc[[pos - 1]])
a
2016-12-31 1
For a more general solution, you have to add some conditions, for example:
d = '2017-11-30'
pos = df.index.get_indexer([pd.Timestamp(d)], method='nearest')[0]
if pos == 0:
    print ('Value less or same as minimal date in DatetimeIndex')
else:
    print ('Value nearest less or same as date', df.index[pos])
    print ('Previous value', df.iloc[[pos - 1]])
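Another way to get "the last row strictly before the given date" without a nearest-match lookup is searchsorted, sketched below. It assumes the index is sorted in ascending order; the helper name `previous_row` and the short sample index are hypothetical.

```python
import pandas as pd

# Assumption: a short month-end index, sorted ascending
idx = pd.to_datetime(["2016-11-30", "2016-12-31", "2017-01-31",
                      "2017-02-28", "2017-03-31"])
df = pd.DataFrame({"a": range(5)}, index=idx)

def previous_row(frame, date):
    """Hypothetical helper: last row whose index is strictly before `date`, or None."""
    ts = pd.Timestamp(date)
    # With side="left", pos is the number of index entries strictly before ts
    pos = frame.index.searchsorted(ts, side="left")
    if pos == 0:
        return None  # nothing earlier than the requested date
    return frame.iloc[[pos - 1]]

print(previous_row(df, "2017-01-31"))  # date present in the index
print(previous_row(df, "2017-01-20"))  # date not in the index
print(previous_row(df, "2016-11-30"))  # no earlier row exists
```

The same call handles dates that are in the index and dates that are not, which avoids the separate branches needed with a nearest-match lookup.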