I have a requirement where I need to find the number of days between two dates: the first date is fixed, and the second date comes from a DataFrame column.
I have to subtract the values present in the DataFrame from 24th Feb.
import datetime as dt
from datetime import date

past_2_month = date.today()

def to_integer(dt_time):
    return 1 * dt_time.month

past_2_month = to_integer(past_2_month)
past_2_month_num = past_2_month - 2
year = date.today().year  # assumed: `year` was not defined in the original snippet
day = 24
date_2 = dt.date(year, past_2_month_num, day)
date_2
Output of the above code: datetime.date(2022, 2, 24)
The other values present in the DataFrame are below:
import numpy as np
import pandas as pd

dict_1 = {'Col1': ['2017-05-01', np.nan, '2017-11-01', np.nan, '2016-10-01']}
a = pd.DataFrame(dict_1)
How can I subtract these two values so that I get the difference in days between them?
If you need the number of days between a datetime column and its values shifted by 2 months, use pd.DateOffset and convert the timedeltas to days with Series.dt.days:
a['Col1'] = pd.to_datetime(a['Col1'])
a['new'] = (a['Col1'] - (a['Col1'] - pd.DateOffset(months=2))).dt.days
print(a)
Col1 new
0 2017-05-01 61.0
1 NaT NaN
2 2017-11-01 61.0
3 NaT NaN
4 2016-10-01 61.0
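Note that 61 is not a constant here: DateOffset(months=2) shifts by calendar months, so the day count depends on the lengths of the two months spanned. For example:
ts = pd.Timestamp('2017-03-01')
# Two calendar months back is 2017-01-01: 31 (Jan) + 28 (Feb) = 59 days
print((ts - (ts - pd.DateOffset(months=2))).days)  # 59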
If you need the difference from another datetime, the solution is simpler: subtract and convert the values to days:
a['Col1'] = pd.to_datetime(a['Col1'])
a['new'] = (pd.to_datetime('2022-02-24') - a['Col1']).dt.days
print(a)
Col1 new
0 2017-05-01 1760.0
1 NaT NaN
2 2017-11-01 1576.0
3 NaT NaN
4 2016-10-01 1972.0
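If you prefer whole days instead of floats despite the NaNs, one option (assuming pandas >= 1.0 for the nullable Int64 dtype) is:
a['new'] = a['new'].astype('Int64')  # keeps missing values as <NA> while storing integers
print(a)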
Let's say I have a pandas df with a Date column (datetime64[ns]):
Date rows_num
0 2020-01-01 NaN
1 2020-02-25 NaN
2 2020-04-23 NaN
3 2020-06-28 NaN
4 2020-08-17 NaN
5 2020-10-11 NaN
6 2020-12-06 NaN
7 2021-01-26 7.0
8 2021-03-17 7.0
I want to get a column (rows_num in the above example) with the number of rows I need to claw back to find the current row date minus 365 days (1 year before).
So, in the above example, for index 7 (date 2021-01-26) I want to know how many rows back I have to go to find the date 2020-01-26.
If a perfect match is not available (like in the example df), I should reference the closest available date (or the closest smaller/larger date: it doesn't really matter in my case).
Any idea? Thanks
Edited to reflect OP's original question. I created a demo dataframe, then a row_count column to hold the result. Then, for each row, I build a filter that grabs all rows between that row's date and 365 days later; the shape[0] of that filtered dataframe is the number of rows in the window, and we write it into the appropriate field of the df.
# Import packages
import pandas as pd
from datetime import timedelta

# Create a sample dataframe
df = pd.DataFrame({'num_posts': [4, 6, 3, 9, 1, 14, 2, 5, 7, 2],
                   'date': ['2020-08-09', '2020-08-25', '2020-09-05',
                            '2020-09-12', '2020-09-29', '2020-10-15',
                            '2020-11-21', '2020-12-02', '2020-12-10',
                            '2020-12-18']})

# Create the column for the row count
df.insert(2, "row_count", 0)

# Convert the date to datetime64
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')

for row in range(len(df['date'])):
    start_date = df['date'].iloc[row]
    end_date = start_date + timedelta(days=365)  # end date for the filter
    # Filter data between the two dates
    filtered_df = df.loc[(df['date'] >= start_date) & (df['date'] < end_date)]
    # Fill row_count with the number of rows returned by the filter
    # (df.loc avoids the chained-assignment pitfall of df['row_count'][row])
    df.loc[row, 'row_count'] = filtered_df.shape[0]
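A loop-free sketch of the same forward-window count, assuming df['date'] is sorted ascending (true for the demo frame):
import numpy as np

# End of each row's 365-day window
ends = df['date'] + timedelta(days=365)
# Position of the first date >= each window end, minus the row's own position
df['row_count'] = np.searchsorted(df['date'].values, ends.values, side='left') - np.arange(len(df))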
You can use pd.merge_asof, which performs the exact nearest-match lookup you describe. You can even choose to use backward (smaller), forward (larger), or nearest search types.
# setup
from io import StringIO

import pandas as pd

text = StringIO(
"""
Date
2020-01-01
2020-02-25
2020-04-23
2020-06-28
2020-08-17
2020-10-11
2020-12-06
2021-01-26
2021-03-17
"""
)
data = pd.read_csv(text, delim_whitespace=True, parse_dates=["Date"])
# calculate the reference date from 1 year (365 days) ago
one_year_ago = data["Date"] - pd.Timedelta("365D")
# we only care about the index values for the original and matched dates
merged = pd.merge_asof(
one_year_ago.reset_index(),
data.reset_index(),
on="Date",
suffixes=("_original", "_matched"),
direction="backward",
)
data["rows_num"] = merged["index_original"] - merged["index_matched"]
Result:
Date rows_num
0 2020-01-01 NaN
1 2020-02-25 NaN
2 2020-04-23 NaN
3 2020-06-28 NaN
4 2020-08-17 NaN
5 2020-10-11 NaN
6 2020-12-06 NaN
7 2021-01-26 7.0
8 2021-03-17 7.0
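Only the direction argument needs to change if you prefer the closest later date or the closest date overall:
merged = pd.merge_asof(
    one_year_ago.reset_index(),
    data.reset_index(),
    on="Date",
    suffixes=("_original", "_matched"),
    direction="nearest",  # or "forward" for the closest later date
)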
I'm looking to add a %Y%m%d date column to my dataframe using a period column that has integers 1-32, which represent monthly data points starting at a defined environment variable "odate" (e.g. if odate=20190531 then period 1 should be 20190531, period 2 should be 20190630, etc.)
I tried defining a dictionary with the period numbers in the column as the keys and odate + MonthEnd(period - 1) as the values.
This works fine and well; however, I want the code to be flexible to changes in the number of periods.
Is there a function that will let me fill the date column with odate for period 1 and the subsequent month ends for the subsequent periods?
example dataset:
odate=20190531
period value
1 5.5
2 5
4 6.2
3 5
5 40
11 5
desired dataset:
odate=20190531
period value date
1 5.5 2019-05-31
2 5 2019-06-30
4 6.2 2019-08-31
3 5 2019-07-31
5 40 2019-09-30
11 5 2020-03-31
You can use pd.date_range():
pd.date_range(start = '2019-05-31', periods = 100,freq='M')
You can change the total periods depending on what you need; freq='M' means month-end frequency.
Here is a list of Offset Aliases you can use for the freq parameter.
If you just want to add or subtract some period to a date you can use pd.DateOffset:
odate = pd.Timestamp('20191031')
odate
>> Timestamp('2019-10-31 00:00:00')
odate - pd.DateOffset(months=4)
>> Timestamp('2019-06-30 00:00:00')
odate + pd.DateOffset(months=4)
>> Timestamp('2020-02-29 00:00:00')
To map the given period column to month ends:
odate = pd.Timestamp('20190531')
df['date'] = df.period.apply(lambda x: odate + pd.offsets.MonthEnd(x-1))
df
period value date
0 1 5.5 2019-05-31
1 2 5.0 2019-06-30
2 4 6.2 2019-08-31
3 3 5.0 2019-07-31
4 5 40.0 2019-09-30
5 11 5.0 2020-03-31
To improve performance, use a list comprehension:
df['date'] = [odate + pd.offsets.MonthEnd(period-1) for period in df.period]
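The two ideas also combine: a sketch that builds the month-end range once and indexes into it by period (assuming periods start at 1 and odate itself falls on a month end):
# One month end per possible period, starting at odate
rng = pd.date_range(odate, periods=df['period'].max(), freq='M')
df['date'] = rng[df['period'] - 1]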
How can I compare each row's "Price" value with the next 2 rows? I want to run a function for every row: if the current price is lower than the price in both of the following 2 hours, assign "Low" to the current row's "Action" column. If the current price is higher than in both of the following 2 hours, assign "High". If the current price is neither the highest nor the lowest of the 3 hours compared, assign "Hold".
So how can I take the Price from each row and compare it to the following 2 rows with Pandas? The dataframe looks like this:
data.head()
Date Time Price Month Hour Action
0 2018-01-01 0 2633 January 1 NaN
1 2018-01-01 1 2643 January 2 NaN
2 2018-01-01 2 2610 January 3 NaN
3 2018-01-01 3 2470 January 4 NaN
4 2018-01-01 4 2474 January 5 NaN
The desired output in this case would look like this:
data.head()
Date Time Price Month Hour Action
0 2018-01-01 0 2633 January 1 Hold
1 2018-01-01 1 2643 January 2 High
2 2018-01-01 2 2610 January 3 High
3 2018-01-01 3 2470 January 4 Low
4 2018-01-01 4 2474 January 5 Hold
Thank you.
edit: this could probably be done easily with for loops, but I'm sure pandas has a better way to do it
You can use data['Price'].shift(-1) to put the next row's price on the current row, and data['Price'].shift(-2) for the price 2 periods ahead.
Next, you can use boolean indexing with .loc to select the rows where the next two prices are higher or lower than the current price and fill in the desired value.
See below how this is done:
# Current price is lower than both of the next 2 rows: assign 'Low' to 'Action'
data.loc[(data['Price'].shift(-2) > data['Price']) & (data['Price'].shift(-1) > data['Price']), 'Action'] = 'Low'
# Current price is higher than both of the next 2 rows: assign 'High' to 'Action'
data.loc[(data['Price'].shift(-2) < data['Price']) & (data['Price'].shift(-1) < data['Price']), 'Action'] = 'High'
# fill the rest of the rows with the value Hold
data['Action'] = data['Action'].fillna('Hold')
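One caveat: shift(-1) and shift(-2) are NaN at the tail, so both comparisons are False there and the last two rows simply fall through to 'Hold'. If you would rather flag them explicitly, replace the final fillna step with:
# Flag rows that lack two rows of lookahead, then fill the rest with 'Hold'
data.loc[data['Action'].isna() & data['Price'].shift(-2).isna(), 'Action'] = 'Unknown'
data['Action'] = data['Action'].fillna('Hold')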
We can write conditions for this and choose values based on them with np.select. In the conditions we use .shift, which compares the current row to the next two rows.
Note: the last two rows will return Unknown, since we don't have two more rows of data to compare with. Which makes sense.
# Print the extended dataframe which is used
print(df)
Date Time Price Month Hour Action
0 2018-01-01 0 2633 January 1 NaN
1 2018-01-01 1 2643 January 2 NaN
2 2018-01-01 2 2610 January 3 NaN
3 2018-01-01 3 2470 January 4 NaN
4 2018-01-01 4 2474 January 5 NaN
5 2018-01-01 5 2475 January 6 NaN
6 2018-01-01 6 2471 January 7 NaN
Define the conditions and choices, and apply np.select:
conditions = [
(df['Price'] > df['Price'].shift(-1)) & (df['Price'] > df['Price'].shift(-2)),
((df['Price'].between(df['Price'].shift(-1), df['Price'].shift(-2))) | (df['Price'].between(df['Price'].shift(-2), df['Price'].shift(-1)))),
(df['Price'] < df['Price'].shift(-1)) & (df['Price'] < df['Price'].shift(-2)),
]
choices = ['High', 'Hold', 'Low']
df['Action'] = np.select(conditions, choices, default='Unknown')
print(df)
Date Time Price Month Hour Action
0 2018-01-01 0 2633 January 1 Hold
1 2018-01-01 1 2643 January 2 High
2 2018-01-01 2 2610 January 3 High
3 2018-01-01 3 2470 January 4 Low
4 2018-01-01 4 2474 January 5 Hold
5 2018-01-01 5 2475 January 6 Unknown
6 2018-01-01 6 2471 January 7 Unknown
I started by creating the source DataFrame, a bit longer than your head:
df = pd.DataFrame(data=[['2018-01-01', 0, 2633, 'January', 1],
                        ['2018-01-01', 1, 2643, 'January', 2],
                        ['2018-01-01', 2, 2610, 'January', 3],
                        ['2018-01-01', 3, 2470, 'January', 4],
                        ['2018-01-01', 4, 2474, 'January', 5],
                        ['2018-01-01', 5, 2475, 'January', 6]],
                  columns=['Date', 'Time', 'Price', 'Month', 'Hour'])
df
The first step is to compute 2 auxiliary columns: P1, the price difference to the next hour, and P2, the difference to the price 2 hours ahead:
df['P1'] = df.Price.diff(-1).fillna(0, downcast='infer')
df['P2'] = df.Price.diff(-2).fillna(0, downcast='infer')
Then we need a function to be applied to each row:
def fn(row):
    if row.P1 < 0 and row.P2 < 0:
        return 'Low'
    elif row.P1 > 0 and row.P2 > 0:
        return 'High'
    else:
        return 'Hold'
And the last step is to compute the new column (applying the above function)
and delete the auxiliary columns:
df['Action'] = df.apply(fn, axis=1)
df.drop(['P1', 'P2'], axis=1, inplace=True)
I am looking to determine the count of string values in a column across a 3-month data sample. Samples were taken at random times throughout each day. I can group the data by hour, but I require the fidelity of 30-minute intervals (e.g. 0500-0530, 0530-0600) on roughly 10k rows of data.
An example of the data:
datetime stringvalues
2018-06-06 17:00 A
2018-06-07 17:30 B
2018-06-07 17:33 A
2018-06-08 19:00 B
2018-06-09 05:27 A
I have tried setting the datetime column as the index, but I cannot figure out how to group the data on anything other than 'hour', and I don't have fidelity on the string value count:
df['datetime'] = pd.to_datetime(df['datetime'])
df.index = df['datetime']
df.groupby(df.index.hour).count()
Which returns an output similar to:
datetime stringvalues
datetime
5 0 0
6 2 2
7 5 5
8 1 1
...
I researched multi-indexing and resampling at some length over the past two days, but I have been unable to find a similar question. The desired result would look something like this:
datetime A B
0500 1 2
0530 3 5
0600 4 6
0630 2 0
....
There is no straightforward way to do a TimeGrouper on the time component, so we do this in two steps:
v = (df.groupby([pd.Grouper(key='datetime', freq='30min'), 'stringvalues'])
.size()
.unstack(fill_value=0))
v.groupby(v.index.time).sum()
stringvalues A B
05:00:00 1 0
17:00:00 1 0
17:30:00 1 1
19:00:00 0 1
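If you also need every half-hour slot of the day to appear, including empty ones (as in the desired 0500/0530/... output), one hedged sketch reindexes the result onto the full grid of 48 slots:
import datetime as dt

counts = v.groupby(v.index.time).sum()
# All 48 half-hour slots of a day, as datetime.time objects
slots = [(dt.datetime(2000, 1, 1) + dt.timedelta(minutes=30 * i)).time() for i in range(48)]
counts = counts.reindex(slots, fill_value=0)  # absent slots become 0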
I have a pandas dataframe with columns 'Date' and 'Skew' (float). I want to average the skew values between every Tuesday and store the results in a list or dataframe. I tried using a lambda as in this question (Pandas, groupby and summing over specific months), but it only helps to sum over a particular week; I cannot go across weeks, i.e. from one Tuesday to another. Can you show how to do this?
Here's an example with random data
import numpy as np
import pandas as pd

df = pd.DataFrame({'Date': pd.date_range('20130101', periods=100),
                   'Skew': 10 + np.random.randn(100)})

# Shift min_date back to the Tuesday of the first week
min_date = df.Date.min()
start = min_date.dayofweek
if start < 1:
    min_date = min_date - np.timedelta64(6 + start, 'D')
elif start > 1:
    min_date = min_date - np.timedelta64(start - 1, 'D')

# Group by the week number relative to that Tuesday
df.groupby((df.Date - min_date).dt.days // 7)[['Skew']].mean()
Input:
>>> df
Date Skew
0 2013-01-01 10.082080
1 2013-01-02 10.907402
2 2013-01-03 8.485768
3 2013-01-04 9.221740
4 2013-01-05 10.137910
5 2013-01-06 9.084963
6 2013-01-07 9.457736
7 2013-01-08 10.092777
Output:
Skew
Date
0 9.625371
1 9.993275
2 10.041077
3 9.837709
4 9.901311
5 9.985390
6 10.123757
7 9.782892
8 9.889291
9 9.853204
10 10.190098
11 10.594125
12 10.012265
13 9.278008
14 10.530251
Logic: find each row's week number relative to the first week's Tuesday, group by that week number, and take each group's mean.
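An alternative sketch uses pandas' anchored weekly frequency: 'W-MON' bins end on Monday, so each bin spans Tuesday through the following Monday:
# Weekly mean of Skew over Tuesday-to-Monday windows
weekly_mean = df.set_index('Date')['Skew'].resample('W-MON').mean()
print(weekly_mean)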