I am attempting to calculate seasonal means for the winter seasons DJF (Dec/Jan/Feb) and DJ (Dec/Jan). I first tried xarray's .groupby function:
ds.groupby('time.month').mean('time')
Then I realized that instead of grouping the previous year's December with the subsequent Jan/Feb, it was grouping all three months from the same year. I was then able to figure out how to solve for the DJF season by resampling and creating a function to select out the proper 3-month period:
def is_djf(month):
    return month == 12

seasonal = ds.resample(time='QS-MAR').mean()
djf = seasonal.sel(time=is_djf(seasonal['time.month']))
I am unfortunately still unsure how to solve for the Dec/Jan season, since the resampling method I used works on quarterly offsets. Thank you for any and all help!
Use resample with the QS-DEC frequency, which anchors quarters to December so each quarter covers Dec-Jan-Feb.
Suppose this dataframe:
time val
0 2020-12-31 1
1 2021-01-31 1
2 2021-02-28 1
3 2021-03-31 2
4 2021-04-30 2
5 2021-05-31 2
6 2021-06-30 3
7 2021-07-31 3
8 2021-08-31 3
9 2021-09-30 4
10 2021-10-31 4
11 2021-11-30 4
12 2021-12-31 5
13 2022-01-31 5
14 2022-02-28 5
>>> df.set_index('time').resample('QS-DEC').mean()
val
time
2020-12-01 1.0
2021-03-01 2.0
2021-06-01 3.0
2021-09-01 4.0
2021-12-01 5.0
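Since the original question is about xarray rather than pandas, here is a minimal sketch of the same idea in xarray; ds is the question's dataset, and the Dec/Jan handling via isin is an assumption on my part rather than something taken from the question:
# DJF means: 'QS-DEC' anchors quarters to December, so each quarter spans Dec-Jan-Feb.
djf = ds.resample(time='QS-DEC').mean()
djf = djf.sel(time=djf['time.month'] == 12)  # keep only the December-anchored quarters

# DJ means (Dec/Jan only): keep just December and January before resampling,
# so the December-anchored quarter averages over those two months alone.
dj = ds.sel(time=ds['time.month'].isin([12, 1]))
dj = dj.resample(time='QS-DEC').mean()
dj = dj.sel(time=dj['time.month'] == 12)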
My problem is complex and confusing, and I haven't been able to find the answer anywhere.
I basically have two dataframes: one is the price history of certain products, and the other is an invoice dataframe that contains transaction data.
Sample Data:
Price History:
product_id updated price
id
1 1 2022-01-01 5.0
2 2 2022-01-01 5.5
3 3 2022-01-01 5.7
4 1 2022-01-15 6.0
5 2 2022-01-15 6.5
6 3 2022-01-15 6.7
7 1 2022-02-01 7.0
8 2 2022-02-01 7.5
9 3 2022-02-01 7.7
Invoice:
transaction_date product_id quantity
id
1 2022-01-02 1 2
2 2022-01-02 2 3
3 2022-01-02 3 4
4 2022-01-14 1 1
5 2022-01-14 2 4
6 2022-01-14 3 2
7 2022-01-15 1 3
8 2022-01-15 2 6
9 2022-01-15 3 5
10 2022-01-16 1 3
11 2022-01-16 2 2
12 2022-01-16 3 3
13 2022-02-05 1 1
14 2022-02-05 2 4
15 2022-02-05 3 7
16 2022-05-10 1 4
17 2022-05-10 2 2
18 2022-05-10 3 1
What I am looking to achieve is to add a price column to the Invoice dataframe, based on:
The product id
Comparing the updated and transaction dates such that updated date <= transaction date for that particular record, i.e. finding the most recent price update at or before the transaction (the MAX updated date that is <= transaction date).
I managed to do this:
invoice['price'] = invoice['product_id'].map(price_history.set_index('id')['price'])
but need to incorporate the date condition now.
Expected result for sample data:
(expected result posted as an image in the original question)
Any guidance in the correct direction is appreciated, thanks
merge_asof is what you are looking for:
pd.merge_asof(
invoice,
price_history,
left_on="transaction_date",
right_on="updated",
by="product_id",
)[["transaction_date", "product_id", "quantity", "price"]]
merge_asof with the direction argument:
merged = pd.merge_asof(
left=invoice,
right=price_history,
left_on="transaction_date",
right_on="updated",
by="product_id",
direction="backward",
suffixes=("", "_y")
).drop(columns=["id_y", "updated"]).reset_index(drop=True)
print(merged)
id transaction_date product_id quantity price
0 1 2022-01-02 1 2 5.0
1 2 2022-01-02 2 3 5.5
2 3 2022-01-02 3 4 5.7
3 4 2022-01-14 1 1 5.0
4 5 2022-01-14 2 4 5.5
5 6 2022-01-14 3 2 5.7
6 7 2022-01-15 1 3 6.0
7 8 2022-01-15 2 6 6.5
8 9 2022-01-15 3 5 6.7
9 10 2022-01-16 1 3 6.0
10 11 2022-01-16 2 2 6.5
11 12 2022-01-16 3 3 6.7
12 13 2022-02-05 1 1 7.0
13 14 2022-02-05 2 4 7.5
14 15 2022-02-05 3 7 7.7
15 16 2022-05-10 1 4 7.0
16 17 2022-05-10 2 2 7.5
17 18 2022-05-10 3 1 7.7
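Note that direction="backward" is the default for merge_asof, so both answers perform the same lookup: for each transaction, the most recent price update with updated <= transaction_date.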
I have a Python pandas dataframe with 2 relevant columns, "date" and "value". Let's assume it looks like this and is ordered by date:
data = pd.DataFrame({"date": ["2021-01-01", "2021-01-31", "2021-02-01", "2021-02-28", "2021-03-01", "2021-03-31", "2021-04-01", "2021-04-02"],
"value": [1,2,3,4,5,6,5,8]})
data["date"] = pd.to_datetime(data['date'])
Now I want to join the dataframe to itself in such a way that, for each last available day of a month, I get the next available day where the value is higher. In our example this should basically look like this:
date, value, date2, value2:
2021-01-31, 2, 2021-02-01, 3
2021-02-28, 4, 2021-03-01, 5
2021-03-31, 6, 2021-04-02, 8
2021-04-02, 8, NaN, NaN
My current partial solution to this problem looks like this:
last_days = data.groupby([data.date.dt.year, data.date.dt.month]).last()
res = [data.loc[(data.date>date) & (data.value > value)][:1] for date, value in zip(last_days.date, last_days.value)]
print(res)
But because of this answer "Don't iterate over rows in a dataframe", it doesn't feel like the pandas way to me.
So the question is, how to solve it the pandas way?
If you don’t have too many rows, you could generate all pairs of items and filter from there.
Let’s start with getting the last days in the month:
>>> last = data.loc[data['date'].dt.daysinmonth == data['date'].dt.day]
>>> last
date value
1 2021-01-31 2
3 2021-02-28 4
5 2021-03-31 6
Now use a cross join to map each last day to any possible day, then filter on criteria such as later date and larger value:
>>> pairs = pd.merge(last, data, how='cross', suffixes=('', '2'))
>>> pairs = pairs.loc[pairs['date2'].gt(pairs['date']) & pairs['value2'].gt(pairs['value'])]
>>> pairs
date value date2 value2
2 2021-01-31 2 2021-02-01 3
3 2021-01-31 2 2021-02-28 4
4 2021-01-31 2 2021-03-01 5
5 2021-01-31 2 2021-03-31 6
6 2021-01-31 2 2021-04-01 5
7 2021-01-31 2 2021-04-02 8
12 2021-02-28 4 2021-03-01 5
13 2021-02-28 4 2021-03-31 6
14 2021-02-28 4 2021-04-01 5
15 2021-02-28 4 2021-04-02 8
23 2021-03-31 6 2021-04-02 8
Finally use GroupBy.idxmin() on date2 to get the first matching date2 per group:
>>> pairs.loc[pairs.groupby(['date', 'value'])['date2'].idxmin().values]
date value date2 value2
2 2021-01-31 2 2021-02-01 3
12 2021-02-28 4 2021-03-01 5
23 2021-03-31 6 2021-04-02 8
Otherwise you might want apply, which, to be entirely honest, is pretty much the same as iterating over rows.
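For completeness, a rough sketch of what that apply-based fallback could look like, reusing the last frame from above (it still scans the whole frame once per end-of-month row):
def first_larger(row):
    # First later row with a strictly larger value; data is assumed sorted by date.
    later = data.loc[(data['date'] > row['date']) & (data['value'] > row['value'])]
    if later.empty:
        return pd.Series({'date2': pd.NaT, 'value2': float('nan')})
    return pd.Series({'date2': later.iloc[0]['date'], 'value2': later.iloc[0]['value']})

result = pd.concat([last.reset_index(drop=True),
                    last.apply(first_larger, axis=1).reset_index(drop=True)], axis=1)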
First create 2 masks: one for rows immediately followed by the next calendar day (the end-of-month rows in this data) and another for the rows that follow them (the first day of the next month).
m1 = data['date'].diff(1).shift(-1) == pd.Timedelta(days=1)
m2 = m1.shift(1, fill_value=False)
Finally, concatenate the 2 results ignoring index:
>>> pd.concat([data.loc[m1].reset_index(drop=True),
data.loc[m2].reset_index(drop=True)], axis="columns")
date value date value
0 2021-01-31 2 2021-02-01 3
1 2021-02-28 4 2021-03-01 5
2 2021-03-31 6 2021-04-01 5
3 2021-04-01 5 2021-04-02 8
One option is conditional_join from pyjanitor, which uses binary search underneath and should be faster and more memory efficient than a cross merge as the data size increases. Also have a look at the piso library and see whether it can be helpful/more efficient.
Get the last dates via a groupby (the assumption here is that the data is already sorted; if not, sort it before grouping):
# pip install pyjanitor
import pandas as pd
import janitor
trim = (data
.groupby([data.date.dt.year, data.date.dt.month], as_index = False)
.nth(-1)
)
trim
date value
1 2021-01-31 2
3 2021-02-28 4
5 2021-03-31 6
7 2021-04-02 8
Use conditional_join to get rows where the value from trim is less than data, and the date from trim is less than data as well:
trimmed = trim.conditional_join(data,
# variable arguments
# tuple is of the form:
# col_from_left_df, col_from_right_df, comparator
('value', 'value', '<'),
('date', 'date', '<'),
how = 'left')
trimmed
left right
date value date value
0 2021-01-31 2 2021-02-01 3.0
1 2021-01-31 2 2021-02-28 4.0
2 2021-01-31 2 2021-03-01 5.0
3 2021-01-31 2 2021-04-01 5.0
4 2021-01-31 2 2021-03-31 6.0
5 2021-01-31 2 2021-04-02 8.0
6 2021-02-28 4 2021-03-01 5.0
7 2021-02-28 4 2021-04-01 5.0
8 2021-02-28 4 2021-03-31 6.0
9 2021-02-28 4 2021-04-02 8.0
10 2021-03-31 6 2021-04-02 8.0
11 2021-04-02 8 NaT NaN
Since the only interest is in the first match, a groupby is required.
trimmed = (trimmed
.groupby(('left', 'date'), dropna = False, as_index = False)
.nth(0)
)
trimmed
left right
date value date value
0 2021-01-31 2 2021-02-01 3.0
6 2021-02-28 4 2021-03-01 5.0
10 2021-03-31 6 2021-04-02 8.0
11 2021-04-02 8 NaT NaN
You can flatten the columns to a single level:
trimmed.set_axis(['date', 'value', 'date2', 'value2'], axis = 'columns')
date value date2 value2
0 2021-01-31 2 2021-02-01 3.0
6 2021-02-28 4 2021-03-01 5.0
10 2021-03-31 6 2021-04-02 8.0
11 2021-04-02 8 NaT NaN
A Python beginner here.
I am trying to get the highest price of a particular stock per month, and the date on which the maximum value occurred.
Getting the maximum value per month is okay using max(),
but when I try to get the corresponding dates of the max price using idxmax(), my code returns the corresponding integer index instead of the date. My code looks like this:
Max_Date = Daily_High.groupby(pd.Grouper(key="Date", freq="M")).High.idxmax()
Output
Date High
0 2020-04-30 9929
1 2020-05-31 9946
2 2020-06-30 9966
3 2020-07-31 9993
4 2020-08-31 10014
5 2020-09-30 10016
6 2020-10-31 10044
7 2020-11-30 10063
8 2020-12-31 10097
9 2021-01-31 10114
10 2021-02-28 10125
11 2021-03-31 10139
12 2021-04-30 10180
13 2021-05-31 10182
The output should be like this:
Date High Max Date
0 2020-04-30 2020-04-30
1 2020-05-31 2020-05-26
2 2020-06-30 2020-06-23
3 2020-07-31 2020-07-31
4 2020-08-31 2020-08-31
5 2020-09-30 2020-09-02
6 2020-10-31 2020-10-13
7 2020-11-30 2020-11-09
8 2020-12-31 2020-12-29
9 2021-01-31 2021-01-25
10 2021-02-28 2021-02-09
11 2021-03-31 2021-03-02
12 2021-04-30 2021-04-29
13 2021-05-31 2021-05-03
Hope you can help me to get the correct date. Thank you!
Create a DatetimeIndex and remove key="Date" from pd.Grouper:
Max_Date = Daily_High.set_index('Date').groupby(pd.Grouper(freq="M")).High.idxmax()
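A small follow-up sketch showing how that can be combined with the monthly maximum itself (Daily_High and High are the names from the question; the Max Date column name is only illustrative):
g = Daily_High.set_index('Date').groupby(pd.Grouper(freq='M'))['High']

# With a DatetimeIndex, idxmax returns the date of each month's maximum
# rather than an integer position.
summary = pd.DataFrame({'High': g.max(), 'Max Date': g.idxmax()}).reset_index()
print(summary)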
I tried to ask this question previously, but it was too ambiguous so here goes again. I am new to programming, so I am still learning how to ask questions in a useful way.
In summary, I have a pandas dataframe that resembles "INPUT DATA" that I would like to convert to "DESIRED OUTPUT", as shown below.
Each row contains an ID, a DateTime, and a Value. For each unique ID, the first row corresponds to timepoint 'zero', and each subsequent row contains a value 5 minutes following the previous row and so on.
I would like to calculate the mean across all the IDs for every 'time elapsed' timepoint. For example, in "DESIRED OUTPUT" Time Elapsed=0.0 would have the value 128.3 ((100+105+180)/3); Time Elapsed=5.0 would have the value 150.0 ((150+110+190)/3); Time Elapsed=10.0 would have the value 133.3 ((125+90+185)/3), and so on for Time Elapsed=15, 20, 25, etc.
I'm not sure how to create a new column which has the value for the time elapsed for each ID (e.g. 0.0, 5.0, 10.0 etc). I think that once I know how to do that, then I can use the groupby function to calculate the means for each time elapsed.
INPUT DATA
ID DateTime Value
1 2018-01-01 15:00:00 100
1 2018-01-01 15:05:00 150
1 2018-01-01 15:10:00 125
2 2018-02-02 13:15:00 105
2 2018-02-02 13:20:00 110
2 2018-02-02 13:25:00 90
3 2019-03-03 05:05:00 180
3 2019-03-03 05:10:00 190
3 2019-03-03 05:15:00 185
DESIRED OUTPUT
Time Elapsed Mean Value
0.0 128.3
5.0 150.0
10.0 133.3
Here is one way: use transform with groupby to build the group key 'Time Elapsed', then just groupby it and take the mean:
df['Time Elapsed'] = df.DateTime - df.groupby('ID').DateTime.transform('first')
df.groupby('Time Elapsed').Value.mean()
Out[998]:
Time Elapsed
00:00:00 128.333333
00:05:00 150.000000
00:10:00 133.333333
Name: Value, dtype: float64
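If the group key should be in minutes (0.0, 5.0, 10.0) as in the desired output rather than a timedelta, one possible follow-up is to convert it before grouping:
# Convert the per-ID elapsed time to minutes so the group key matches the
# 0.0 / 5.0 / 10.0 form in the desired output.
elapsed = (df.DateTime - df.groupby('ID').DateTime.transform('first')).dt.total_seconds() / 60
df.groupby(elapsed.rename('Time Elapsed')).Value.mean()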
You can do this explicitly by taking advantage of the datetime attributes of the DateTime column in your DataFrame.
First get the year, month and day for each DateTime, since they all change in your data:
df['month'] = df['DateTime'].dt.month
df['day'] = df['DateTime'].dt.day
df['year'] = df['DateTime'].dt.year
print(df)
ID DateTime Value month day year
1 1 2018-01-01 15:00:00 100 1 1 2018
1 1 2018-01-01 15:05:00 150 1 1 2018
1 1 2018-01-01 15:10:00 125 1 1 2018
2 2 2018-02-02 13:15:00 105 2 2 2018
2 2 2018-02-02 13:20:00 110 2 2 2018
2 2 2018-02-02 13:25:00 90 2 2 2018
3 3 2019-03-03 05:05:00 180 3 3 2019
3 3 2019-03-03 05:10:00 190 3 3 2019
3 3 2019-03-03 05:15:00 185 3 3 2019
Then append a sequential DateTime counter column (per this SO post).
The counter is computed within (1) each year, (2) then each month, and then (3) each day.
Since the data are in multiples of 5 minutes, use this to scale the counter values (i.e. the counter will be in multiples of 5 minutes rather than a sequence of increasing integers).
df['Time Elapsed'] = df.groupby(['year', 'month', 'day']).cumcount() + 1
df['Time Elapsed'] *= 5
print(df)
ID DateTime Value month day year Time Elapsed
1 1 2018-01-01 15:00:00 100 1 1 2018 5
1 1 2018-01-01 15:05:00 150 1 1 2018 10
1 1 2018-01-01 15:10:00 125 1 1 2018 15
2 2 2018-02-02 13:15:00 105 2 2 2018 5
2 2 2018-02-02 13:20:00 110 2 2 2018 10
2 2 2018-02-02 13:25:00 90 2 2 2018 15
3 3 2019-03-03 05:05:00 180 3 3 2019 5
3 3 2019-03-03 05:10:00 190 3 3 2019 10
3 3 2019-03-03 05:15:00 185 3 3 2019 15
Perform the groupby over the newly appended counter column
dfg = df.groupby('Time Elapsed')['Value'].mean()
print(dfg)
Time Elapsed
5 128.333333
10 150.000000
15 133.333333
Name: Value, dtype: float64
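If the counter should start at zero to match the desired output exactly, dropping the + 1 (so the column is just df.groupby(['year', 'month', 'day']).cumcount() * 5) yields the labels 0, 5, 10 instead of 5, 10, 15.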
I have a pandas dataframe with a single column 'Price' and dates as the index. I want to add a new column called 'Aprox' containing
aprox = price of today - price of one year ago (or the closest available date to a year ago) - price in one year (again, take the closest available date if the exact one-year price doesn't exist)
For example:
aprox for 2019-04-30 = 8 - 4 - 10 = -6, i.e. the price on 2019-04-30 minus the price on 2018-01-31 (the closest date to a year earlier) minus the price on 2020-07-30 (the closest date to a year later).
To be honest I am struggling a bit with that...
ex. [in]: Price
2018-01-31 4
2019-04-30 8
2020-07-30 10
2020-10-31 9
2021-01-31 14
2021-04-30 150
2021-07-30 20
2022-10-31 14
[out]: Price aprox.
2018-01-31 4
2019-04-30 8 -6 ((8-4-10) = -6) since there is no 2018-04-30
2020-07-30 10 -12 (10-14-8)
2020-10-31 9 ...
2021-01-31 14 ...
2021-04-30 150
2021-07-30 20
2022-10-31 14
I am struggling very much with that... even more with the approx.
Thank you very much!!
It's not quite clear to me what you are trying to do, but maybe this is what you want:
import pandas
def last_year(x):
"""
Return date from a year ago.
"""
return x - pandas.DateOffset(years=1)
# Simulate the data you provided in example
dt_str = ['2018-01-31', '2019-04-30', '2020-07-30', '2020-10-31',
'2021-01-31', '2021-04-30', '2021-07-30', '2022-10-31']
dates = [pandas.Timestamp(x) for x in dt_str]
df = pandas.DataFrame([4, 8, 10, 9, 14, 150, 20, 14], columns=['Price'], index=dates)
# This is the code that does the work
for dt, value in df['Price'].items():  # .items() rather than the deprecated .iteritems()
df.loc[dt, 'approx'] = value - df['Price'].asof(last_year(dt))
This gave me the following results:
In [147]: df
Out[147]:
Price approx
2018-01-31 4 NaN
2019-04-30 8 4.0
2020-07-30 10 2.0
2020-10-31 9 1.0
2021-01-31 14 6.0
2021-04-30 150 142.0
2021-07-30 20 10.0
2022-10-31 14 -6.0
The bottom line is that for this type of operation you can't just use the apply operation since you need both the index and the value.
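If "price in one year" is also meant to be subtracted (the question's example suggests this, although the exact rule is unclear), one possible extension of the same loop is sketched below; it assumes the forward-looking price should be the first available price on or after the date one year ahead:
def next_year(x):
    """
    Return date one year ahead.
    """
    return x + pandas.DateOffset(years=1)

for dt, value in df['Price'].items():
    past = df['Price'].asof(last_year(dt))      # closest price at or before one year ago
    pos = df.index.searchsorted(next_year(dt))  # first position at or after one year ahead
    future = df['Price'].iloc[pos] if pos < len(df) else float('nan')
    df.loc[dt, 'approx'] = value - past - future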