I have a pandas DataFrame with a single column 'Price' and dates as the index. I want to compute a new column called 'Aprox' containing
aprox. = price today - price one year ago (or the closest available date to one year ago) -
price one year from now (again, the closest available date if the exact one-year date doesn't exist)
for example
aprox. 2019-04-30 = 8 - 4 - 10 = -6, i.e. price 2019-04-30 - price 2018-01-31 - price 2020-07-30
To be honest I am struggling a bit with that...
ex. [in]: Price
2018-01-31 4
2019-04-30 8
2020-07-30 10
2020-10-31 9
2021-01-31 14
2021-04-30 150
2021-07-30 20
2022-10-31 14
[out]: Price aprox.
2018-01-31 4
2019-04-30 8 -6 ((8-4-10) = -6) since there is no 2018-04-30
2020-07-30 10 -12 (10-14-8)
2020-10-31 9 ...
2021-01-31 14 ...
2021-04-30 150
2021-07-30 20
2022-10-31 14
I am struggling very much with that... even more with the approximation part.
Thank you very much!!
It's not quite clear to me what you are trying to do, but maybe this is what you want:
import pandas
def last_year(x):
    """
    Return date from a year ago.
    """
    return x - pandas.DateOffset(years=1)
# Simulate the data you provided in example
dt_str = ['2018-01-31', '2019-04-30', '2020-07-30', '2020-10-31',
'2021-01-31', '2021-04-30', '2021-07-30', '2022-10-31']
dates = [pandas.Timestamp(x) for x in dt_str]
df = pandas.DataFrame([4, 8, 10, 9, 14, 150, 20, 14], columns=['Price'], index=dates)
# This is the code that does the work
for dt, value in df['Price'].items():
    # Series.asof() returns the value at the closest index date at or before last_year(dt)
    df.loc[dt, 'approx'] = value - df['Price'].asof(last_year(dt))
This gave me the following results:
In [147]: df
Out[147]:
Price approx
2018-01-31 4 NaN
2019-04-30 8 4.0
2020-07-30 10 2.0
2020-10-31 9 1.0
2021-01-31 14 6.0
2021-04-30 150 142.0
2021-07-30 20 10.0
2022-10-31 14 -6.0
The bottom line is that for this type of operation you can't just use the apply operation since you need both the index and the value.
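To also subtract the price roughly one year ahead, as the question asks, here is my own sketch (not part of the answer above); it assumes "closest date" means nearest in either direction and that the index is a sorted DatetimeIndex:
import pandas as pd
# Nearest-date lookups one year back and one year forward (a sketch).
idx = df.index
pos_prev = idx.get_indexer(idx - pd.DateOffset(years=1), method='nearest')
pos_next = idx.get_indexer(idx + pd.DateOffset(years=1), method='nearest')
prices = df['Price'].to_numpy()
df['aprox'] = prices - prices[pos_prev] - prices[pos_next]
Note that get_indexer always finds some nearest date, even at the edges of the index; passing a tolerance argument lets you mask rows where no date reasonably close to one year away exists.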
For example, I have several columns of dates and I want to get the month from them. Is there a way to loop through columns instead of running pd.DatetimeIndex(df['date']).month
multiple times? The example below is simplified. The real dataset has many more columns.
import pandas as pd
import numpy as np
np.random.seed(0)
rng_start = pd.date_range('2015-07-24', periods=5, freq='M')
rng_mid = pd.date_range('2019-06-24', periods=5, freq='M')
rng_end = pd.date_range('2022-03-24', periods=5, freq='M')
df = pd.DataFrame({ 'start_date': rng_start, 'mid_date': rng_mid, 'end_date': rng_end })
df
start_date mid_date end_date
0 2015-07-31 2019-06-30 2022-03-31
1 2015-08-31 2019-07-31 2022-04-30
2 2015-09-30 2019-08-31 2022-05-31
3 2015-10-31 2019-09-30 2022-06-30
4 2015-11-30 2019-10-31 2022-07-31
The intended output would be
start_date mid_date end_date start_month mid_month end_month
0 2015-07-31 2019-06-30 2022-03-31 7 6 3
1 2015-08-31 2019-07-31 2022-04-30 8 7 4
2 2015-09-30 2019-08-31 2022-05-31 9 8 5
3 2015-10-31 2019-09-30 2022-06-30 10 9 6
4 2015-11-30 2019-10-31 2022-07-31 11 10 7
You answered your question by saying "loop through columns":
for column in df:
df[column.replace("_date", "_month")] = df[column].dt.month
An alternative solution (a variation of @BENY's):
df[df.columns.str.replace("_date", "_month")] = df.apply(lambda x: x.dt.month, axis=1)
Try apply
df[['start_month', 'mid_month', 'end_month']] = df.apply(lambda x : x.dt.month,axis=1)
df
Out[244]:
start_date mid_date end_date start_month mid_month end_month
0 2015-07-31 2019-06-30 2022-03-31 7 6 3
1 2015-08-31 2019-07-31 2022-04-30 8 7 4
2 2015-09-30 2019-08-31 2022-05-31 9 8 5
3 2015-10-31 2019-09-30 2022-06-30 10 9 6
4 2015-11-30 2019-10-31 2022-07-31 11 10 7
You can avoid looping using stack:
out = df.join(df.filter(like='_date') # select _date columns
.stack() # convert to Series
.dt.month
.unstack() # back to DataFrame
.rename(columns=lambda x: x.replace('_date', '_month'))
)
Output:
start_date mid_date end_date start_month mid_month end_month
0 2015-07-31 2019-06-30 2022-03-31 7 6 3
1 2015-08-31 2019-07-31 2022-04-30 8 7 4
2 2015-09-30 2019-08-31 2022-05-31 9 8 5
3 2015-10-31 2019-09-30 2022-06-30 10 9 6
4 2015-11-30 2019-10-31 2022-07-31 11 10 7
Quite similar to this solution but a bit different:
df.join(df.applymap(lambda x: x.month)
          .set_axis(['start_month', 'mid_month', 'end_month'], axis=1))
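Yet another short variant (just a sketch along the same lines): apply column-wise over the _date columns, where each column is already datetime64, then rename:
# Column-wise apply: each s is a datetime64 column, so .dt.month works directly.
months = df.filter(like='_date').apply(lambda s: s.dt.month)
df = df.join(months.rename(columns=lambda c: c.replace('_date', '_month')))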
I am attempting to calculate the seasonal means for the winter months of DJF and DJ. I first tried to use Xarray's .groupby function:
ds.groupby('time.month').mean('time')
Then I realized that instead of grouping the previous year's December with the subsequent Jan./Feb., it was grouping all three months from the same year. I was then able to figure out how to solve for the DJF season by resampling and creating a function to select out the proper 3-month period:
def is_djf(month):
    return (month == 12)

ds_q = ds.resample(time='QS-MAR').mean('time')
ds_djf = ds_q.sel(time=is_djf(ds_q['time.month']))
I am still unfortunately unsure how to solve for the Dec./Jan. season, since the resampling method I used offsets by whole quarters. Thank you for any and all help!
Use resample with QS-DEC.
Suppose this dataframe:
time val
0 2020-12-31 1
1 2021-01-31 1
2 2021-02-28 1
3 2021-03-31 2
4 2021-04-30 2
5 2021-05-31 2
6 2021-06-30 3
7 2021-07-31 3
8 2021-08-31 3
9 2021-09-30 4
10 2021-10-31 4
11 2021-11-30 4
12 2021-12-31 5
13 2022-01-31 5
14 2022-02-28 5
>>> df.set_index('time').resample('QS-DEC').mean()
val
time
2020-12-01 1.0
2021-03-01 2.0
2021-06-01 3.0
2021-09-01 4.0
2021-12-01 5.0
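For the Dec./Jan.-only (DJ) season the question also asks about, one possible sketch (assuming the same frame with a datetime 'time' column and a 'val' column) is to keep only December and January and group each December with the following January:
# Label each December with the next year's number so it falls into the
# same group as the January that follows it.
dj = df[df['time'].dt.month.isin([12, 1])].copy()
dj['season_year'] = dj['time'].dt.year + (dj['time'].dt.month == 12)
dj_mean = dj.groupby('season_year')['val'].mean()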
I have a Python pandas DataFrame with 2 relevant columns, "date" and "value". Let's assume it looks like this and is ordered by date:
data = pd.DataFrame({"date": ["2021-01-01", "2021-01-31", "2021-02-01", "2021-02-28", "2021-03-01", "2021-03-31", "2021-04-01", "2021-04-02"],
"value": [1,2,3,4,5,6,5,8]})
data["date"] = pd.to_datetime(data['date'])
Now I want to join the DataFrame to itself in such a way that, for each last available day in a month, I get the next available day where the value is higher. In our example this should basically look like this:
date, value, date2, value2:
2021-01-31, 2, 2021-02-01, 3
2021-02-28, 4, 2021-03-01, 5
2021-03-31, 6, 2021-04-02, 8
2021-04-02, 8, NaN, NaN
My current partial solution to this problem looks like this:
last_days = data.groupby([data.date.dt.year, data.date.dt.month]).last()
res = [data.loc[(data.date>date) & (data.value > value)][:1] for date, value in zip(last_days.date, last_days.value)]
print(res)
But because of this answer "Don't iterate over rows in a dataframe", it doesn't feel like the pandas way to me.
So the question is, how to solve it the pandas way?
If you don’t have too many rows, you could generate all pairs of items and filter from there.
Let’s start with getting the last days in the month:
>>> last = data.loc[data['date'].dt.daysinmonth == data['date'].dt.day]
>>> last
date value
1 2021-01-31 2
3 2021-02-28 4
5 2021-03-31 6
Now use a cross join to map each last day to any possible day, then filter on criteria such as later date and larger value:
>>> pairs = pd.merge(last, data, how='cross', suffixes=('', '2'))
>>> pairs = pairs.loc[pairs['date2'].gt(pairs['date']) & pairs['value2'].gt(pairs['value'])]
>>> pairs
date value date2 value2
2 2021-01-31 2 2021-02-01 3
3 2021-01-31 2 2021-02-28 4
4 2021-01-31 2 2021-03-01 5
5 2021-01-31 2 2021-03-31 6
6 2021-01-31 2 2021-04-01 5
7 2021-01-31 2 2021-04-02 8
12 2021-02-28 4 2021-03-01 5
13 2021-02-28 4 2021-03-31 6
14 2021-02-28 4 2021-04-01 5
15 2021-02-28 4 2021-04-02 8
23 2021-03-31 6 2021-04-02 8
Finally use GroupBy.idxmin() to get the first date2:
>>> pairs.loc[pairs.groupby(['date', 'value'])['date2'].idxmin().values]
date value date2 value2
2 2021-01-31 2 2021-02-01 3
12 2021-02-28 4 2021-03-01 5
23 2021-03-31 6 2021-04-02 8
Otherwise you might want apply, which is pretty much the same as iterating on rows to be entirely honest.
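For completeness, a sketch of that apply-based variant (my own illustration, reusing last and data from above; first_higher is just a hypothetical helper name):
def first_higher(row):
    # First later row with a strictly higher value, or NaT/NaN if none exists.
    later = data.loc[(data['date'] > row['date']) & (data['value'] > row['value'])]
    if len(later):
        return later.iloc[0]
    return pd.Series({'date': pd.NaT, 'value': float('nan')})

result = last.join(last.apply(first_higher, axis=1), rsuffix='2')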
First create 2 masks: one for rows that are the last day of a month (i.e. the next row is exactly one day later), and another for the first day of the following month.
m1 = data['date'].diff(1).shift(-1) == pd.Timedelta(days=1)
m2 = m1.shift(1, fill_value=False)
Finally, concatenate the 2 results ignoring index:
>>> pd.concat([data.loc[m1].reset_index(drop=True),
data.loc[m2].reset_index(drop=True)], axis="columns")
date value date value
0 2021-01-31 2 2021-02-01 3
1 2021-02-28 4 2021-03-01 5
2 2021-03-31 6 2021-04-01 5
3 2021-04-01 5 2021-04-02 8
One option is conditional_join from pyjanitor, which uses binary search under the hood and should be faster and more memory efficient than a cross merge as the data size increases. Also have a look at the piso library and see if it can be helpful/more efficient here.
Get the last dates via a groupby (the assumption here is that the data is already sorted; if not, you can sort it before grouping):
# pip install pyjanitor
import pandas as pd
import janitor
trim = (data
.groupby([data.date.dt.year, data.date.dt.month], as_index = False)
.nth(-1)
)
trim
date value
1 2021-01-31 2
3 2021-02-28 4
5 2021-03-31 6
7 2021-04-02 8
Use conditional_join to get rows where the value from trim is less than the value in data, and the date from trim is earlier than the date in data:
trimmed = trim.conditional_join(data,
# variable arguments
# tuple is of the form:
# col_from_left_df, col_from_right_df, comparator
('value', 'value', '<'),
('date', 'date', '<'),
how = 'left')
trimmed
left right
date value date value
0 2021-01-31 2 2021-02-01 3.0
1 2021-01-31 2 2021-02-28 4.0
2 2021-01-31 2 2021-03-01 5.0
3 2021-01-31 2 2021-04-01 5.0
4 2021-01-31 2 2021-03-31 6.0
5 2021-01-31 2 2021-04-02 8.0
6 2021-02-28 4 2021-03-01 5.0
7 2021-02-28 4 2021-04-01 5.0
8 2021-02-28 4 2021-03-31 6.0
9 2021-02-28 4 2021-04-02 8.0
10 2021-03-31 6 2021-04-02 8.0
11 2021-04-02 8 NaT NaN
Since the only interest is in the first match, a groupby is required.
trimmed = (trimmed
.groupby(('left', 'date'), dropna = False, as_index = False)
.nth(0)
)
trimmed
left right
date value date value
0 2021-01-31 2 2021-02-01 3.0
6 2021-02-28 4 2021-03-01 5.0
10 2021-03-31 6 2021-04-02 8.0
11 2021-04-02 8 NaT NaN
You can flatten the columns:
trimmed.set_axis(['date', 'value', 'date2', 'value2'], axis = 'columns')
date value date2 value2
0 2021-01-31 2 2021-02-01 3.0
6 2021-02-28 4 2021-03-01 5.0
10 2021-03-31 6 2021-04-02 8.0
11 2021-04-02 8 NaT NaN
I have a dataframe, df that looks like this
Date Value
10/1/2019 5
10/2/2019 10
10/3/2019 15
10/4/2019 20
10/5/2019 25
10/6/2019 30
10/7/2019 35
I would like to calculate the delta for a period of 7 days
Desired output:
Date Delta
10/1/2019 30
This is what I am trying; a user helped me with a variation of the code below:
df['Delta']=df.iloc[0:,1].sub(df.iloc[6:,1]), Date=pd.Series
(pd.date_range(pd.Timestamp('2019-10-01'),
periods=7, freq='7d'))[['Delta','Date']]
Any suggestions are appreciated.
Let us try shift
s = df.set_index('Date')['Value']
df['New'] = s.shift(freq = '-6 D').reindex(s.index).values
df['DIFF'] = df['New'] - df['Value']
df
Out[39]:
Date Value New DIFF
0 2019-10-01 5 35.0 30.0
1 2019-10-02 10 NaN NaN
2 2019-10-03 15 NaN NaN
3 2019-10-04 20 NaN NaN
4 2019-10-05 25 NaN NaN
5 2019-10-06 30 NaN NaN
6 2019-10-07 35 NaN NaN
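If the frame really has exactly one row per calendar day and is sorted by Date (an assumption the question seems to make), a shorter sketch is a plain positional shift:
# Value 6 rows ahead minus today's value, i.e. the change over a 7-day window.
df['Delta'] = df['Value'].shift(-6) - df['Value']
This gives 30 for 10/1/2019 and NaN for the remaining rows.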
I have solutions for this question, 2 solutions in fact, but I'm not happy with them. The reason is that the files I'm trying to read have about 12 million rows, and with these solutions it takes a huge amount of time to process them, mainly because the solutions are row-by-row operations.
So, I read the file like this:
In [1]: df = pd.read_csv('C:/Projects/NPMRDS/FHWA_TASK2-4_NJ_09_2013_TT.CSV')
df.head()
Out [1]: TMC DATE EPOCH Travel_TIME_ALL_VEHICLES Travel_TIME_PASSENGER_VEHICLES Travel_TIME_FREIGHT_TRUCKS
0 103N04152 9252013 211 12 12 NaN
1 103N04152 9262013 0 7 7 NaN
2 103N04152 9032013 177 8 8 NaN
3 103N04152 9042013 176 8 9 7
My problem is with the DATE and EPOCH columns. I want to merge them into a single datetime column.
DATE is in '%m%d%Y' format (with the leading zero missing)
EPOCH is 5 minute epoch of a day:
Time EPOCH
00:00:00 => 0
00:05:00 => 1
...
...
12:00:00 => 144
12:05:00 => 145
...
...
23:50:00 => 286
23:55:00 => 287
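In other words, the time of day is simply EPOCH * 5 minutes. A quick spot-check of the mapping (just an illustration):
import pandas as pd
# 0 -> 00:00, 1 -> 00:05, 144 -> 12:00, 287 -> 23:55
print(pd.to_timedelta(pd.Series([0, 1, 144, 287]) * 5, unit='min'))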
What I want is something like this:
In [2]: df.head()
Out [2]: TMC DATE_TIME DATE EPOCH Travel_TIME_ALL_VEHICLES Travel_TIME_PASSENGER_VEHICLES Travel_TIME_FREIGHT_TRUCKS
0 103N04152 2013-09-25 17:35:00 9252013 211 12 12 NaN
1 103N04152 2013-09-26 00:00:00 9262013 0 7 7 NaN
2 103N04152 2013-09-03 14:45:00 9032013 177 8 8 NaN
3 103N04152 2013-09-04 14:40:00 9042013 176 8 9 7
Now, I can do this row by row, as I mentioned earlier, in any of these three ways:
In [3]: df = pd.read_csv('C:/Projects/NPMRDS/FHWA_TASK2-4_NJ_09_2013_TT.CSV',
converters={'DATE': lambda x: datetime.datetime.strptime(x, '%m%d%Y'),
'EPOCH': lambda x: str(datetime.timedelta(minutes = int(x)*5))},
parse_dates = {'date_time': ['DATE', 'EPOCH']},
keep_date_col = True)
df.head()
Out [3]: date_time TMC DATE EPOCH Travel_TIME_ALL_VEHICLES Travel_TIME_PASSENGER_VEHICLES Travel_TIME_FREIGHT_TRUCKS
0 2013-09-25 17:35:00 103N04152 2013-09-25 17:35:00 12 12 NaN
1 2013-09-26 00:00:00 103N04152 2013-09-26 00:00:00 7 7 NaN
2 2013-09-03 14:45:00 103N04152 2013-09-03 14:45:00 8 8 NaN
3 2013-09-04 14:40:00 103N04152 2013-09-04 14:40:00 8 9 7
4 2013-09-05 09:35:00 103N04152 2013-09-05 09:35:00 10 10 NaN
With this method I lose the original formatting of DATE and EPOCH, but it doesn't really affect further computations on the dataframe. Instead of using converters as an argument, I could have used date_parser. Or, after reading the data as in In [1] above, I could have done something like this:
In [4]: df = pd.read_csv('C:/Projects/NPMRDS/FHWA_TASK2-4_NJ_09_2013_TT.CSV')
df['date_time'] = pd.to_datetime([datetime.datetime.strptime(str(df['DATE'][x]), '%m%d%Y') + datetime.timedelta(minutes = int(df['EPOCH'][x]*5)) for x in range(len(df))])
df.head()
Out [4]: TMC DATE EPOCH Travel_TIME_ALL_VEHICLES Travel_TIME_PASSENGER_VEHICLES Travel_TIME_FREIGHT_TRUCKS DATE_TIME
0 103N04152 9252013 211 12 12 NaN 2013-09-25 17:35:00
1 103N04152 9262013 0 7 7 NaN 2013-09-26 00:00:00
2 103N04152 9032013 177 8 8 NaN 2013-09-03 14:45:00
3 103N04152 9042013 176 8 9 7 2013-09-04 14:40:00
4 103N04152 9052013 115 10 10 NaN 2013-09-05 09:35:00
A more desirable result (don't worry about the column orders), but still row-by-row, and takes a huge amount of time.
Then there are pandas.to_datetime and pandas.to_timedelta, which run much faster than the methods described above. But I cannot merge the results together without resorting to string functions, which are again mainly row-by-row.
Does anyone know a better way to do this?
Edit: Solution!!!
In addition to chrisb's answer, I found a way to do it as well. The trick lies in setting the box parameter to False in pandas.to_datetime(). Like so:
df['DATE_TIME'] = pd.to_datetime(df['DATE'], format='%m%d%Y', box=False) + pd.to_timedelta(df['EPOCH']*5*60, unit='s')
Setting that to False returns a numpy.datetime64 array instead of a pandas.DatetimeIndex. More information can be found in the pandas.to_datetime() documentation. And pandas.to_timedelta() does not work with unit='m'.
Try this out - reduced runtime for me to about 1s (compared to 15s) on 4M rows of test data.
df = pd.read_csv('temp.csv')
df['DATE'] = pd.to_datetime(df['DATE'], format='%m%d%Y')
df['EPOCH'] = pd.to_timedelta((df['EPOCH'].astype(int) * 5).astype('timedelta64[m]'))
df['DATE_TIME'] = df['DATE'] + df['EPOCH']
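On current pandas versions, where the box parameter has been removed from to_datetime() and the integer astype('timedelta64[m]') cast may no longer work, a hedged equivalent of the same idea would be:
# Assumes DATE like 9252013 ('%m%d%Y' without the leading zero) and EPOCH as
# a 5-minute slot index, as in the question.
df['DATE_TIME'] = (pd.to_datetime(df['DATE'], format='%m%d%Y')
                   + pd.to_timedelta(df['EPOCH'].astype(int) * 5, unit='min'))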