Finding last trade from pandas dataframe - python

I have a table of trades, which have the form (for simplicity):
Ticker Timestamp price
0 AAPL 9:30:00 139
1 FB 11:33:14 110
And so on. Now, I want to extract the last trade of the day for each ticker, which can certainly be done as follows (assuming the original table is called trades):
trades['Timestamp']=pd.to_datetime(trades['Timestamp'])
aux = trades.groupby(['Ticker'])['Timestamp'].max()
auxdf = aux.to_frame()
auxdf = auxdf.reset_index()
closing = pd.merge(left=trades,right=auxdf, left_on=['Ticker','Timestamp'],right_on=['Ticker', 'Timestamp'])
Now, this works, but I am not sure if it is either the most elegant or the most efficient approach. Any suggestions?

Try to use loc and idxmax (.ix is deprecated and has since been removed from pandas):
trades['Timestamp'] = pd.to_datetime(trades['Timestamp'])
trades.loc[trades.groupby('Ticker').Timestamp.idxmax()]
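For completeness, an equivalent one-liner (a sketch, assuming trades['Timestamp'] has already been converted to datetime as above) is to sort by time and keep the last row per ticker:
closing = (trades.sort_values('Timestamp')
                 .drop_duplicates('Ticker', keep='last'))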

Related

Grouping Dataframe by Multiple Columns, and Then Dropping Duplicates

I have a dataframe which looks like this (see table). For simplicity's sake, "aapl" is the only ticker shown; the real dataframe has more tickers.
ticker  year  return
aapl    1999  1
aapl    2000  3
aapl    2000  2
What I'd like to do is first group the dataframe by ticker, then by year. Next, I'd like to remove any duplicate years. In the end the dataframe should look like this:
ticker  year  return
aapl    1999  1
aapl    2000  3
I have a working solution, but it's not very "Pandas-esque", and involves for loops. I'm semi-certain that if I come back to the solution in three months, it'll be completely foreign to me.
Right now, I've been working on the following, with little luck:
df = df.groupby('ticker').groupby('year').drop_duplicates(subset=['year'])
This however, produces the following error:
AttributeError: 'DataFrameGroupBy' object has no attribute 'groupby'
Any help here would be greatly appreciated, thanks.
@QuangHoang provided the simplest version in the comments:
df.drop_duplicates(['ticker', 'year'])
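For reference, a minimal runnable sketch with the example data from the question (the default keep='first' retains the first row of each (ticker, year) pair):
import pandas as pd
df = pd.DataFrame({'ticker': ['aapl', 'aapl', 'aapl'],
                   'year': [1999, 2000, 2000],
                   'return': [1, 3, 2]})
print(df.drop_duplicates(['ticker', 'year']))
#  ticker  year  return
#0   aapl  1999       1
#1   aapl  2000       3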
Alternatively, you can use .groupby twice, inside two .applys:
df.groupby("ticker", group_keys=False).apply(lambda x:
    x.groupby("year", group_keys=False).apply(lambda x: x.drop_duplicates(['year']))
)
Alternatively, you can use the .duplicated function:
df.groupby('ticker', group_keys=False).apply(lambda x:
    x[~x['year'].duplicated(keep='first')])
You can try to sort the values first and then use groupby.tail:
df.sort_values('return').groupby(['ticker','year']).tail(1)
  ticker  year  return
0   aapl  1999       1
1   aapl  2000       3
I'm almost sure you want to do this:
df.drop_duplicates(subset=["ticker","year"])
Output:
  ticker  year  return
0   aapl  1999       1
1   aapl  2000       3

How to reference row below in Python Pandas Dataframe?

I have a function that gets the stock price (adjusted closing price) on a specific date of the format DD-MM-YYYY. I have a Pandas Dataframe that looks like the following, with a column for date, as well as the stock price calculated using said function.
Date Stock price Percent change
0 02-07-2022 22.09
1 06-04-2022 18.22
2 01-01-2022 16.50
3 30-09-2021 18.15
4 03-07-2021 17.96
I need to calculate the percent change, which is calculated by taking (new/old - 1)*100, so in the top cell it would say (22.09/18.22 - 1)*100 = 21.24039517 because the stock increased 21.2% between 06-04-2022 and 02-07-2022.
So I need to "reference" the row below when applying a function, while still referencing the current row, because I need both to calculate the change. For the bottom one, it can just be NaN or similar. Any suggestions?
I would first sort on date (given that that column is already datetime):
df = df.sort_values(by='Date', ascending=True)
And then calculate the percentage change and fill NaN with 0, or with something else if you prefer:
df["Percent change"] = df["Stock price"].pct_change(periods=1).fillna(0)

Pandas.DataFrame - find the oldest date for which a value is available

I have a pandas.DataFrame object containing 2 time series. One series is much shorter than the other.
I want to determine the earliest date for which data is available in the shorter series, and remove the data in both columns before that date.
What is the most pythonic way to do that?
(I apologize that I don't really follow the SO guideline for submitting questions)
Here is a fragment of my dataframe:
osr go
Date
1990-08-17 NaN 239.75
1990-08-20 NaN 251.50
1990-08-21 352.00 265.00
1990-08-22 353.25 274.25
1990-08-23 351.75 290.25
In this case, I want to get rid of all rows before 1990-08-21 (I should add that there may be NaNs in one of the columns for more recent dates).
You can use idxmax on the inverted series s = df['osr'][::-1] and then take a subset of df:
print(df)
# osr go
#Date
#1990-08-17 NaN 239.75
#1990-08-20 NaN 251.50
#1990-08-21 352.00 265.00
#1990-08-22 353.25 274.25
#1990-08-23 351.75 290.25
s = df['osr'][::-1]
print(s)
#Date
#1990-08-23 351.75
#1990-08-22 353.25
#1990-08-21 352.00
#1990-08-20 NaN
#1990-08-17 NaN
#Name: osr, dtype: float64
maxnull = s.isnull().idxmax()
print(maxnull)
#1990-08-20 00:00:00
print(df[df.index > maxnull])
# osr go
#Date
#1990-08-21 352.00 265.00
#1990-08-22 353.25 274.25
#1990-08-23 351.75 290.25
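A more direct route to the same cut-off (a sketch, assuming the same df) is Series.first_valid_index, which returns the first index label holding a non-NaN value:
start = df['osr'].first_valid_index()   # 1990-08-21 in this example
print(df.loc[start:])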
EDIT: New answer based upon comments/edits
It sounds like the data is sequential and once you have lines that don't have data you want to throw them out. This can be done easily with dropna.
df = df.dropna()
This answer assumes that once you are past the bad rows, they stay good. It also works if you don't care about dropping rows in the middle; it depends on how sequential you need to be. If the data needs to be sequential and your input is well formed, jezrael's answer is good.
Original answer
You haven't given much here by way of structure in your dataframe, so I am going to make assumptions. I'm going to assume you have many columns, two of which, time_series_1 and time_series_2, are the ones you referred to in your question, and that this is all stored in df.
First we can find the shorter series (the one with fewer non-NaN values) by just using
shorter_col = df['time_series_1'] if df['time_series_1'].count() < df['time_series_2'].count() else df['time_series_2']
Now we want the earliest date for which that series has data
remove_date = shorter_col.dropna().index.min()
Now we want to remove data before that date
df = df[df.index >= remove_date]

Search in pandas dataframe

Potentially a slightly misleading title but the problem is this:
I have a large dataframe with multiple columns. This looks a bit like
df =
id date value
A 01-01-2015 1.0
A 03-01-2015 1.2
...
B 01-01-2015 0.8
B 02-01-2015 0.8
...
What I want to do is within each of the IDs I identify the date one week earlier and place the value on this date into e.g. a 'lagvalue' column. The problem comes with not all dates existing for all ids so a simple .shift(7) won't pull the correct value [in this instance I guess I should put a NaN in].
I can do this with a lot of horrible iterating over the dates and ids to find the value, for example some rough idea
[
df[
df['date'] == df['date'].iloc[i] - datetime.timedelta(weeks=1)
][
df['id'] == df['id'].iloc[i]
]['value']
for i in range(len(df.index))
]
but I'm certain there is a 'better' way to do it that cuts down on time and processing that I just can't think of right now.
I could write a function using a groupby on the id and then look within that and I'm certain that would reduce the time it would take to perform the operation - is there a much quicker, simpler way [aka am I having a dim day]?
Basic strategy is, for each id, to:
Use date index
Use reindex to expand the data to include all dates
Use shift to shift 7 spots
Use ffill to do last value interpolation. I'm not sure if you want this, or possibly bfill which will use the last value less than a week in the past. But simple to change. Alternatively, if you want NaN when not available 7 days in the past, you can just remove the *fill completely.
Drop unneeded data
This algorithm gives NaN when the lag is too far in the past.
There are a few assumptions here. In particular that the dates are unique inside each id and they are sorted. If not sorted, then use sort_values to sort by id and date. If there are duplicate dates, then some rules will be needed to resolve which values to use.
import pandas as pd
import numpy as np
# two ids with different (irregular) date spacings
dates = pd.date_range('2001-01-01', periods=100)
dates = dates[::3]
A = pd.DataFrame({'date': dates,
                  'id': ['A'] * len(dates),
                  'value': np.random.randn(len(dates))})
dates = pd.date_range('2001-01-01', periods=100)
dates = dates[::5]
B = pd.DataFrame({'date': dates,
                  'id': ['B'] * len(dates),
                  'value': np.random.randn(len(dates))})
df = pd.concat([A, B])
with_lags = []
for id, group in df.groupby('id'):
    group = group.set_index(group.date)
    index = group.index
    # expand to a full daily index, forward-fill, lag by 7 days, then cut back to the original dates
    group = group.reindex(pd.date_range(group.index[0], group.index[-1]))
    group = group.ffill()
    group['lag_value'] = group.value.shift(7)
    group = group.loc[index]
    with_lags.append(group)
with_lags = pd.concat(with_lags, axis=0)
with_lags.index = np.arange(with_lags.shape[0])
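An alternative that avoids the reindexing loop (a sketch, not part of the answer above, assuming df['date'] is already datetime and (id, date) pairs are unique) is a self-merge on the date shifted forward by one week; rows with no trade exactly 7 days earlier naturally get NaN:
shifted = df[['id', 'date', 'value']].copy()
shifted['date'] = shifted['date'] + pd.Timedelta(weeks=1)    # each value 'lands' one week later
shifted = shifted.rename(columns={'value': 'lagvalue'})
df = df.merge(shifted, on=['id', 'date'], how='left')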

Count repeating events in pandas timeseries

I have been working with very basic pandas for a few days but am struggling with my current task:
I have a (non-normalized) timeseries whose items contain a userid per timestamp, so something like (date, userid, payload). Think of a server logfile where I would like to find how many IPs return within a certain time period.
Now I would like to find how many of the users have multiple items within an interval, for example within 4 weeks. So it's more a sliding window than constant intervals on the t-axis.
So my approaches were: reindex df_users on userids, or use a multiindex? Sadly, I didn't find a way to generate the results successfully.
So all in all, I'm not sure how to realize that kind of search with pandas, or maybe this is easier to implement in pure Python? Or do I just lack some keywords for that problem?
Some dummy data that I think fits your problem.
import pandas as pd
df = pd.DataFrame({'id': ['A','A','A','B','B','B','C','C','C'],
                   'time': ['2013-1-1', '2013-1-2', '2013-1-3',
                            '2013-1-1', '2013-1-5', '2013-1-7',
                            '2013-1-1', '2013-1-7', '2013-1-12']})
df['time'] = pd.to_datetime(df['time'])
This approach requires some kind of non-missing numeric column to count with, so just add a dummy one.
df['dummy_numeric'] = 1
My approach to the problem is this. First, groupby the id and iterate so we are working with one user id's worth of data at a time. Next, resample the irregular data up to daily values so it is normalized.
Then, using a rolling count, count the number of observations in each X-day window (using 3 here). This works because the upsampled data will be filled with NaN and not counted. Notice that only the numeric column is passed to the rolling count, and also note the use of double brackets (which selects a DataFrame rather than a Series).
window_days = 3
ids = []
for _, df_gb in df.groupby('id'):
    # upsample to a regular daily index; days without an event become NaN
    df_gb = df_gb.set_index('time').resample('D').asfreq()
    df_gb = df_gb[['dummy_numeric']].rolling(window_days, min_periods=1).count().reset_index()
    ids.append(df_gb)
Combine all the data back together and mark the spans with more than one observation:
df_stack = pd.concat(ids, ignore_index=True)
df_stack['multiple_requests'] = (df_stack['dummy_numeric'] > 1).astype(int)
Then groupby and sum, and you should have the right answer.
df_stack.groupby('time')['multiple_requests'].sum()
Out[356]:
time
2013-01-01 0
2013-01-02 1
2013-01-03 1
2013-01-04 0
2013-01-05 0
2013-01-06 0
2013-01-07 1
2013-01-08 0
2013-01-09 0
2013-01-10 0
2013-01-11 0
2013-01-12 0
Name: multiple_requests, dtype: int32
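With newer pandas, a time-based rolling window can replace the upsample-and-count step (a sketch using the same dummy df; '3D' plays the role of window_days, and the times within each id must be sorted):
counts = (df.set_index('time')
            .groupby('id')['dummy_numeric']
            .rolling('3D')
            .count())
repeats = counts[counts > 1]   # (id, time) pairs with more than one event in the window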
