Get a subset of a dataframe based on the index position of a label - python

I have a dataframe from Yahoo Finance:
import pandas as pd
import yfinance
ticker = yfinance.Ticker("INFY.NS")
df = ticker.history(period = '1y')
print(df)
This gives me df, indexed by date.
If I specify
date = "2021-04-23"
I need a subset of df containing:
the row whose index label is "2021-04-23"
the 2 rows before that date
the 1 row after that date
The important thing here is that we cannot calculate "before" and "after" from date strings, as df may be missing some dates; the rows have to be selected by index position (i.e. the 2 rows at the previous positions and the 1 row at the next position).
For example, df has no row for "2021-04-21", but it does have "2021-04-20".
How can we implement this?

You can go for integer-based indexing. First find the integer location of the desired date and then take the desired subset with iloc:
import numpy as np

def get_subset(df, date):
    # get the integer positions of the row(s) matching the date
    matching_dates_inds, = np.nonzero(df.index == date)
    # and take the first one (works in case of duplicates)
    first_matching_date_ind = matching_dates_inds[0]
    # take the 4-row subset: 2 rows before, the date itself, 1 row after
    desired_subset = df.iloc[first_matching_date_ind - 2: first_matching_date_ind + 2]
    return desired_subset
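A minimal usage sketch (assuming df is the yfinance dataframe from the question):
date = "2021-04-23"
print(get_subset(df, date))  # 2 rows before the date, the date itself, and 1 row after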

If you need the before and after values by position (assuming the date always exists in the DatetimeIndex), use DataFrame.iloc with the position obtained from Index.get_loc. min and max are used so the selection does not fail when there are fewer than 2 rows before or 1 row after, as in the sample data:
df = pd.DataFrame({'a': [1, 2, 3]},
                  index=pd.to_datetime(['2021-04-21', '2021-04-23', '2021-04-25']))
date = "2021-04-23"
pos = df.index.get_loc(date)
df = df.iloc[max(0, pos-2):min(len(df), pos+2)]
print (df)
            a
2021-04-21  1
2021-04-23  2
2021-04-25  3
Notice:
min and max are added so the selection does not fail if the date is the first row (there are not 2 values before it), the second row (there is only 1 value before it), or the last row (there is no value after it).
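Note that Index.get_loc raises a KeyError if the date is not present in the index. A minimal sketch of a safe wrapper (get_window is just an illustrative name, assuming a unique DatetimeIndex):
def get_window(df, date):
    # return an empty frame if the date is missing,
    # otherwise 2 rows before, the date itself and 1 row after
    try:
        pos = df.index.get_loc(date)
    except KeyError:
        return df.iloc[0:0]
    return df.iloc[max(0, pos - 2):min(len(df), pos + 2)]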

Related

How to search for a specific date within concatenated DataFrame TimeSeries. Same Date would repeat several times in a merged df

I downloaded historical price data for the ^GSPC stock market index (S&P 500), and several other global indices. Date is set as the index.
Selecting values in rows when date is set to index works as expected with .loc.
# S&P500 DataFrame = spx_df
spx_df.loc['2010-01-04']
Open 1.116560e+03
High 1.133870e+03
Low 1.116560e+03
Close 1.132990e+03
Volume 3.991400e+09
Dividends 0.000000e+00
Stock Splits 0.000000e+00
Name: 2010-01-04 00:00:00-05:00, dtype: float64
I then concatenated several stock market global indices into a single DataFrame for further use. In effect, any date in the range will be included five times when historical data for five stock indices are linked in a time series.
markets = pd.concat(ticker_list, axis = 0)
I want to reference a single date in the concatenated df and set it as a variable. I would prefer that the said variable not be a datetime object, because I would like to access it with .loc inside a function. How does concatenation affect accessing rows via the date index when the same date repeats several times in the linked time series?
This is what I attempted so far:
# markets = concatenated DataFrame
Reference_date = markets.loc['2010-01-04']
# KeyError: '2010-01-04'
Reference_date = markets.loc[markets.Date == '2010-01-04']
# This doesn't work because Date is not an attribute of the DataFrame
Since you have set the date as the index, you should be able to do:
Reference_date = markets.loc[markets.index == '2010-01-04']
To access a specific date in the concatenated DataFrame, you can use boolean indexing instead of .loc. This will return a DataFrame that contains all rows where the date equals the reference date:
reference_date = markets[markets.index == '2010-01-04']
You may also want to use the query() method to search for specific dates:
reference_date = markets.query('index == "2010-01-04"')
Keep in mind that the resulting variable reference_date is still a DataFrame and contains all rows that match the reference date across all the concatenated DataFrames. If you want to extract only a specific column, you can use the column name like this:
reference_date_Open = markets.query('index == "2010-01-04"')["Open"]
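If you also need to know which of the original frames each matching row came from, one option (a sketch with hypothetical frame names, not part of the answers above) is to pass keys to pd.concat so the source becomes a level of a MultiIndex:
# spx_df and ndx_df are hypothetical per-index frames; the dict keys label each source
markets = pd.concat({'SPX': spx_df, 'NDX': ndx_df}, axis=0, names=['Ticker', 'Date'])
reference_date = markets[markets.index.get_level_values('Date') == '2010-01-04']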

Pandas delete duplicate rows based on timestamp

I have a dataset with multiple duplicate records based on timestamps for the same date. I want to keep the record with the max timestamp and delete the other records for a given id and date combination.
Sample dataset
id|timestamp|value
--|---------|-----
1|2022-04-19T18:46:36.259+0000|xyz
1|2022-04-19T18:46:36.302+0000|xyz
1|2022-04-19T18:46:36.357+0000|xyz
1|2022-04-24T00:41:40.871+0000|xyz
1|2022-04-24T00:41:40.879+0000|xyz
1|2022-05-02T10:15:25.829+0000|xyz
1|2022-05-02T10:15:25.832+0000|xyz
Final Df
id|timestamp|value
--|---------|-----
1|2022-04-19T18:46:36.357+0000|xyz
1|2022-04-24T00:41:40.879+0000|xyz
1|2022-05-02T10:15:25.832+0000|xyz
If you add the data as code, it'll be easier to share the result. Since you already have the data, it's simpler to post it as code or text.
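For reference, a minimal construction of the sample data above, so the snippet below is reproducible:
import pandas as pd

df = pd.DataFrame({
    'id': [1] * 7,
    'timestamp': ['2022-04-19T18:46:36.259+0000', '2022-04-19T18:46:36.302+0000',
                  '2022-04-19T18:46:36.357+0000', '2022-04-24T00:41:40.871+0000',
                  '2022-04-24T00:41:40.879+0000', '2022-05-02T10:15:25.829+0000',
                  '2022-05-02T10:15:25.832+0000'],
    'value': ['xyz'] * 7,
})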
# for each date, keep only the row with the latest timestamp:
# create a date-only field from the timestamp, used to identify the duplicates
# sort values so the latest timestamp for an id comes last
# drop duplicates based on id and date, keeping the last row
# finally drop the temporary column
(df.assign(d=pd.to_datetime(df['timestamp']).dt.date)
   .sort_values(['id', 'timestamp'])
   .drop_duplicates(subset=['id', 'd'], keep='last')
   .drop(columns='d')
)
id timestamp value
2 1 2022-04-19T18:46:36.357+0000 xyz
4 1 2022-04-24T00:41:40.879+0000 xyz
6 1 2022-05-02T10:15:25.832+0000 xyz
A combination of .groupby and .max will do:
import pandas as pd
dates = pd.to_datetime(['01-01-1990', '01-02-1990', '01-02-1990', '01-03-1990'])
values = [1] * len(dates)
ids = values[:]
df = pd.DataFrame(zip(dates, values, ids), columns=['timestamp', 'val', 'id'])
selection = df.groupby(['val', 'id'])['timestamp'].max().reset_index()
print(selection)
output
val id timestamp
0 1 1 1990-01-03
You can use the following code for your task:
df.groupby(["id","value"]).max()
Explanation: first group by the id and value columns and then select only the maximum.
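A minimal usage sketch (assuming df is the sample data above). Note that this returns one row per (id, value) pair, so it only reproduces the three-row expected output if the grouping keys distinguish the dates; as_index=False keeps the group keys as regular columns:
latest = df.groupby(['id', 'value'], as_index=False)['timestamp'].max()
print(latest)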

Compare column value at one time to another pandas datetime index

I have a pandas dataframe with a datetime index and some column, 'value'. I would like to compare the 'value' value at a given time of day to the value at a different time of the same day. E.g. compare the 10am value to the 10pm value.
Right now I can get the value at either side using:
mask = df[df.index.hour == hour]
The problem is that this returns a dataframe indexed at that hour's timestamps, so doing mask1.value - mask2.value returns NaNs since the indexes are different.
I can get around this in a convoluted way:
out = mask.value.loc["2020-07-15"].reset_index() - mask2.value.loc["2020-07-15"].reset_index()  # assuming mask2 is the same as the mask call but at a different hour
but this is tiresome to loop over for a dataset that spans years. (Obviously I could increment a timedelta in the loop to avoid the hard-coded dates.)
I don't actually care if some nan's get into the end result if some, e.g. 10am, values are missing.
Edit:
Initial dataframe:
index values
2020-05-10T10:00:00 23
2020-05-10T11:00:00 20
2020-05-10T12:00:00 5
.....
2020-05-30T22:00:00 8
2020-05-30T23:00:00 8
2020-05-30T24:00:00 9
Expected dataframe:
index date newval
0 2020-05-10 18
.....
x 2020-05-30 1
where newval is the subtraction of the two different times I described above (e.g. the 10am measurement minus the 12pm measurement, so 23 - 5 = 18); the second entry is made up.
it doesn't matter to me if date is a separate column or the index.
A workaround:
mask1 = df[df.index.hour == hour1]
mask2 = df[df.index.hour == hour2]
out = mask1.values - mask2.values  # .values returns a NumPy array without the indices
result_df = pd.DataFrame(index=pd.date_range(start, end), data=out)
It should save you the effort of looping over the dates.
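If the two hours do not appear on exactly the same set of days, a sketch of an index-aligned alternative (assuming the column is named 'values' as in the sample above) is to re-index each series by calendar date and let pandas align them, producing NaN where either hour is missing:
s1 = df.loc[df.index.hour == hour1, 'values']
s2 = df.loc[df.index.hour == hour2, 'values']
s1.index = s1.index.normalize()   # drop the time of day, keep the date
s2.index = s2.index.normalize()
newval = (s1 - s2).rename('newval')  # aligned on date; NaN where a reading is missing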

Fetch previous rows based on a condition and the shift function - Python dataframe

I have data as shown below. I would like to select rows based on two conditions:
1) rows that start with a digit (1, 2, 3, etc.)
2) the row immediately preceding each record that satisfies the 1st condition
Please find below how the input data looks and how I expect the output to be.
I tried using the shift(-1) function but it seems to be throwing an error. I am sure I messed up the logic/syntax. Please find below the code that I tried:
# I get the index of all records that start with a number
s = df1.loc[df1['VARIABLE'].str.contains(r'^\d') == True].index
# now I need to get the previous record of each group, but this is incorrect
df1.loc[((df1['VARIABLE'].shift(-1).str.contains(r'^\d') == False) &
         (df1['VARIABLE'].str.contains(r'^\d') == True))].index
Use:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'VARIABLE': ['studyid', np.nan, 'age_interview', 'Gender', '1.Male',
                                 '2.Female', np.nan, 'dob', 'eth',
                                 'Ethnicity', '1.Chinese', '2.Indian', '3.Malay']})
# first remove rows with missing values in column VARIABLE
df1 = df1.dropna(subset=['VARIABLE'])
# test for values starting with a number
s = df1['VARIABLE'].str.contains(r'^\d')
# chain with the values shifted by -1 using | (OR)
mask = s | s.shift(-1)
# filter by boolean indexing
df1 = df1[mask]
print (df1)
VARIABLE
3 Gender
4 1.Male
5 2.Female
9 Ethnicity
10 1.Chinese
11 2.Indian
12 3.Malay

Pandas: aggregate column based on values in a different column

Let's say I start with a dataframe that looks like this:
Group Val date
0 home first 2017-12-01
1 home second 2017-12-02
2 away first 2018-03-07
3 away second 2018-03-01
Data types are [string, string, datetime]. I would like to get a dataframe that, for each group, shows me the value that was entered most recently:
Group Most recent Val Most recent date
0 home second 12-02-2017
1 away first 03-07-2018
(Data types are [string, string, datetime])
My initial thought is that I should be able to do something like this by grouping by 'group' and then aggregating the dates and vals. I know I can get the most recent datetime using the 'max' agg function, but I'm stuck on what function to use to get the corresponding val:
df.groupby('Group').agg({'val':lambda x: ____????____
'date':'max'})
Thanks,
If I understood you right, you can do this:
df.iloc[df.groupby('Group').agg({'date': 'idxmax'}).date]
Or as a whole example:
import pandas as pd
import numpy as np
np.random.seed(42)
data = [(np.random.choice(['home', 'away'], size=1)[0],
np.random.choice(['first', 'second'], size=1)[0],
pd.Timestamp(np.random.rand()*1.9989e+18)) for i in range(10)]
df = pd.DataFrame.from_records(data)
df.columns = ['Group', 'Val', 'date']
df.iloc[df.groupby('Group').agg({'date': 'idxmax'}).date]
Which selects
Group Val date
5 away first 2031-06-09 06:26:43.486610432
0 home second 2030-03-22 04:07:07.082781440
from
Group Val date
0 home second 2030-03-22 04:07:07.082781440
1 home second 2007-12-03 05:07:24.061456384
2 home second 1979-11-18 23:57:26.700035456
3 home first 2024-11-12 08:18:17.789517824
4 away second 2014-11-07 13:17:55.756515328
5 away first 2031-06-09 06:26:43.486610432
6 away second 1983-06-14 13:17:28.334806208
7 away second 1981-08-14 03:21:14.746028864
8 away second 2003-03-29 11:00:31.189680256
9 away first 1988-06-12 16:58:48.341865984
First select the indices of the rows whose date value is maximum:
max_indeces = df.groupby(['Group'])['date'].idxmax()
and then select the corresponding rows in the original dataframe, possibly keeping only the value you are actually interested in:
df.iloc[max_indeces]['Val']
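To get the full rows back, closer to the expected output above, a small follow-up sketch (assuming the default integer index, where .loc and .iloc coincide):
max_indeces = df.groupby(['Group'])['date'].idxmax()
result = df.loc[max_indeces].reset_index(drop=True)
print(result)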
