Pandas: aggregate column based on values in a different column - python

Let's say I start with a dataframe that looks like this:
Group Val date
0 home first 2017-12-01
1 home second 2017-12-02
2 away first 2018-03-07
3 away second 2018-03-01
Data types are [string, string, datetime]. I would like to get a dataframe that for each group, shows me the value that was entered most recently:
Group Most recent Val Most recent date
0 home second 12-02-2017
1 away first 03-07-2018
(Data types are [string, string, datetime])
My initial thought is that I should be able to do something like this by grouping by 'group' and then aggregating the dates and vals. I know I can get the most recent datetime using the 'max' agg function, but I'm stuck on what function to use to get the corresponding val:
df.groupby('Group').agg({'val':lambda x: ____????____
'date':'max'})
Thanks,

If I understood you correctly, you can do this (idxmax returns index labels, so select with loc rather than iloc):
df.loc[df.groupby('Group').agg({'date': 'idxmax'}).date]
Or as a whole example:
import pandas as pd
import numpy as np
np.random.seed(42)
data = [(np.random.choice(['home', 'away'], size=1)[0],
         np.random.choice(['first', 'second'], size=1)[0],
         pd.Timestamp(np.random.rand()*1.9989e+18)) for i in range(10)]
df = pd.DataFrame.from_records(data)
df.columns = ['Group', 'Val', 'date']
df.loc[df.groupby('Group').agg({'date': 'idxmax'}).date]
Which selects
Group Val date
5 away first 2031-06-09 06:26:43.486610432
0 home second 2030-03-22 04:07:07.082781440
from
Group Val date
0 home second 2030-03-22 04:07:07.082781440
1 home second 2007-12-03 05:07:24.061456384
2 home second 1979-11-18 23:57:26.700035456
3 home first 2024-11-12 08:18:17.789517824
4 away second 2014-11-07 13:17:55.756515328
5 away first 2031-06-09 06:26:43.486610432
6 away second 1983-06-14 13:17:28.334806208
7 away second 1981-08-14 03:21:14.746028864
8 away second 2003-03-29 11:00:31.189680256
9 away first 1988-06-12 16:58:48.341865984

First select the index labels of the rows where date is largest within each group:
max_indices = df.groupby(['Group'])['date'].idxmax()
and then select the corresponding rows in the original dataframe, optionally keeping only the column you are interested in (idxmax returns labels, so use loc):
df.loc[max_indices, 'Val']
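A minimal sketch of the idxmax approach on the four-row frame from the question. The frame is rebuilt by hand here, and the names latest_idx and latest are illustrative only.

```python
import pandas as pd

# The example frame from the question, rebuilt by hand.
df = pd.DataFrame({
    'Group': ['home', 'home', 'away', 'away'],
    'Val': ['first', 'second', 'first', 'second'],
    'date': pd.to_datetime(['2017-12-01', '2017-12-02',
                            '2018-03-07', '2018-03-01']),
})

# For each group, the index label of the row holding the latest date.
latest_idx = df.groupby('Group')['date'].idxmax()

# idxmax returns labels, so select with loc (not iloc).
latest = df.loc[latest_idx].reset_index(drop=True)
print(latest)
```

With the default RangeIndex, labels and positions coincide, which is why iloc also appears to work on this data; loc stays correct after the frame has been filtered or re-indexed.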

Related

To get subset of dataframe based on index of a label

I have a dataframe from yahoo finance
import pandas as pd
import yfinance
ticker = yfinance.Ticker("INFY.NS")
df = ticker.history(period = '1y')
print(df)
This gives me df as,
If I specify,
date = "2021-04-23"
I need a subset of df containing:
the row whose index label is "2021-04-23"
the 2 rows before that date
the 1 row after that date
The important point is that before and after cannot be computed from date strings, because df may be missing some dates; the rows must be chosen by index position (i.e. the 2 rows at the preceding positions and the 1 row at the next position).
For example, in df, there is no "2021-04-21" but "2021-04-20"
How can we implement this?
You can use integer-based indexing. First find the integer location of the desired date and then take the desired subset with iloc:
import numpy as np

def get_subset(df, date):
    # get the integer positions of the matching date(s)
    matching_dates_inds, = np.nonzero(df.index == date)
    # and take the first one (works in case of duplicates)
    first_matching_date_ind = matching_dates_inds[0]
    # take the 4-row subset: 2 rows before, the date itself, 1 row after;
    # clamp the start at 0 so a negative start does not wrap around
    start = max(0, first_matching_date_ind - 2)
    desired_subset = df.iloc[start: first_matching_date_ind + 2]
    return desired_subset
If you need the rows before and after by position (assuming the date always exists in the DatetimeIndex), use DataFrame.iloc with the position returned by Index.get_loc, and clamp the bounds with max and min so the selection still works when fewer than 2 rows precede the date or no row follows it, as in this sample data:
df = pd.DataFrame({'a':[1,2,3]},
                  index=pd.to_datetime(['2021-04-21','2021-04-23','2021-04-25']))
date = "2021-04-23"
pos = df.index.get_loc(date)
df = df.iloc[max(0, pos-2):min(len(df), pos+2)]
print (df)
a
2021-04-21 1
2021-04-23 2
2021-04-25 3
Note:
max and min are added so that the selection does not fail when the date is the first row (no rows before it), the second row (only one row before it), or the last row (no row after it).
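The get_loc approach can be wrapped in a small helper. This is only a sketch: rows_around, before and after are hypothetical names, and it assumes a unique DatetimeIndex that contains the requested date (get_loc raises KeyError otherwise).

```python
import pandas as pd

def rows_around(df, date, before=2, after=1):
    """Return the row at `date` plus neighbours chosen by position.

    Assumes a unique DatetimeIndex that contains `date`.
    """
    pos = df.index.get_loc(pd.Timestamp(date))
    start = max(0, pos - before)            # do not run past the first row
    stop = min(len(df), pos + after + 1)    # do not run past the last row
    return df.iloc[start:stop]

# Sample data with gaps: 2021-04-21 is missing, as in the question.
sample = pd.DataFrame(
    {'a': range(6)},
    index=pd.to_datetime(['2021-04-19', '2021-04-20', '2021-04-22',
                          '2021-04-23', '2021-04-26', '2021-04-27']),
)

subset = rows_around(sample, '2021-04-23')
print(subset)
```

Because the neighbours are taken by position, the two rows before 2021-04-23 are 2021-04-20 and 2021-04-22, regardless of the missing calendar days.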

pd.merge_asof with aggregation

Given a data frame with start time of a new time period (a new work shift), sum all sales that occur up to next time period (work shift).
import pandas as pd
df_checkpoints = pd.DataFrame({'time':[1,5,10], 'shift':['Adam','Ben','Carl']})
df_sales = pd.DataFrame({'time':[2,6,7,9,15], 'soldCount':[1,2,3,4,5]})
# This is the wanted output...
df_output = pd.DataFrame({'time':[1,5,10], 'shift':['Adam','Ben','Carl'], 'totSold':[1,9,5]})
So pd.merge_asof does what I want except it only does 1:1 merge. Best would be to get a multiIndex dataframe with index[0] being the checkpoints and index[1] being the sales rows, such that I can aggregate freely afterwards. Last resort would be an ugly O(n) loop.
The number of rows in each df is a couple of million.
Any idea?
You can use pd.cut.
For instance, to group by range you can use it like this.
Note that I added 24 as the upper edge of the last range:
pd.cut(df_sales["time"], [1,5,10,24])
To automate this, take your checkpoints, append 24 as the final edge, group by the resulting bins, sum the sales, and reset the index for the concat:
bins = df_checkpoints['time'].tolist() + [24]
group_and_sum = df_sales.groupby(pd.cut(df_sales['time'], bins), observed=False)['soldCount'].sum().reset_index(drop=True).to_frame()
Then concat the 2 dataframes to attach the names:
pd.concat([group_and_sum,df_checkpoints],axis=1)
output
soldCount time shift
0 1 1 Adam
1 9 5 Ben
2 5 10 Carl
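Since the question asks about pd.merge_asof specifically, here is a hedged sketch of that route: tag every sale with the checkpoint that precedes it, then aggregate per checkpoint. The names shift_start, tagged and totals are illustrative, and both frames are assumed to be sorted by time, which merge_asof requires.

```python
import pandas as pd

df_checkpoints = pd.DataFrame({'time': [1, 5, 10],
                               'shift': ['Adam', 'Ben', 'Carl']})
df_sales = pd.DataFrame({'time': [2, 6, 7, 9, 15],
                         'soldCount': [1, 2, 3, 4, 5]})

# Keep the checkpoint time under its own name so it survives the merge.
checkpoints = df_checkpoints.rename(columns={'time': 'shift_start'})

# Each sale gets the last checkpoint whose start time is <= the sale time.
tagged = pd.merge_asof(df_sales, checkpoints,
                       left_on='time', right_on='shift_start')

# Aggregate freely per checkpoint; here, a plain sum.
totals = tagged.groupby('shift_start')['soldCount'].sum()

# Attach the totals to the checkpoints (0 for shifts without sales).
df_output = df_checkpoints.assign(
    totSold=df_checkpoints['time'].map(totals).fillna(0).astype(int))
print(df_output)
```

Unlike the right-closed bins that pd.cut produces by default, the default backward search of merge_asof assigns a sale made exactly at a checkpoint time to the shift that starts at that time.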

Fetch previous rows based on if condition and Shift function - Python dataframe

I have data as shown below. I would like to select rows based on two conditions.
1) rows that start with digits (1,2,3 etc)
2) previous row of the records that satisfy 1st condition
This is what the input data looks like
This is how I expect the output to look
I tried using the shift(-1) function but it throws an error. I am sure I messed up the logic/syntax. The code I tried is below
# i get the index of all records that start with number.
s=df1.loc[df1['VARIABLE'].str.contains('^\d')==True].index
# now I need to get the previous record of each group but this is
#incorrect
df1.loc[((df1['VARIABLE'].shift(-1).str.contains('^\d')==False) &
(df1['VARIABLE'].str.contains('^\d')==True))].index
Use:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'VARIABLE':['studyid',np.nan,'age_interview','Gender','1.Male',
                                '2.Female',np.nan, 'dob', 'eth',
                                'Ethnicity','1.Chinese','2.Indian','3.Malay']})
# first remove rows where VARIABLE is missing
df1 = df1.dropna(subset=['VARIABLE'])
# test for values starting with a digit (raw string so \d is not treated as an escape)
s = df1['VARIABLE'].str.contains(r'^\d')
# also flag the row just before each match; chain with | for OR
mask = s | s.shift(-1, fill_value=False)
# filter by boolean indexing
df1 = df1[mask]
print (df1)
VARIABLE
3 Gender
4 1.Male
5 2.Female
9 Ethnicity
10 1.Chinese
11 2.Indian
12 3.Malay
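To see why shift(-1) is the right direction, a tiny sketch on a hypothetical four-row Series: shifting by -1 moves each row's flag up by one position, so every row learns whether the row after it starts with a digit.

```python
import pandas as pd

labels = pd.Series(['Gender', '1.Male', '2.Female', 'dob'])

# True where the label itself starts with a digit.
starts_with_digit = labels.str.contains(r'^\d')

# True where the NEXT label starts with a digit; the last row has no
# successor, so it is filled with False.
next_starts_with_digit = starts_with_digit.shift(-1, fill_value=False)

# Keep digit rows and the row immediately preceding them.
keep = starts_with_digit | next_starts_with_digit
selected = labels[keep].tolist()
print(selected)  # ['Gender', '1.Male', '2.Female']
```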

Finding rows with highest means in dataframe

I am trying to find the rows, in a very large dataframe, with the highest mean.
Reason: I scan something with laser trackers and used a "higher" point as the reference for where the scan starts. I am trying to find the placed object throughout my data.
I have calculated the mean of each row with:
base = df.mean(axis=1)
base.columns = ['index','Mean']
Here is an example of the mean for each row:
0 4.407498
1 4.463597
2 4.611886
3 4.710751
4 4.742491
5 4.580945
This seems to work fine, except that it adds an index column, and gives out columns with an index of type float64.
I then tried this to locate the rows with highest mean:
moy = base.loc[base.reset_index().groupby(['index'])['Mean'].idxmax()]
This gives:
index Mean
0 0 4.407498
1 1 4.463597
2 2 4.611886
3 3 4.710751
4 4 4.742491
5 5 4.580945
But it only re-indexes (I now have 3 columns instead of two) and does nothing else. It still shows all rows.
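Here is one way without using groupby (this assumes base is a DataFrame with a Mean column; for the Series returned by df.mean(axis=1), call sort_values() with no argument):
moy=base.sort_values('Mean').tail(1)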
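Because df.mean(axis=1) returns a Series indexed by the original row labels, the highest-mean rows can also be looked up directly on that Series. A minimal sketch on made-up data; the column names a and b and the variable names are placeholders.

```python
import pandas as pd

# Made-up measurements; each row is one scan line.
df = pd.DataFrame({'a': [1.0, 5.0, 3.0],
                   'b': [3.0, 7.0, 3.0]})

# One mean per row, indexed by the original row labels.
row_means = df.mean(axis=1)

# Label of the single row with the highest mean.
top_label = row_means.idxmax()

# Labels of the k rows with the highest means, largest first.
top_two = row_means.nlargest(2)

# The corresponding rows of the original frame.
top_rows = df.loc[top_two.index]
print(top_label)
print(top_rows)
```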
It looks as though your data is a string, or a single column with a space between your two numbers. I suggest splitting the column into two and/or using something similar to the code below to set the index to your column of interest.
import pandas as pd
df = pd.read_csv('testdata.txt', names=["Index", "Mean"], delimiter=r"\s+")
df = df.set_index("Index")
print(df)

Filling missing time values in a multi-indexed dataframe

Problem and what I want
I have a data file that comprises time series read asynchronously from multiple sensors. Basically for every data element in my file, I have a sensor ID and time at which it was read, but I do not always have all sensors for every time, and read times may not be evenly spaced. Something like:
ID,time,data
0,0,1
1,0,2
2,0,3
0,1,4
2,1,5 # skip some sensors for some time steps
0,2,6
2,2,7
2,3,8
1,5,9 # skip some time steps
2,5,10
Important note: the actual time column is of datetime type.
What I want is to be able to zero-order hold (forward fill) values for every sensor for any time steps where that sensor does not exist, and either set to zero or back fill any sensors that are not read at the earliest time steps. What I want is a dataframe that looks like it was read from:
ID,time,data
0,0,1
1,0,2
2,0,3
0,1,4
1,1,2 # ID 1 hold value from time step 0
2,1,5
0,2,6
1,2,2 # ID 1 still holding
2,2,7
0,3,6 # ID 0 holding
1,3,2 # ID 1 still holding
2,3,8
0,5,6 # ID 0 still holding, can skip totally missing time steps
1,5,9 # ID 1 finally updates
2,5,10
Pandas attempts so far
I initialize my dataframe and set my indices:
df = pd.read_csv(filename, dtype=int)
df.set_index(['ID', 'time'], inplace=True)
I try to mess with things like:
filled = df.reindex(method='ffill')
or the like with various values passed to the index keyword argument like df.index, ['time'], etc. This always either throws an error because I passed an invalid keyword argument, or does nothing visible to the dataframe. I think it is not recognizing that the data I am looking for is "missing".
I also tried:
df.update(df.groupby(level=0).ffill())
or level=1 based on Multi-Indexed fillna in Pandas, but I get no visible change to the dataframe again, I think because I don't have anything currently where I want my values to go.
Numpy attempt so far
I have had some luck with numpy and non-integer indexing using something like:
data = [np.array(df.loc[level].data) for level in df.index.levels[0]]
shapes = [arr.shape for arr in data]
print(shapes)
# [(3,), (2,), (5,)]
data = [np.array([arr[i] for i in np.linspace(0, arr.shape[0]-1, num=max(shapes)[0])]) for arr in data]
print([arr.shape for arr in data])
# [(5,), (5,), (5,)]
But this has two problems:
It takes me out of the pandas world, and I now have to manually maintain my sensor IDs, time index, etc. along with my feature vector (the actual data column is not just one column but a ton of values from a sensor suite).
Given the number of columns and the size of the actual dataset, this is going to be clunky and inelegant to implement on my real example. I would prefer a way of doing it in pandas.
The application
Ultimately this is just the data-cleaning step for training a recurrent neural network, where for each time step I will need to feed a feature vector that always has the same structure (one set of measurements for each sensor ID for each time step).
Thank you for your help!
Here is one way, using reindex and a categorical time column:
df.time=df.time.astype(pd.CategoricalDtype(categories=[0,1,2,3,4,5]))
new_df=df.groupby('time',as_index=False).apply(lambda x : x.set_index('ID').reindex([0,1,2])).reset_index()
new_df['data']=new_df.groupby('ID')['data'].ffill()
new_df.drop(columns='time').rename(columns={'level_0':'time'})
Out[311]:
time ID data
0 0 0 1.0
1 0 1 2.0
2 0 2 3.0
3 1 0 4.0
4 1 1 2.0
5 1 2 5.0
6 2 0 6.0
7 2 1 2.0
8 2 2 7.0
9 3 0 6.0
10 3 1 2.0
11 3 2 8.0
12 4 0 6.0
13 4 1 2.0
14 4 2 8.0
15 5 0 6.0
16 5 1 9.0
17 5 2 10.0
You can keep a dictionary of the last reading for each sensor. You'll have to pick some initial value; the most logical choice is probably to back-fill the earliest reading to earlier times. Once you've populated your last_reading dictionary, sort all the readings by time, update the dictionary for each reading, and then fill in rows according to the dictionary. So after you have your last_reading dictionary initialized:
last_time = readings[0]['time']
for reading in readings:
    if reading['time'] > last_time:
        # a new time step has started: write out the held values for the previous one
        for ID in ID_list:
            df.loc[last_time, ID] = last_reading[ID]
        last_time = reading['time']
    last_reading[reading['ID']] = reading['data']
# the loop above never writes the final time step,
# so handle that separately
for ID in ID_list:
    df.loc[last_time, ID] = last_reading[ID]
This assumes that you have only one reading for each time/sensor pair, and that readings is a list of dictionaries (with keys 'ID', 'time' and 'data') sorted by time. It also assumes that df has the different sensors as columns and the different times as index. Adjust the code as necessary otherwise. You can probably optimize it further by updating a whole row at once instead of using a for loop, but I didn't want to deal with making sure I had the Pandas syntax right.
Looking at the application, though, you might want each cell in the dataframe to be not a number but a pair of the last value and the time it was read, so replace last_reading[reading['ID']] = reading['data'] with
last_reading[reading['ID']] = [reading['data'], reading['time']]. Your neural net can then decide how to weight data based on how old it is.
I got this to work with the following, which I think is pretty general for any case like this where the time index for which you want to fill values is the second in a multi-index with two indices:
# Remove duplicate time indices (happens some in the dataset, pandas freaks out).
df = df[~df.index.duplicated(keep='first')]
# Unstack the dataframe and fill values per serial number forward, backward.
df = df.unstack(level=0)
df.update(df.ffill()) # first ZOH forward
df.update(df.bfill()) # now back fill values that are not seen at the beginning
# Restack the dataframe and re-order the indices.
df = df.stack(level=1)
df = df.swaplevel()
This gets me what I want, although I would love to be able to keep the duplicate time entries if anybody knows of a good way to do this.
You could also use df.update(df.fillna(0)) instead of backfilling if starting unseen values at zero is preferable for a particular application.
I put the above code block in a function called clean_df that takes the dataframe as argument and returns the cleaned dataframe.
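An alternative sketch that keeps the data in long form: reindex only creates rows when it is given a target index, so build the full ID x time grid explicitly and then fill within each sensor. The sample data is the integer-time example from the question; the names raw, full_index and filled are illustrative, and the (ID, time) pairs are assumed to be unique.

```python
import pandas as pd

# The sample readings from the question, indexed by (ID, time).
raw = pd.DataFrame({
    'ID':   [0, 1, 2, 0, 2, 0, 2, 2, 1, 2],
    'time': [0, 0, 0, 1, 1, 2, 2, 3, 5, 5],
    'data': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
}).set_index(['ID', 'time'])

# Every sensor at every observed time step (time 4 never occurs, so it is skipped).
full_index = pd.MultiIndex.from_product(
    [raw.index.levels[0], raw.index.levels[1]], names=['ID', 'time'])

# Missing (ID, time) pairs appear as NaN rows.
filled = raw.reindex(full_index)

# Zero-order hold per sensor, then back-fill sensors unseen at the start.
filled = filled.groupby(level='ID').ffill()
filled = filled.groupby(level='ID').bfill()

# Optional: time-major ordering, as in the desired output.
by_time = filled.swaplevel().sort_index()
print(by_time)
```

Replacing the bfill step with fillna(0) starts unseen sensors at zero instead. The data column becomes float because NaN is introduced during the reindex.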
