Rolling Status from Pandas Dataframe - python

+------+--------+------------+------------+---+---+---+
| area | locale | date       | end date   | i | t | o |
+------+--------+------------+------------+---+---+---+
| abc  | abc25  | 2001-03-01 | 2001-04-01 | 1 |   |   |
| abc  | abc25  | 2001-04-01 | 2001-05-01 | 1 |   |   |
| abc  | abc25  | 2001-05-01 | 2001-06-01 | 1 |   |   |
| abc  | abc25  | 2001-06-01 | 2001-07-01 |   | 1 |   |
| abc  | abc25  | 2001-07-01 | 2001-08-01 |   |   | 1 |
| abc  | abc25  | 2001-08-01 | 2001-09-01 |   | 1 |   |
| abc  | abc25  | 2001-09-01 | 2001-05-01 |   | 1 |   |
| abc  | abc25  | 2001-10-01 | 2001-11-01 |   | 1 |   |
| abc  | abc25  | 2001-11-01 | 2001-12-01 |   |   | 1 |
| abc  | abc25  | 2001-12-01 |            |   |   | 1 |
| def  | def25  | 2001-03-01 | 2001-04-01 |   |   | 1 |
| def  | def25  | 2001-04-01 | 2001-05-01 |   |   | 1 |
| def  | def25  | 2001-05-01 | 2001-06-01 |   |   | 1 |
| def  | def25  | 2001-06-01 | 2001-07-01 |   | 1 |   |
| def  | def25  | 2001-07-01 | 2001-08-01 |   | 1 |   |
| def  | def25  | 2001-08-01 | 2001-09-01 | 1 |   |   |
| def  | def25  | 2001-09-01 | 2001-05-01 | 1 |   |   |
| def  | def25  | 2001-10-01 | 2001-11-01 |   | 1 |   |
| def  | def25  | 2001-11-01 | 2001-12-01 |   |   | 1 |
| def  | def25  | 2001-12-01 |            |   |   | 1 |
+------+--------+------------+------------+---+---+---+
Here is a sample of the data table I am working with. I am attempting to add a status column, but the criteria for it are a bit tricky:
If any 2 consecutive periods have the same i/t/o flag, they get the associated status (say R/Y/G).
If two consecutive periods have different statuses, you choose the "best" one.
Example Output:
+------+--------+------------+------------+---+---+---+--------+
| area | locale | date       | end date   | i | t | o | Status |
+------+--------+------------+------------+---+---+---+--------+
| abc  | abc25  | 2001-03-01 | 2001-04-01 | 1 |   |   | NONE   |
| abc  | abc25  | 2001-04-01 | 2001-05-01 | 1 |   |   | R      |
| abc  | abc25  | 2001-05-01 | 2001-06-01 | 1 |   |   | R      |
| abc  | abc25  | 2001-06-01 | 2001-07-01 |   | 1 |   | Y      |
| abc  | abc25  | 2001-07-01 | 2001-08-01 |   |   | 1 | G      |
| abc  | abc25  | 2001-08-01 | 2001-09-01 |   | 1 |   | G      |
| abc  | abc25  | 2001-09-01 | 2001-05-01 |   | 1 |   | Y      |
| abc  | abc25  | 2001-10-01 | 2001-11-01 |   | 1 |   | Y      |
| abc  | abc25  | 2001-11-01 | 2001-12-01 |   |   | 1 | G      |
| abc  | abc25  | 2001-12-01 |            |   |   | 1 | G      |
| def  | def25  | 2001-03-01 | 2001-04-01 |   |   | 1 | NONE   |
| def  | def25  | 2001-04-01 | 2001-05-01 |   |   | 1 | G      |
| def  | def25  | 2001-05-01 | 2001-06-01 |   |   | 1 | G      |
| def  | def25  | 2001-06-01 | 2001-07-01 |   | 1 |   | G      |
| def  | def25  | 2001-07-01 | 2001-08-01 |   | 1 |   | Y      |
| def  | def25  | 2001-08-01 | 2001-09-01 | 1 |   |   | Y      |
| def  | def25  | 2001-09-01 | 2001-05-01 | 1 |   |   | R      |
| def  | def25  | 2001-10-01 | 2001-11-01 |   | 1 |   | Y      |
| def  | def25  | 2001-11-01 | 2001-12-01 |   |   | 1 | G      |
| def  | def25  | 2001-12-01 |            |   |   | 1 | G      |
+------+--------+------------+------------+---+---+---+--------+
Now I looked up pandas rolling, but that might not be the best approach. I tried the following:
df.groupby('locale')['o'].rolling(2).sum()
which works on its own, but I can't seem to create a column out of it so that I can say: if that equals 2, then it gets whatever status. I also tried to use it directly in an if statement:
if df.groupby('locale')['o'].rolling(2).sum() == 2.0:
    df['locale_status'] = 'Green'
This gives an error about the truth value of a Series. I also tried:
if df.groupby('locale')['o'] == df.groupby('locale')['o'].shift():
    df['test'] = 'Green'
This results in an invalid type comparison.

I don't think this problem lends itself to vectorization/pandas efficiency, but I'd love to be proven wrong by one of the ninjas on here. My solution involves some prep from pd.read_clipboard() that you probably don't need.
Basically, I replaced blanks with 0, used idxmax to get the 'current' letter per row, and flagged whether there's a streak. I then looped through the rows inside a groupby to pick the 'best' status or continue the streak.
# data cleaning - from clipboard, probably irrelevant to OP
import numpy as np
import pandas as pd

df = pd.read_clipboard(sep='|', engine='python', header=1)
df = df.reset_index().iloc[1:-1, 1:-1]
df = df.rename(columns={' i ': 'i', ' t ': 't', ' o ': 'o'})
df = df.drop('Unnamed: 0', axis=1)
df = df.replace(' ', 0)

# 'current' is the active letter per row; 'streak' flags a repeat of the previous row
df['current'] = df[['i', 't', 'o']].astype(int).idxmax(axis=1)
df['streak'] = df['current'] == df['current'].shift(1)

weights = {'i': 0, 't': 1, 'o': 2}  # higher weight = 'better' status
results = []
for val in df[' area '].unique():
    temp = df.loc[df.groupby(' area ').groups[val]].reset_index(drop=True)
    winner = []
    for idx, row in temp.iterrows():
        if idx == 0:
            winner.append(np.nan)  # first period of a group has no prior to compare
        else:
            current = row['current']
            if row['streak']:
                winner.append(current)
            else:
                # statuses differ: keep whichever of the two periods is 'best'
                last = temp.loc[idx - 1, 'current']
                if weights[last] > weights[current]:
                    winner.append(last)
                else:
                    winner.append(current)
    temp['winner'] = winner
    results.append(temp)

res = pd.concat(results)
res['winner'] = res['winner'].map({'i': 'R', 't': 'Y', 'o': 'G'})
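As a footnote on the attempts in the question: the grouped rolling sum can be aligned back to the original rows with transform, and the comparison made vectorized instead of a Python-level if. A minimal sketch, assuming the i/t/o columns hold numeric 0/1 flags (o_streak and locale_status are illustrative names):
import numpy as np

# transform keeps the rolling result aligned with df's original index
df['o_streak'] = df.groupby('locale')['o'].transform(lambda s: s.rolling(2).sum())

# vectorized condition instead of `if`, whose truth value is ambiguous for a Series
df['locale_status'] = np.where(df['o_streak'] == 2.0, 'Green', 'NONE')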

Related

Match dtypes of two DataFrames that share columns

I have the following dataframes in pandas:
df:
| ID | country | money | code | money_add | other |
| -------- | -------------- | --------- | -------- | --------- | ----- |
| 832932 | Other | NaN | 00000 | NaN | NaN |
| 217#8# | NaN | NaN | NaN | NaN | NaN |
| 1329T2 | France | 12131 | 00020 | 3452 | 123 |
| 124932 | France | NaN | 00016 | NaN | NaN |
| 194022 | France | NaN | 00000 | NaN | NaN |
df1:
| cod_t | money | money_add | other |
| -------- | ------ | --------- | ----- |
| 00000 | 4532 | 72323 | 321 |
| 00016 | 1213 | 23822 | 843 |
| 00018 | 1313 | 8393 | 183 |
| 00020 | 1813 | 27328 | 128 |
| 00030 | 8932 | 3204 | 829 |
cols = df.columns.intersection(df1.columns)
print (df[cols].dtypes.eq(df1[cols].dtypes))
money False
money_add False
other False
dtype: bool
I want to match the dtypes of the columns of the second dataframe to be equal to those of the first one. Is there any way to do this?
Try iterating over only the shared columns, so that cod_t (present in df1 but not in df) doesn't raise a KeyError:
for col in df.columns.intersection(df1.columns):
    df1[col] = df1[col].astype(df[col].dtype)
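An alternative sketch: astype also accepts a column-to-dtype mapping, so the cast can be a single call (assuming every shared column is safely castable):
cols = df.columns.intersection(df1.columns)
df1 = df1.astype(df[cols].dtypes.to_dict())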

shift below cells to count for R

I am using the code below to produce the following result in Python, and I want the equivalent of this code in R.
Here N is a column of the dataframe data. The CN column is calculated from the values of column N with a specific pattern, and it gives me the following result in Python:
+---+----+
| N | CN |
+---+----+
| 0 | 0 |
| 1 | 1 |
| 1 | 1 |
| 2 | 2 |
| 2 | 2 |
| 0 | 3 |
| 0 | 3 |
| 1 | 4 |
| 1 | 4 |
| 1 | 4 |
| 2 | 5 |
| 2 | 5 |
| 3 | 6 |
| 4 | 7 |
| 0 | 8 |
| 1 | 9 |
| 2 | 10 |
+---+----+
A short overview of my code:
data = pd.read_table(filename,skiprows=15,decimal=',', sep='\t',header=None,names=["Date ","Heure ","temps (s) ","X","Z"," LVDT V(mm) " ,"Force normale (N) ","FT","FN(N) ","TS"," NS(kPa) ","V (mm/min)","Vitesse normale (mm/min)","e (kPa)","k (kPa/mm) " ,"N " ,"Nb cycles normal" ,"Cycles " ,"Etat normal" ,"k imposé (kPa/mm)"])
data.columns = [col.strip() for col in data.columns.tolist()]
N = data[data.keys()[15]]
N = np.array(N)
data["CN"] = (data.N.shift().bfill() != data.N).astype(int).cumsum()
An example of data.head() is shown here:
+-------+-------------+------------+-----------+----------+----------+------------+-------------------+-----------+-------------+-----------+------------+------------+--------------------------+------------+------------+-----+------------------+--------+-------------+-------------------+----+
| Index | Date | Heure | temps (s) | X | Z(mm) | LVDT V(mm) | Force normale (N) | FT | FN(N) | FT (kPa) | NS(kPa) | V (mm/min) | Vitesse normale (mm/min) | e (kPa) | k (kPa/mm) | N | Nb cycles normal | Cycles | Etat normal | k imposé (kPa/mm) | CN |
+-------+-------------+------------+-----------+----------+----------+------------+-------------------+-----------+-------------+-----------+------------+------------+--------------------------+------------+------------+-----+------------------+--------+-------------+-------------------+----+
| 184 | 01/02/2022 | 12:36:52 | 402.163 | 6.910243 | 1.204797 | 0.001101 | 299.783665 | 31.494351 | 1428.988908 | 11.188704 | 505.825016 | 0.1 | 2.0 | 512.438828 | 50.918786 | 0.0 | 0.0 | Sort | Monte | 0.0 | 0 |
| 185 | 01/02/2022 | 12:36:54 | 404.288 | 6.907822 | 1.205647 | 4.9e-05 | 296.072718 | 31.162313 | 1404.195316 | 11.028167 | 494.97955 | 0.1 | -2.0 | 500.084986 | 49.685639 | 0.0 | 0.0 | Sort | Descend | 0.0 | 0 |
| 186 | 01/02/2022 | 12:36:56 | 406.536 | 6.907906 | 1.204194 | -0.000214 | 300.231424 | 31.586401 | 1429.123486 | 11.21895 | 505.750815 | 0.1 | 2.0 | 512.370164 | 50.914002 | 0.0 | 0.0 | Sort | Monte | 0.0 | 0 |
| 187 | 01/02/2022 | 12:36:58 | 408.627 | 6.910751 | 1.204293 | -0.000608 | 300.188686 | 31.754064 | 1428.979519 | 11.244542 | 505.624564 | 0.1 | 2.0 | 512.309254 | 50.906544 | 0.0 | 0.0 | Sort | Monte | 0.0 | 0 |
| 188 | 01/02/2022 | 12:37:00 | 410.679 | 6.907805 | 1.205854 | -0.000181 | 296.358074 | 31.563389 | 1415.224427 | 11.129375 | 502.464948 | 0.1 | 2.0 | 510.702313 | 50.742104 | 0.0 | 0.0 | Sort | Monte | 0.0 | 0 |
+-------+-------------+------------+-----------+----------+----------+------------+-------------------+-----------+-------------+-----------+------------+------------+--------------------------+------------+------------+-----+------------------+--------+-------------+-------------------+----+
A one-line cumsum trick solves it.
cumsum(c(0L, diff(df1$N) != 0))
#> [1] 0 1 1 2 2 3 3 4 4 4 5 5 6 7 8 9 10
all.equal(
  cumsum(c(0L, diff(df1$N) != 0)),
  df1$CN
)
#> [1] TRUE
Created on 2022-02-14 by the reprex package (v2.0.1)
Data
x <- "
+---+----+
| N | CN |
+---+----+
| 0 | 0 |
| 1 | 1 |
| 1 | 1 |
| 2 | 2 |
| 2 | 2 |
| 0 | 3 |
| 0 | 3 |
| 1 | 4 |
| 1 | 4 |
| 1 | 4 |
| 2 | 5 |
| 2 | 5 |
| 3 | 6 |
| 4 | 7 |
| 0 | 8 |
| 1 | 9 |
| 2 | 10 |
+---+----+"
df1 <- read.table(textConnection(x), header = TRUE, sep = "|", comment.char = "+")[2:3]
Created on 2022-02-14 by the reprex package (v2.0.1)

Add columns to DataFrame with difference of specific columns based on values of another column

I have a dataframe that looks something like the following:
+------------+------------------+--------+-----+-----+---+--------+-----------------------------+
| B_date | B_Time | F_Type | Fix | Est | S | C_Type | C_Time |
+------------+------------------+--------+-----+-----+---+--------+-----------------------------+
| 2019-07-22 | 16:42:27.7325458 | 1 | 100 | 100 | 2 | 2 | 2019-07-22 16:42:47.2129273 |
| 2019-07-22 | 16:44:04.7817750 | 1 | 100 | 100 | 2 | 2 | 2019-07-22 16:45:26.2923547 |
| 2019-07-22 | 16:48:21.5976290 | 1 | 100 | 100 | 7 | | |
| 2019-07-23 | 13:11:20.4519581 | 1 | 100 | 100 | 7 | | |
| 2019-07-23 | 13:28:49.5092331 | 1 | 100 | 100 | 2 | 2 | 2019-07-23 13:28:54.5274793 |
| 2019-07-23 | 13:29:06.6108796 | 1 | 100 | 100 | 2 | 2 | 2019-07-23 13:30:48.5358081 |
| 2019-07-23 | 13:31:12.7684213 | 1 | 100 | 100 | 2 | 3 | 2019-07-23 13:33:50.9405643 |
| 2019-07-25 | 09:32:12.7799801 | 1 | 105 | 105 | 7 | | |
| 2019-07-25 | 09:57:58.4536238 | 1 | 158 | 158 | 4 | | |
| 2019-07-25 | 10:03:22.7888221 | 1 | 152 | 152 | 2 | 2 | 2019-07-25 10:03:27.9576175 |
+------------+------------------+--------+-----+-----+---+--------+-----------------------------+
I need to get output as follows:
+------------+-------------------------------+--------+-----+-----+---+--------+-------------------------------+---------------+-----------------+---------------+
| B_date | B_Time | F_Type | Fix | Est | S | C_Type | C_Time | cancel_diff_1 | cancel_diff_2 | cancel_diff_3 |
+------------+-------------------------------+--------+-----+-----+---+--------+-------------------------------+---------------+-----------------+---------------+
| 2019-07-22 | 2019-07-22 16:42:27.732545800 | 1 | 100 | 100 | 2 | 2 | 2019-07-22 16:42:47.212927300 | NaT | 00:00:19.480381 | NaT |
| 2019-07-22 | 2019-07-22 16:44:04.781775000 | 1 | 100 | 100 | 2 | 2 | 2019-07-22 16:45:26.292354700 | NaT | 00:01:21.510579 | NaT |
| 2019-07-22 | 2019-07-22 16:48:21.597629000 | 1 | 100 | 100 | 7 | NaN | NaT | NaT | NaT | NaT |
| 2019-07-23 | 2019-07-23 13:11:20.451958100 | 1 | 100 | 100 | 7 | NaN | NaT | NaT | NaT | NaT |
| 2019-07-23 | 2019-07-23 13:28:49.509233100 | 1 | 100 | 100 | 2 | 2 | 2019-07-23 13:28:54.527479300 | NaT | 00:00:05.018246 | NaT |
+------------+-------------------------------+--------+-----+-----+---+--------+-------------------------------+---------------+-----------------+---------------+
I have actually done this using a function that assigns and checks values row by row, which you could say is the plain-Python way; I want to do it with simple pandas operations.
IIUC try this:
df['B_Time'] = pd.to_datetime(df['B_date'] + ' ' + df['B_Time'])
df['C_Time'] = pd.to_datetime(df['C_Time'])
df.loc[df['C_Type']==1, 'cancel_diff_1'] = df.loc[df['C_Type']==1, 'C_Time'] - df.loc[df['C_Type']==1, 'B_Time']
df.loc[df['C_Type']==2, 'cancel_diff_2'] = df.loc[df['C_Type']==2, 'C_Time'] - df.loc[df['C_Type']==2, 'B_Time']
df.loc[df['C_Type']==3, 'cancel_diff_3'] = df.loc[df['C_Type']==3, 'C_Time'] - df.loc[df['C_Type']==3, 'B_Time']
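A slightly more compact sketch of the same idea: compute the difference once, then mask it per type with Series.where (column names mirror the expected output):
diff = df['C_Time'] - df['B_Time']
for t in (1, 2, 3):
    # keep the difference only where C_Type matches; NaT elsewhere
    df[f'cancel_diff_{t}'] = diff.where(df['C_Type'] == t)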

Multi-Index Lookup Mapping

I'm trying to create a new column which has a value based on 2 indices of that row. I have 2 dataframes with equivalent multi-index on the levels I'm querying (but not of equal size). For each row in the 1st dataframe, I want the value of the 2nd df that matches the row's indices.
I originally thought perhaps I could use a .loc[] and filter off the index values, but I cannot seem to get this to change the output row-by-row. If I wasn't using a dataframe object, I'd loop over the whole thing to do it.
I have tried to use the .apply() method, but I can't figure out what function to pass to it.
Creating some toy data with the same structure:
import pandas as pd
import numpy as np

np.random.seed(1)
df = pd.DataFrame({'Aircraft': np.ones(15),
                   'DC': np.append(np.repeat(['A', 'B'], 7), 'C'),
                   'Test': np.array([10, 10, 10, 10, 10, 10, 20, 10, 10, 10, 10, 10, 10, 20, 10]),
                   'Record': np.array([1, 2, 3, 4, 5, 6, 1, 1, 2, 3, 4, 5, 6, 1, 1]),
                   # There are multiple "value" columns in my data, but I have simplified here
                   'Value': np.random.random(15)})
df.set_index(['Aircraft', 'DC', 'Test', 'Record'], inplace=True)
df.sort_index(inplace=True)

v = pd.DataFrame({'Aircraft': np.ones(7),
                  'DC': np.repeat('v', 7),
                  'Test': np.array([10, 10, 10, 10, 10, 10, 20]),
                  'Record': np.array([1, 2, 3, 4, 5, 6, 1]),
                  'Value': np.random.random(7)})
v.set_index(['Aircraft', 'DC', 'Test', 'Record'], inplace=True)
v.sort_index(inplace=True)
df['v'] = df.apply(lambda x: v.loc[df.iloc[x]])
This returns an error about indexing on a MultiIndex.
To set all values to a single "v" value:
df['v'] = float(v.loc[(slice(None), 'v', 10, 1), 'Value'])
So inputs look like this:
--------------------------------------------
| Aircraft | DC | Test | Record | Value |
|----------|----|------|--------|----------|
| 1.0 | A | 10 | 1 | 0.847576 |
| | | | 2 | 0.860720 |
| | | | 3 | 0.017704 |
| | | | 4 | 0.082040 |
| | | | 5 | 0.583630 |
| | | | 6 | 0.506363 |
| | | 20 | 1 | 0.844716 |
| | B | 10 | 1 | 0.698131 |
| | | | 2 | 0.112444 |
| | | | 3 | 0.718316 |
| | | | 4 | 0.797613 |
| | | | 5 | 0.129207 |
| | | | 6 | 0.861329 |
| | | 20 | 1 | 0.535628 |
| | C | 10 | 1 | 0.121704 |
--------------------------------------------
--------------------------------------------
| Aircraft | DC | Test | Record | Value |
|----------|----|------|--------|----------|
| 1.0 | v | 10 | 1 | 0.961791 |
| | | | 2 | 0.046681 |
| | | | 3 | 0.913453 |
| | | | 4 | 0.495924 |
| | | | 5 | 0.149950 |
| | | | 6 | 0.708635 |
| | | 20 | 1 | 0.874841 |
--------------------------------------------
And after the operation, I want this:
| Aircraft | DC | Test | Record | Value | v |
|----------|----|------|--------|----------|----------|
| 1.0 | A | 10 | 1 | 0.847576 | 0.961791 |
| | | | 2 | 0.860720 | 0.046681 |
| | | | 3 | 0.017704 | 0.913453 |
| | | | 4 | 0.082040 | 0.495924 |
| | | | 5 | 0.583630 | 0.149950 |
| | | | 6 | 0.506363 | 0.708635 |
| | | 20 | 1 | 0.844716 | 0.874841 |
| | B | 10 | 1 | 0.698131 | 0.961791 |
| | | | 2 | 0.112444 | 0.046681 |
| | | | 3 | 0.718316 | 0.913453 |
| | | | 4 | 0.797613 | 0.495924 |
| | | | 5 | 0.129207 | 0.149950 |
| | | | 6 | 0.861329 | 0.708635 |
| | | 20 | 1 | 0.535628 | 0.874841 |
| | C | 10 | 1 | 0.121704 | 0.961791 |
Edit: since you are on pandas 0.23.4, just change droplevel to reset_index with drop=True:
df_result = (df.reset_index('DC').assign(v=v.reset_index('DC', drop=True))
               .set_index('DC', append=True)
               .reorder_levels(v.index.names))
Original: one way is to move the DC index level of df into the columns, use assign to create the new column (aligning on the remaining levels), then set_index with append=True and reorder_levels to restore the original index:
df_result = (df.reset_index('DC').assign(v=v.droplevel('DC'))
               .set_index('DC', append=True)
               .reorder_levels(v.index.names))
Out[1588]:
Value v
Aircraft DC Test Record
1.0 A 10 1 0.847576 0.961791
2 0.860720 0.046681
3 0.017704 0.913453
4 0.082040 0.495924
5 0.583630 0.149950
6 0.506363 0.708635
20 1 0.844716 0.874841
B 10 1 0.698131 0.961791
2 0.112444 0.046681
3 0.718316 0.913453
4 0.797613 0.495924
5 0.129207 0.149950
6 0.861329 0.708635
20 1 0.535628 0.874841
C 10 1 0.121704 0.961791
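On newer pandas versions, a droplevel-plus-reindex sketch gives the same result without rebuilding the index (assuming the non-DC levels uniquely identify each row of v):
# look up v's Value for each row of df via the shared (Aircraft, Test, Record) levels
df['v'] = (v.droplevel('DC')['Value']
           .reindex(df.index.droplevel('DC'))
           .values)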

calculate difference of column values for sets of row indices which are not successive in pandas

Say I have the following table:
+----+---------+--------+---------+---------+---------+---------+-----------+-----------+-----------+----------+-----------+------------+------------+---------+---+
| 1 | 0.72694 | 1.4742 | 0.32396 | 0.98535 | 1 | 0.83592 | 0.0046566 | 0.0039465 | 0.04779 | 0.12795 | 0.016108 | 0.0052323 | 0.00027477 | 1.1756 | 1 |
| 2 | 0.74173 | 1.5257 | 0.36116 | 0.98152 | 0.99825 | 0.79867 | 0.0052423 | 0.0050016 | 0.02416 | 0.090476 | 0.0081195 | 0.002708 | 7.48E-05 | 0.69659 | 1 |
| 3 | 0.76722 | 1.5725 | 0.38998 | 0.97755 | 1 | 0.80812 | 0.0074573 | 0.010121 | 0.011897 | 0.057445 | 0.0032891 | 0.00092068 | 3.79E-05 | 0.44348 | 1 |
| 4 | 0.73797 | 1.4597 | 0.35376 | 0.97566 | 1 | 0.81697 | 0.0068768 | 0.0086068 | 0.01595 | 0.065491 | 0.0042707 | 0.0011544 | 6.63E-05 | 0.58785 | 1 |
| 5 | 0.82301 | 1.7707 | 0.44462 | 0.97698 | 1 | 0.75493 | 0.007428 | 0.010042 | 0.0079379 | 0.045339 | 0.0020514 | 0.00055986 | 2.35E-05 | 0.34214 | 1 |
| 7 | 0.82063 | 1.7529 | 0.44458 | 0.97964 | 0.99649 | 0.7677 | 0.0059279 | 0.0063954 | 0.018375 | 0.080587 | 0.0064523 | 0.0022713 | 4.15E-05 | 0.53904 | 1 |
| 8 | 0.77982 | 1.6215 | 0.39222 | 0.98512 | 0.99825 | 0.80816 | 0.0050987 | 0.0047314 | 0.024875 | 0.089686 | 0.0079794 | 0.0024664 | 0.00014676 | 0.66975 | 1 |
| 9 | 0.83089 | 1.8199 | 0.45693 | 0.9824 | 1 | 0.77106 | 0.0060055 | 0.006564 | 0.0072447 | 0.040616 | 0.0016469 | 0.00038812 | 3.29E-05 | 0.33696 | 1 |
| 11 | 0.7459 | 1.4927 | 0.34116 | 0.98296 | 1 | 0.83088 | 0.0055665 | 0.0056395 | 0.0057679 | 0.036511 | 0.0013313 | 0.00030872 | 3.18E-05 | 0.25026 | 1 |
| 12 | 0.79606 | 1.6934 | 0.43387 | 0.98181 | 1 | 0.76985 | 0.0077992 | 0.011071 | 0.013677 | 0.057832 | 0.0033334 | 0.00081648 | 0.00013855 | 0.49751 | 1 |
+----+---------+--------+---------+---------+---------+---------+-----------+-----------+-----------+----------+-----------+------------+------------+---------+---+
I have two sets of row indices:
set1 = [1, 3, 5, 8, 9]
set2 = [2, 4, 7, 10, 10]
Note: here the first row has index value 1, and both sets always have the same length.
What I am looking for is a fast and pythonic way to get the difference of column values for corresponding row indices, that is: the differences of rows 1-2, 3-4, 5-7, 8-10, and 9-10.
For this example, my resultant dataframe is the following:
+---+---------+--------+---------+---------+---------+---------+-----------+-----------+-----------+----------+-----------+------------+------------+---------+---+
| 1 | 0.01479 | 0.0515 | 0.0372 | 0.00383 | 0.00175 | 0.03725 | 0.0005857 | 0.0010551 | 0.02363 | 0.037474 | 0.0079885 | 0.0025243 | 0.00019997 | 0.47901 | 0 |
| 1 | 0.02925 | 0.1128 | 0.03622 | 0.00189 | 0 | 0.00885 | 0.0005805 | 0.0015142 | 0.004053 | 0.008046 | 0.0009816 | 0.00023372 | 0.0000284 | 0.14437 | 0 |
| 3 | 0.04319 | 0.1492 | 0.0524 | 0.00814 | 0.00175 | 0.05323 | 0.0023293 | 0.0053106 | 0.0169371 | 0.044347 | 0.005928 | 0.00190654 | 0.00012326 | 0.32761 | 0 |
| 3 | 0.03483 | 0.1265 | 0.02306 | 0.00059 | 0 | 0.00121 | 0.0017937 | 0.004507 | 0.0064323 | 0.017216 | 0.0016865 | 0.00042836 | 0.00010565 | 0.16055 | 0 |
| 1 | 0.05016 | 0.2007 | 0.09271 | 0.00115 | 0 | 0.06103 | 0.0022327 | 0.0054315 | 0.0079091 | 0.021321 | 0.0020021 | 0.00050776 | 0.00010675 | 0.24725 | 0 |
+---+---------+--------+---------+---------+---------+---------+-----------+-----------+-----------+----------+-----------+------------+------------+---------+---+
My resultant difference values are absolute here.
I can't apply diff(), since the row indices may not be consecutive.
I am currently achieving my aim by looping through the sets.
Is there a pandas trick to do this?
Use loc-based indexing:
df.loc[set1].values - df.loc[set2].values
Ensure that len(set1) equals len(set2). Also keep in mind that set1/set2 are counter-intuitive names for list objects.
You can select by reindexing and then subtract:
df = df.reindex(set1) - df.reindex(set2).values
Note that .loc will raise a FutureWarning here, since passing list-likes to .loc or [] with any missing label will raise a KeyError in future versions.
In short, try the following:
df.iloc[::2].values - df.iloc[1::2].values
Or alternatively, if (as in your question) the indices follow no simple rule:
df.iloc[set1].values - df.iloc[set2].values
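Putting the pieces together, a sketch that honors the question's 1-based row positions and the absolute values shown in the expected output:
import numpy as np

set1 = [1, 3, 5, 8, 9]
set2 = [2, 4, 7, 10, 10]

# shift the 1-based row positions to 0-based for iloc
a = df.iloc[np.asarray(set1) - 1].reset_index(drop=True)
b = df.iloc[np.asarray(set2) - 1].reset_index(drop=True)
result = (a - b).abs()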
