I have a timeseries dataset that has intervals of bad measurements. I clean the data by using df.mask() to reject the bad measurements that are above or below a threshold. However, I'm concerned that part of the adjacent intervals are impacted by bad measurements as well, but not enough to exceed the threshold. To be safe, I'd like to also mask these adjacent intervals as well.
For example:
>>> df
seconds value
0 1 5
1 2 2
2 3 -1
3 4 -3
4 5 2
5 6 4
6 7 6
>>> # Mask the negative values because we know those are bad measurements
>>> df["good value"] = df["value"].mask(lambda x: x < 0)
>>> df
seconds value good value
0 1 5 5.0
1 2 2 2.0 # <--- want to mask as well
2 3 -1 NaN
3 4 -3 NaN
4 5 2 2.0 # <--- want to mask as well
5 6 4 4.0
6 7 6 6.0
How can I expand any blocks of masked values into one or two adjacent rows?
You can shift the mask to adjacent rows
df["good value"] = df["value"].mask(df["value"].lt(0) | df["value"].lt(0).shift(-1) | df["value"].lt(0).shift())
print(df)
seconds value good value
0 1 5 5.0
1 2 2 NaN
2 3 -1 NaN
3 4 -3 NaN
4 5 2 NaN
5 6 4 4.0
6 7 6 6.0
Lets say we want to compute the variable D in the dataframe below based on time values in variable B and C.
Here, second row of D is C2 - B1, the difference is 4 minutes and
third row = C3 - B2= 4 minutes,.. and so on.
There is no reference value for first row of D so its NA.
Issue:
We also want a NA value for the first row when the category value in variable A changes from 1 to 2. In other words, the value -183 must be replaced by NA.
A B C D
1 5:43:00 5:24:00 NA
1 6:19:00 5:47:00 4
1 6:53:00 6:23:00 4
1 7:29:00 6:55:00 2
1 8:03:00 7:31:00 2
1 8:43:00 8:05:00 2
2 6:07:00 5:40:00 -183
2 6:42:00 6:11:00 4
2 7:15:00 6:45:00 3
2 7:53:00 7:17:00 2
2 8:30:00 7:55:00 2
2 9:07:00 8:32:00 2
2 9:41:00 9:09:00 2
2 10:17:00 9:46:00 5
2 10:52:00 10:20:00 3
You can use:
# Compute delta
df['D'] = (pd.to_timedelta(df['C']).sub(pd.to_timedelta(df['B'].shift()))
.dt.total_seconds().div(60))
# Fill nan
df.loc[df['A'].ne(df['A'].shift()), 'D'] = np.nan
Output:
>>> df
A B C D
0 1 5:43:00 5:24:00 NaN
1 1 6:19:00 5:47:00 4.0
2 1 6:53:00 6:23:00 4.0
3 1 7:29:00 6:55:00 2.0
4 1 8:03:00 7:31:00 2.0
5 1 8:43:00 8:05:00 2.0
6 2 6:07:00 5:40:00 NaN
7 2 6:42:00 6:11:00 4.0
8 2 7:15:00 6:45:00 3.0
9 2 7:53:00 7:17:00 2.0
10 2 8:30:00 7:55:00 2.0
11 2 9:07:00 8:32:00 2.0
12 2 9:41:00 9:09:00 2.0
13 2 10:17:00 9:46:00 5.0
14 2 10:52:00 10:20:00 3.0
You can use the difference between datetime columns in pandas.
Having
df['B_dt'] = pd.to_datetime(df['B'])
df['C_dt'] = pd.to_datetime(df['C'])
Makes the following possible
>>> df['D'] = (df.groupby('A')
.apply(lambda s: (s['C_dt'] - s['B_dt'].shift()).dt.seconds / 60)
.reset_index(drop=True))
You can always drop these new columns later.
I'm trying to group several group of columns to count or sum the rows in a pandas dataframe
I've checked many questions already and the most similar I found is this one > Groupby sum and count on multiple columns in python, but, by what I understand I have to do many steps to reach my goal. and was also looking at this link
As an example, I have the dataframe below:
import numpy as np
df = pd.DataFrame(np.random.randint(0,5,size=(5, 7)), columns=["grey2","red1","blue1","red2","red3","blue2","grey1"])
grey2 red1 blue1 red2 red3 blue2 grey1
0 4 3 0 2 4 0 2
1 4 2 0 4 0 3 1
2 1 1 3 1 1 3 1
3 4 4 1 4 1 1 1
4 3 4 1 0 3 3 1
I want to group here, all the columns by colour, for example, and what I would expect is:
If I sum the numbers,
blue 15
grey 22
red 34
If I count ( x > 0 ) then I will get,
blue 7
grey 10
red 13
this is what I have achieved so far, so now i will have to sum and then create a dataframe with the results, but if I have 100 groups,this would be very time consuming.
pd.pivot_table(data=df, index=df.index, values=["red1","red2","red3"], aggfunc='sum', margins=True)
red1 red2 red3
0 3 2 4
1 2 4 0
2 1 1 1
3 4 4 1
4 4 0 3
ALL 14 11 9
pd.pivot_table(data=df, index=df.index, values=["red1","red2","red3"], aggfunc='count', margins=True)
But here is also counting the zeros:
red1 red2 red3
0 1 1 1
1 1 1 1
2 1 1 1
3 1 1 1
4 1 1 1
All 5 5 5
Not sure how to alter the function to get my results, and I've already spend hours, hopefully you can help.
NOTE:
I only use colours in this example to simplify the case, but I could have around many columns and they are called col001 till col300, etc...
So, the groups could be:
blue = col131, col254, col005
red = col023, col190, col053
and so on.....
You can use pd.wide_to_long:
data= pd.wide_to_long(df.reset_index(), stubnames=['grey','red','blue'],
i='index',
j='group',
sep=''
)
Output:
# data
grey red blue
index group
0 1 2.0 3 0.0
2 4.0 2 0.0
3 NaN 4 NaN
1 1 1.0 2 0.0
2 4.0 4 3.0
3 NaN 0 NaN
2 1 1.0 1 3.0
2 1.0 1 3.0
3 NaN 1 NaN
3 1 1.0 4 1.0
2 4.0 4 1.0
3 NaN 1 NaN
4 1 1.0 4 1.0
2 3.0 0 3.0
3 NaN 3 NaN
And:
data.sum()
# grey 22.0
# red 34.0
# blue 15.0
# dtype: float64
data.gt(0).sum()
# grey 10
# red 13
# blue 7
# dtype: int64
Update wide_to_long is just a convenient shortcut for merge and rename. So if you have a dictionary {cat:[col_list]}, you could resolve to that:
groups = {'blue' : ['col131', 'col254', 'col005'],
'red' : ['col023', 'col190', 'col053']}
# create the inverse dictionary for mapping
inv_group = {v:k for k,v in groups.items()}
data = df.melt()
# map the original columns to group
data['group'] = data['variable'].map(inv_group)
# from now on, it's similar to other answers
# sum
data.groupby('group')['value'].sum()
# count
data['value'].gt(0).groupby(data['group']).sum()
The complication here is that you want to collapse both by rows and columns, which is generally difficult to do at the same time. We can melt to go from your wide format to a longer format, which then reduces the problem to a single groupby
# Get rid of the numbers + reshape
df.columns = pd.Index(df.columns.str.rstrip('0123456789'), name='color')
df = df.melt()
df.groupby('color').sum()
# value
#color
#blue 15
#grey 22
#red 34
df.value.gt(0).groupby(df.color).sum()
#color
#blue 7.0
#grey 10.0
#red 13.0
#Name: value, dtype: float64
With names that are less simple to group, we'd need to have the mapping somewhere, the steps are very similar:
# Unnecessary in this case, but more general
d = {'grey1': 'color_1', 'grey2': 'color_1',
'red1': 'color_2', 'red2': 'color_2', 'red3': 'color_2',
'blue1': 'color_3', 'blue2': 'color_3'}
df.columns = pd.Index(df.columns.map(d), name='color')
df = df.melt()
df.groupby('color').sum()
# value
#color
#color_1 22
#color_2 34
#color_3 15
Use:
df.groupby(df.columns.str.replace('\d+', ''),axis=1).sum().sum()
Output:
blue 15
grey 22
red 34
dtype: int64
this works regardless of the number of digits contained in the name of the columns:
df=df.add_suffix('22')
print(df)
grey22222 red12222 blue12222 red22222 red32222 blue22222 grey12222
0 4 3 0 2 4 0 2
1 4 2 0 4 0 3 1
2 1 1 3 1 1 3 1
3 4 4 1 4 1 1 1
4 3 4 1 0 3 3 1
df.groupby(df.columns.str.replace('\d+', ''),axis=1).sum().sum()
blue 15
grey 22
red 34
dtype: int64
You could also do something like this for the general case:
colors = {'blue':['blue1','blue2'], 'red':['red1','red2','red3'], 'grey':['grey1','grey2']}
orig_columns = df.columns
df.columns = [key for col in df.columns for key in colors.keys() if col in colors[key]]
print(df.groupby(level=0,axis=1).sum().sum())
df.columns = orig_columns
Given a dataframe df, I would like to generate a new variable/column for each row based on the values in the previous n rows (for example previous 3).
For example, given the following:
INPUT
A B C
10 2 59.4
53 3 71.5
32 2 70.4
24 3 82.1
Calculation for D: if in the actual row in C or previous 3 rows in C there are 2 or more cells > 70 then 1, else 0
OUTPUT
A B C D
10 2 59.4 0
53 3 71.5 0
32 2 70.4 1
24 3 82.1 1
How should I do it in pandas?
IIUC, should use rolling and build your logic in the apply
window = 3
df.C.rolling(window).apply(lambda s: 1 if (s>=70).size >= 2 else 0)
0 NaN
1 NaN
2 1.0
3 1.0
You can also fillna to turn NaNs into 0
.fillna(0)
0 0.0
1 0.0
2 1.0
3 1.0
I think #RafaelC's answer is the right approach. I'm adding an answer to (a) provide better example data that covers edge cases and (b) to adjust #RafaelC's syntax slightly. In particular:
min_periods = 1 allows for early rows whose index values are smaller than the window to be non-NaN
window = 4 allows for the current entry plus the previous 3 to be considered
Use sum() instead of size to get only True values
Updated code:
window = 4
df.C.rolling(window, min_periods=1).apply(lambda x: (x>70).sum()>=2)
Data:
A B C
10 2 59.4
53 3 71.5
32 2 70.4
24 3 82.1
11 4 10.1
10 5 1.0
12 3 2.3
13 2 1.1
99 9 70.2
12 9 80.0
Expected output according to OP rules:
0 0.0
1 0.0
2 1.0
3 1.0
4 1.0
5 1.0
6 0.0
7 0.0
8 0.0
9 1.0
Name: C, dtype: float64
I have a MultiIndex Series (3 indices) that looks like this:
Week ID_1 ID_2
3 26 1182 39.0
4767 42.0
31393 20.0
31690 42.0
32962 3.0
....................................
I also have a dataframe df which contains all the columns (and more) used for indices in the Series above, and I want to create a new column in my dataframe df that contains the value matching the ID_1 and ID_2 and the Week - 2 from the Series.
For example, for the row in dataframe that has ID_1 = 26, ID_2 = 1182 and Week = 3, I want to match the value in the Series indexed by ID_1 = 26, ID_2 = 1182 and Week = 1 (3-2) and put it on that row in a new column. Further, my Series might not necessarily have the value required by the dataframe, in which case I'd like to just have 0.
Right now, I am trying to do this by using:
[multiindex_series.get((x[1].get('week', 2) - 2, x[1].get('ID_1', 0), x[1].get('ID_2', 0))) for x in df.iterrows()]
This however is very slow and memory hungry and I was wondering what are some better ways to do this.
FWIW, the Series was created using
saved_groupby = df.groupby(['Week', 'ID_1', 'ID_2'])['Target'].median()
and I'm willing to do it a different way if better paths exist to create what I'm looking for.
Increase the Week by 2:
saved_groupby = df.groupby(['Week', 'ID_1', 'ID_2'])['Target'].median()
saved_groupby = saved_groupby.reset_index()
saved_groupby['Week'] = saved_groupby['Week'] + 2
and then merge df with saved_groupby:
result = pd.merge(df, saved_groupby, on=['Week', 'ID_1', 'ID_2'], how='left')
This will augment df with the target median from 2 weeks ago.
To make the median (target) saved_groupby column 0 when there is no match, use fillna to change NaNs to 0:
result['Median'] = result['Median'].fillna(0)
For example,
import numpy as np
import pandas as pd
np.random.seed(2016)
df = pd.DataFrame(np.random.randint(5, size=(20,5)),
columns=['Week', 'ID_1', 'ID_2', 'Target', 'Foo'])
saved_groupby = df.groupby(['Week', 'ID_1', 'ID_2'])['Target'].median()
saved_groupby = saved_groupby.reset_index()
saved_groupby['Week'] = saved_groupby['Week'] + 2
saved_groupby = saved_groupby.rename(columns={'Target':'Median'})
result = pd.merge(df, saved_groupby, on=['Week', 'ID_1', 'ID_2'], how='left')
result['Median'] = result['Median'].fillna(0)
print(result)
yields
Week ID_1 ID_2 Target Foo Median
0 3 2 3 4 2 0.0
1 3 3 0 3 4 0.0
2 4 3 0 1 2 0.0
3 3 4 1 1 1 0.0
4 2 4 2 0 3 2.0
5 1 0 1 4 4 0.0
6 2 3 4 0 0 0.0
7 4 0 0 2 3 0.0
8 3 4 3 2 2 0.0
9 2 2 4 0 1 0.0
10 2 0 4 4 2 0.0
11 1 1 3 0 0 0.0
12 0 1 0 2 0 0.0
13 4 0 4 0 3 4.0
14 1 2 1 3 1 0.0
15 3 0 1 3 4 2.0
16 0 4 2 2 4 0.0
17 1 1 4 4 2 0.0
18 4 1 0 3 0 0.0
19 1 0 1 0 0 0.0