Smoothing Categorical Output - python

I have a list of outputs obtained from a cow behavior detection model. Even in a video when a cow is laying, often time it identifies as standing and vice versa. In each video frame, a classification result is given by the model and we are appending it into a list. Let's assume after 20 frames, we have a series of output as follows -
behavious_cow_1 = ["stand","stand","stand","stand","lying", "stand","stand", "eating", "stand","stand","stand","stand","lying""stand","stand","stand","stand","stand","stand","lying"]
Out of 20 classification results, we have 4 misclassification; 3 lyings, and 1 eating. However, the whole time the cow was sitting at a place. If the list only contained numerical values like - 1,2,3..., I would have opted for moving average to change the misclassification. Is there any Scipy, Pandas, Numpy function that can smooth the categorical output? I am thinking about taking previous 3 and next 3 values to determine the current category.

I used the following solution -
import scipy.stats
window_length = 7
behave = ["stand","stand","stand","stand","lying","lying", "eating"]
most_freq_val = lambda x: scipy.stats.mode(x)[0][0]
smoothed = [most_freq_val(behave[i:i+window_length]) for i in range(0,len(behave)-window_length+1)]
I tried the solution posted by Hugolmn but it broke at a point. In the rolling mode, the window width is provided by the user (7 here). In a certain width, if more than one values are present in the same number of times, the code does not work. It's more like - you tried to find the statistical mode (most common item) of a list but it got more than one item with the same highest frequency.

I am myself very surprised that a function such as mode() does not work with a rolling window in pandas. However, I still found a decent solution to your problem
First, create a pandas Series with categorical datatype:
df = pd.Series(sample, dtype='category')
Now you can see that df.cat.categories returns the list of categories in your data, and df.cat.codes the codes associated to them. We can use the latter to apply a rolling mode with a width of 7 (3 previous, the value, and the next 3):
df.cat.codes
0 3
1 3
2 3
3 3
4 1
5 3
6 3
7 0
8 3
9 3
10 3
11 3
12 2
13 3
14 3
15 3
16 3
17 3
18 1
dtype: int8
df.cat.codes.rolling(7, center=True, min_periods=0).apply(lambda x: x.mode())
0 3.0
1 3.0
2 3.0
3 3.0
4 3.0
5 3.0
6 3.0
7 3.0
8 3.0
9 3.0
10 3.0
11 3.0
12 3.0
13 3.0
14 3.0
15 3.0
16 3.0
17 3.0
18 3.0
dtype: float64
Finally, you can map the codes to get the strings back:
(df.cat.codes
.rolling(7, center=True, min_periods=0)
.apply(lambda x: x.mode())
.map(dict(enumerate(df.cat.categories)))
)
0 stand
1 stand
2 stand
3 stand
4 stand
5 stand
6 stand
7 stand
8 stand
9 stand
10 stand
11 stand
12 stand
13 stand
14 stand
15 stand
16 stand
17 stand
18 stand
dtype: object
And there you go ! You recovered your strings after applying a rolling mode on their codes !

Related

Using Pandas how can I split a number into categories or tiers

I would like to split a column, representing a number of days that attract different charges, into tiers. For example: A rail-car is sitting on a customer's location for 11 days. Typically the customer will get 5 "free" days to unload the car, then another 5 days at $50/day, and then all other days may be $75 per day.
So in this case I would like to generate the tier.X.days like:
days
tier.1.threshold
tier.2.threshold
tier.3.threshold
tier.1.days
tier.2.days
tier.3.days
11
5
5
NaN
5
5
1
3
5
5
NaN
3
0
0
7
5
5
NaN
5
2
0
7
3
3
NaN
3
3
1
From a table with just the days and thresholds. Wondering if there is an elegant way of doing it. The number of tiers is variable as well but I can generate those in a different way. In the end I would like to easily generate tier.X.charge which would be tier.X.days times tier.X.rate (the 5 days X $50/days for example).
You can use cumsum to calculate the accumulate thresholds, then subtract them off the days, clipping negative values to 0:
# extract the thresholds
thresh = df.filter(like='threshold')
# cumsum of the threshold
cumsums = thresh.cumsum(1).shift(1,fill_value=0,axis=1)
# subtract the accumulate thresholds from the days
# then clipping,
out = np.minimum(-cumsums.sub(df['days'], axis='rows'), thresh.fillna(cumsums)).clip(0)
# also
# out = (-cumsums.sub(df['days'], axis='rows')).clip(lower=0, upper=thresh.fillna(cumsums))
Output:
tier.1.threshold tier.2.threshold tier.3.threshold
0 5.0 5.0 1.0
1 3.0 0.0 0.0
2 5.0 2.0 0.0
3 3.0 3.0 1.0

Python : How to assign ranks to categorical variables within a group in Python

Given I have a dataset containing only the first two columns, how do I create another column using Python which will contain the rank based on these ranges for each group separately. My desired output would look like this -
id
range
rank
1
10-20
2
1
20-30
3
1
5-10
1
2
20-30
2
2
10-20
1
2
3
10-20
2
3
5-10
1
3
20-30
3
3
30+
4
NOTE - These are the only 4 ranges [5-10, 10-20, 20-30, 30+] that can belong to any id at max. There can be blanks as well For example as given in the reproducible example, if for id 2 there are two ranges 10-20 and 20-30 the corresponding to 10-20 the rank will be 1 and corresponding to 20-30 the rank will be 2. I have checked that df.groupby can be used but I am not being able to figure out how in this case.
Convert your range column to a category dtype before apply rank:
df['range'] = df['range'].astype(pd.CategoricalDtype(
['5-10', '10-20', '20-30', '30+'], ordered=True))
df['rank'] = df.groupby('id')['range'].apply(lambda x: x.rank())
>>> df
id range rank
0 1 10-20 2.0
1 1 20-30 3.0
2 1 5-10 1.0
3 2 20-30 2.0
4 2 10-20 1.0
5 2 NaN NaN
6 3 10-20 2.0
7 3 5-10 1.0
8 3 20-30 3.0
9 3 30+ 4.0

Pandas Frequency of subcategories in a GroupBy

I have a DataFrame as shown below:
First I would like to get the overall frequencies of the CODE values, call it FREQ, then frequencies of the CODE values within each AXLE group and call it GROUP_FREQ.
I was able to calculate the FREQ column using the below code:
pivot = df[['AXLES','CODE']].pivot(['CODE']).agg(['count','mean','min','max'])
pivot['FREQ']=grouped_df.AXLES['count']/pivot.AXLES['count'].sum()*100`
this provides a nice grouped DataFrame as shown below:
However, I could not figure out how to calculate the frequencies within each AXLE group using this grouped_by DataFrame in the next step.
I tried:
pivot['GROUPFREQ']=pivot['AXLES','mean']['count']/pivot['AXLES','mean']['count'].sum()*100
However, this gives a KeyError: 'count'.
I may be on the wrong path, and what I am trying to achieve may not be done using groupby. I decided to check with the community after spending a couple of hours of trial and errors. I'd be glad if you could let me know what you think.
Thanks!
EDIT:
Reproducible input DataFrame:
,CODE,AXLES
0,0101,5
1,001,4
2,0110111,8
3,010111,7
4,0100,5
5,0101,5
6,0110111,8
7,00111,6
8,00111,6
9,0110111,8
10,0100,5
11,0110011,8
12,01011,6
13,0110111,8
14,0110111,8
15,011011,7
16,011011,7
17,011011,7
18,01011,6
19,01011,6
Desired Output for pivot DataFrame:
CODE,COUNT,AXLES,FREQ,GROUPFREQ
001,1,4,0.05,1.00
00111,2,6,0.1,0.40
0100,2,5,0.1,0.50
0101,2,5,0.1,0.50
01011,3,6,0.15,0.60
010111,1,7,0.05,0.25
0110011,1,8,0.05,0.17
011011,3,7,0.15,0.75
0110111,5,8,0.25,0.83
For the first line of the output:
001 is seen only once in the whole data set (20 records). Thus FREQ = 1/20 = 0.05
When the data is grouped by AXLES, for the AXLES=4 group, 001 is the only record, thus the GROUPFREQ = 1/1 = 1.00. (The same code cannot occur under various AXLE groups, so 001 only needs to be checked for AXLES=4.)
Do you mean:
pivot['FREQ'] = df.groupby('AXLES').CODE.value_counts(normalize=True).reset_index(level=0,drop=True)
Output:
AXLES FREQ
count mean min max
CODE
1 1 4 4 4 1.000000
100 2 5 5 5 0.500000
101 2 5 5 5 0.500000
111 2 6 6 6 0.400000
1011 3 6 6 6 0.600000
10111 1 7 7 7 0.250000
11011 3 7 7 7 0.750000
110011 1 8 8 8 0.166667
110111 5 8 8 8 0.833333

Insert row in dataframe between every existing row and it must contain information from the previous and next rows

Okay this one I'm a little bit stuck on.
I have a dataframe like this:
time Throttle Vout
0 1056.65785 1 8
1 1056.66255 2 8
2 1056.66785 3 9
3 1056.67330 4 11
4 1056.67840 5 15
and I need to add a row between every existing row - the whole dataset is about 21000 rows. The time should be equal to the time in the next row. Any other columns should have the values of the previous row.
So the outcome would be something like this:
time Throttle Vout
0 1056.65785 1 8
1 1056.66255 1 8 <---- new row
2 1056.66255 2 8
3 1056.66785 2 8 <---- new row
4 1056.66785 3 9
5 1056.67330 3 9 <---- new row
6 1056.67330 4 11
7 1056.67840 4 11 <---- new row
8 1056.67840 5 15
I've looked into df.apply() but not sure where to start
Serge Ballesta answer:
So this works with the test data supplied above. When I test it on a much larger DataFrame I start to see some errors. I originally thought it was something wrong in my PyCharm but testing with a larger dataset in powershell proved otherwise.
Quang Hoang answer:
So this also worked on a small scale but when using a larger dataset it seemed to have quite a few issues with both time and the other columns. I've highlighted some in the image below. The top df is the original and the bottom is the altered.
Valdi_Bo Answer
The additional columns seemed to work well with this but there seems to be an issue with the times columns on larger datasets. I've highlighted some below.
You can use a combination of concat and ffill:
(pd.concat([df, df[['time']].shift(-1)])
.sort_index(kind='mergesort')
.dropna(how='all')
.ffill()
)
Output:
time Throttle Vout
0 1056.65785 1.0 8.0
0 1056.66255 1.0 8.0
1 1056.66255 2.0 8.0
1 1056.66785 2.0 8.0
2 1056.66785 3.0 9.0
2 1056.67330 3.0 9.0
3 1056.67330 4.0 11.0
3 1056.67840 4.0 11.0
4 1056.67840 5.0 15.0
I would build a copy of the dataframe, shift its time column, concatenate it to the original dataframe and sort the result according to time:
df2 = df.copy()
df2['time'] = df['time'].shift(-1)
result =
df2[~df2['time'].isna()].append(df).sort_values('time').reset_index(drop=True)
It gives as expected:
time Throttle Vout
0 1056.65785 1 8
1 1056.66255 1 8
2 1056.66255 2 8
3 1056.66785 2 8
4 1056.66785 3 9
5 1056.67330 3 9
6 1056.67330 4 11
7 1056.67840 4 11
8 1056.67840 5 15
This might look a bit overwhelming, but the idea is that you merge the original dataframe with its copy whose values in Throttle & Vout columns are shifted by 1:
pd.concat([
df,
df.loc[:,'Throttle':].shift(1).combine_first(df)
]).reset_index().loc[1:,].sort_values(['time', 'Throttle'])
First compute an auxiliary DataFrame - a copy of df, with time column
shifted 1 place up and without the last original row:
df2 = df.copy()
df2.time = df2.time.shift(-1)
df2.dropna(inplace=True)
The result, for your input sample, is:
time Throttle Vout
0 1056.66255 1 8
1 1056.66785 2 8
2 1056.67330 3 9
3 1056.67840 4 11
and these are the new rows to insert.
To get a concatenation of these 2 DataFrames, in proper order, run:
df = pd.concat([df, df2], keys=[1, 2]).swaplevel().sort_index().reset_index(drop=True)
To guarantee the proper order of rows, I added to the previous solution:
keys - to add "origin indicators", but they area added as the top
level of the MultiIndex,
swaplevel - swap levels of the MulitiIndex, to provide proper sort
by index (in the next step).

Pandas rolling apply with missing data

I want to do a rolling computation on missing data.
Sample Code: (For sake of simplicity I'm giving an example of a rolling sum but I want to do something more generic.)
foo = lambda z: z[pandas.notnull(z)].sum()
x = np.arange(10, dtype="float")
x[6] = np.NaN
x2 = pandas.Series(x)
pandas.rolling_apply(x2, 3, foo)
which produces:
0 NaN
1 NaN
2 3
3 6
4 9
5 12
6 NaN
7 NaN
8 NaN
9 24
I think that during the "rolling", window with missing data is being ignored for computation. I'm looking to get a result along the lines of:
0 NaN
1 NaN
2 3
3 6
4 9
5 12
6 9
7 12
8 15
9 24
In [7]: pandas.rolling_apply(x2, 3, foo, min_periods=2)
Out[7]:
0 NaN
1 1
2 3
3 6
4 9
5 12
6 9
7 12
8 15
9 24
It would be better to replace the NA values in the data-set with logical substitutions before operating on them.
For Numerical Data:
For your given example, a simple mean around the NA would fill it perfectly, but what if x[7] = np.NaN were eliminated as well?
Analysis of the surrounding data shows a linear pattern, so a lerp(linear-interpolate) is in order.
Same goes for polynomial, exponential, log, and periodic(cosine) data.
If an inflection point, a change in the second derivative of the data(subtract pairwise points twice, and note if the sign changes), happens during the missing data, its position is unknowable unless the other side picks it up perfectly, if not, pick a random point and continue.
For Categorical Data:
from scipy import stats
Use:
x=pandas.rolling_apply(x2, 3, (lambda x : stats.mode(x,nan_policy='omit')) to replace the missing values with the most common of the nearest 3.
For Static data:
Use:
Replace 0 with the appropriate value.
x = x.fillna(0)

Categories

Resources