Replacing values with the next unique one - python

In my pandas dataframe I have a column of non-unique values. I want to add a second column that contains the next unique value, i.e., given
col
1
5
5
2
2
4
the result should be:
col  addedCol
1    5
5    2
5    2
2    4
2    4
4    (last value doesn't matter)
How can I achieve this using pandas?
To clarify: I want each row to contain the next value that differs from that row's own value. I hope that explains it better.

IIUC, you need the next value which is different from the current value.
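(The short outputs below come from a four-row frame; a minimal setup, assuming the first four values of the question's column:)

import pandas as pd

df = pd.DataFrame({'col': [1, 5, 5, 2]})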
df.loc[:, 'col2'] = df.drop_duplicates().shift(-1).col
df['col2'] = df['col2'].ffill()
col col2
0 1 5.0
1 5 2.0
2 5 2.0
3 2 2.0
(Notice that the last 2.0 value doesn't matter.) As suggested by @MartijnPieters,
df['col2'] = df['col2'].astype(int)
can convert the values back to the original integers if needed.
Adding another good solution, from @piRSquared:
df.assign(addedcol=df.index.to_series().shift(-1).map(df.col.drop_duplicates()).bfill())
col addedcol
0 1 5.0
1 5 2.0
2 5 2.0
3 2 NaN
Another example, if df is
col
0 1
1 5
2 5
3 2
4 3
5 3
6 10
7 9
Then
df.loc[:, 'col2'] = df.drop_duplicates().shift(-1).col
df = df.ffill()
yields
col col2
0 1 5.0
1 5 2.0
2 5 2.0
3 2 3.0
4 3 10.0
5 3 10.0
6 10 9.0
7 9 9.0

Using factorize
s = pd.factorize(df.col)[0]                  # integer label per row, in order of first appearance
pd.Series(s + 1).map(dict(zip(s, df.col)))   # label + 1 -> the next unique value
Out[242]:
0 5.0
1 2.0
2 2.0
3 NaN
dtype: float64
# df['newadd'] = pd.Series(s + 1).map(dict(zip(s, df.col))).values
Under @MartijnPieters' condition, where consecutive duplicates form one run and a value may reappear later (the 8-row output below is consistent with col = [1, 5, 5, 2, 2, 4, 5, 5]):
s = df.col.diff().ne(0).cumsum()     # label each consecutive run
(s + 1).map(dict(zip(s, df.col)))    # run label + 1 -> first value of the next run
Out[260]:
0 5.0
1 2.0
2 2.0
3 4.0
4 4.0
5 5.0
6 NaN
7 NaN
Name: col, dtype: float64

Setup
I've added additional data with multiple clusters:
df = pd.DataFrame({'col': [*map(int, '1552554442')]})
Two interpretations
We have to consider what happens when there are non-contiguous clusters:
df
col
0 1  # First instance of `1`. Next unique is `5`
1 5  # First instance of `5`. Next unique is `2`
2 5  # Next unique is `2`
3 2  # First instance of `2`. Next unique is `4`, because `5` is not new
4 5  # Next unique is `4`
5 5  # Next unique is `4`
6 4  # First instance of `4`. Next unique is null
7 4  # Next unique is null
8 4  # Next unique is null
9 2  # Second time we've seen `2`. Should the next unique be null, or what it was before (`4`)?
Allowed to look back
Use factorize and add 1. This is very much in the spirit of @Wen's answer:
import numpy as np

i, u = df.col.factorize()   # i: label per row, u: unique values in order of appearance
u_ = np.append(u, -1)       # append an integer sentinel to represent null
df.assign(addedcol=u_[i + 1])
col addedcol
0 1 5
1 5 2
2 5 2
3 2 4
4 5 2
5 5 2
6 4 -1
7 4 -1
8 4 -1
9 2 4
Only Forward
Similar to before, except we'll track the cumulative maximum of the factorized values:
i, u = df.col.factorize()
u_ = np.append(u, -1)         # append an integer sentinel to represent null
x = np.maximum.accumulate(i)  # highest label seen so far; no looking back
df.assign(addedcol=u_[x + 1])
col addedcol
0 1 5
1 5 2
2 5 2
3 2 4
4 5 4
5 5 4
6 4 -1
7 4 -1
8 4 -1
9 2 -1
You'll notice that the difference is in the last value. When we can only look forward, we see that there is no next unique value.
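In either variant, the -1 sentinel can be swapped for a proper missing value afterwards; a small sketch of that cleanup (my addition, not from the original answer):

import numpy as np

out = df.assign(addedcol=u_[x + 1])
out['addedcol'] = out['addedcol'].replace(-1, np.nan)  # upcasts to float, like the other answers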

Related

How can I expand masked intervals in a timeseries dataframe to adjacent rows?

I have a timeseries dataset that has intervals of bad measurements. I clean the data by using df.mask() to reject the bad measurements that are above or below a threshold. However, I'm concerned that parts of the adjacent intervals are affected by bad measurements as well, just not enough to exceed the threshold. To be safe, I'd like to mask these adjacent intervals too.
For example:
>>> df
seconds value
0 1 5
1 2 2
2 3 -1
3 4 -3
4 5 2
5 6 4
6 7 6
>>> # Mask the negative values because we know those are bad measurements
>>> df["good value"] = df["value"].mask(lambda x: x < 0)
>>> df
seconds value good value
0 1 5 5.0
1 2 2 2.0 # <--- want to mask as well
2 3 -1 NaN
3 4 -3 NaN
4 5 2 2.0 # <--- want to mask as well
5 6 4 4.0
6 7 6 6.0
How can I expand any blocks of masked values into one or two adjacent rows?
You can shift the mask to the adjacent rows (fill_value=False keeps the shifted masks boolean):
mask = df["value"].lt(0)
df["good value"] = df["value"].mask(mask | mask.shift(-1, fill_value=False) | mask.shift(fill_value=False))
print(df)
seconds value good value
0 1 5 5.0
1 2 2 NaN
2 3 -1 NaN
3 4 -3 NaN
4 5 2 NaN
5 6 4 4.0
6 7 6 6.0
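If you want a wider margin, here is a sketch that expands the mask by k rows on each side; the parameter k and the rolling trick are my assumptions, not part of the answer above:

k = 1  # rows to mask on each side of a bad measurement
bad = df["value"].lt(0)
# a centered window of width 2*k + 1 contains a bad row iff its max is 1
expanded = bad.astype(int).rolling(2 * k + 1, center=True, min_periods=1).max().astype(bool)
df["good value"] = df["value"].mask(expanded)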

Closest non-equal row in a column in a Pandas dataframe

I have this df:
import pandas as pd

d = {}
d['id'] = ['1','1','1','1','1','1','1','1','2','2','2','2','2','2','2','2']
d['qty'] = [5,5,5,5,5,6,5,5,1,1,2,2,2,3,5,8]
df = pd.DataFrame(d)
I would like to create a column holding the next non-equal value of column qty. Meaning: if qty is 5 and its next row is also 5, I skip it and keep looking until I find a value not equal to 5 (in my case, 6). All of this should be grouped by id.
Here is the desired dataframe.
d['id']=['1','1','1','1','1','1','1','1','2','2','2','2','2','2','2','2']
d['qty']=[5,5,5,5,5,6,5,5,1,1,2,2,2,3,5,8]
d['qty2']=[6,6,6,6,6,5,'NAN','NAN',2,2,3,3,3,5,8,'NAN']
Any help is very much appreciated
You can groupby.shift, mask the identical values, and groupby.bfill:
# shift up per group
s = df.groupby('id')['qty'].shift(-1)
# keep only the different values and bfill per group
df['qty2'] = s.where(df['qty'].ne(s)).groupby(df['id']).bfill()
output:
id qty qty2
0 1 5 6.0
1 1 5 6.0
2 1 5 6.0
3 1 5 6.0
4 1 5 6.0
5 1 6 5.0
6 1 5 NaN
7 1 5 NaN
8 2 1 2.0
9 2 1 2.0
10 2 2 3.0
11 2 2 3.0
12 2 2 3.0
13 2 3 5.0
14 2 5 8.0
15 2 8 NaN
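For comparison, a sketch that reuses the run-labelling trick from the first thread, applied per group via groupby.transform (my adaptation, not from the answer):

def next_diff(s):
    runs = s.diff().ne(0).cumsum()             # label consecutive runs of equal values
    return (runs + 1).map(dict(zip(runs, s)))  # map each run to the next run's value

df['qty2'] = df.groupby('id')['qty'].transform(next_diff)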

Create a column that contains the sum of rows above within group

Here's the dataset I've got. Basically, I would like to create a column containing the sum of the values before the date (i.e. the sum of the values above the row) within the same group, so the first row of each group should always be 0.
group  date        value
1      10/04/2022  2
1      12/04/2022  3
1      17/04/2022  5
1      22/04/2022  1
2      11/04/2022  3
2      15/04/2022  2
2      17/04/2022  4
The column I want would look like this.
Could you give me an idea how to create such a column?
group  date        value  sum
1      10/04/2022  2      0
1      12/04/2022  3      2
1      17/04/2022  5      5
1      22/04/2022  1      10
2      11/04/2022  3      0
2      15/04/2022  2      3
2      17/04/2022  4      5
You can try groupby.transform and call Series.cumsum().shift():
df['sum'] = (df
             # sort the dataframe if needed
             .assign(date=pd.to_datetime(df['date'], dayfirst=True))
             .sort_values(['group', 'date'])
             .groupby('group')['value']
             .transform(lambda col: col.cumsum().shift())
             .fillna(0))
print(df)
group date value sum
0 1 10/04/2022 2 0.0
1 1 12/04/2022 3 2.0
2 1 17/04/2022 5 5.0
3 1 22/04/2022 1 10.0
4 2 11/04/2022 3 0.0
5 2 15/04/2022 2 3.0
6 2 17/04/2022 4 5.0
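If the frame is already sorted by group and date, a shorter equivalent sketch (shift's fill_value also avoids the float upcast):

df['sum'] = df.groupby('group')['value'].transform(lambda s: s.cumsum().shift(fill_value=0))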

Drop dataframe rows not included in the interval between two other columns

I need to drop observations not included in an interval (whose limits are contained in two other columns) and substitute NaN values with the mean or median. I think I should use an if with three conditions, but I'm not very confident with dataframes.
Data-frame example:
col1 lower_bound upper_bound
3 2 6
1 2 6
3 2 6
5 2 6
8 2 6
4 2 6
NaN 2 6
desired output example:
col1 lower_bound upper_bound
3 2 6
3 2 6
5 2 6
4 2 6
mean/median 2 6
Thank you in advance for your help!
You can do this in two steps: fillna to fill the NaN with the mean or median, and indexing using between (or two comparisons) to keep the rows where col1 is between your bounds.
# Fill NaN in col1 with the mean
df['col1'] = df['col1'].fillna(df['col1'].mean())
# or with the median
# df['col1'] = df['col1'].fillna(df['col1'].median())
# Index based on your conditions (note between is inclusive of the bounds by default;
# pass inclusive='neither' to match the strict comparisons below):
df[df.col1.between(df.lower_bound, df.upper_bound)]
# or:
# df[(df.col1 > df.lower_bound) & (df.col1 < df.upper_bound)]
col1 lower_bound upper_bound
0 3.0 2 6
2 3.0 2 6
3 5.0 2 6
5 4.0 2 6
6 4.0 2 6
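For reference, a minimal sketch to reproduce the example above (the column values are read off the question's table):

import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [3, 1, 3, 5, 8, 4, np.nan],
                   'lower_bound': 2,
                   'upper_bound': 6})

df['col1'] = df['col1'].fillna(df['col1'].mean())  # mean of [3, 1, 3, 5, 8, 4] is 4.0
result = df[df['col1'].between(df['lower_bound'], df['upper_bound'])]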

Conditional sum from rows into a new column in pandas

I am looking to create a new column in pandas based on the value in the row. My sample data:
df = pd.DataFrame({"A": ['a','a','a','a','a','a','b','b','b','b'],
                   "Sales": [2,3,7,1,4,3,5,6,9,10],
                   "Week": [1,2,3,4,5,11,1,2,3,4]})
I want a new column "Last3WeekSales" corresponding to each week, having the sum of sales for the previous 3 weeks.
NOTE: shift() won't work here, as data for some weeks is missing.
The logic I thought of: check the week number in each row, then sum the data from weeks w-1, w-2, and w-3.
Output required:
A Week Last3WeekSales
0 a 1 0
1 a 2 2
2 a 3 5
3 a 4 12
4 a 5 11
5 a 11 0
6 b 1 0
7 b 2 5
8 b 3 11
9 b 4 20
Use groupby, shift and rolling:
df['Last3WeekSales'] = df.groupby('A')['Sales']\
                         .apply(lambda x: x.shift(1)
                                           .rolling(3, min_periods=1)
                                           .sum())\
                         .fillna(0)
Output:
   A  Sales  Week  Last3WeekSales
0  a      2     1             0.0
1  a      3     2             2.0
2  a      7     3             5.0
3  a      1     4            12.0
4  a      4     5            11.0
5  a      3    11            12.0
6  b      5     1             0.0
7  b      6     2             5.0
8  b      9     3            11.0
9  b     10     4            20.0
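Note that this row-based rolling ignores the gaps in Week that the question calls out: week 11 still gets 12.0 rather than the desired 0. A week-aware sketch (my addition, not from the answers; it assumes weeks are numbered from 1) reindexes each group onto a complete weekly range so that missing weeks contribute zero sales:

parts = []
for _, g in df.groupby('A'):
    # fill in the missing weeks with zero sales
    weekly = g.set_index('Week')['Sales'].reindex(range(1, g['Week'].max() + 1), fill_value=0)
    # sum of the previous 3 calendar weeks
    prev3 = weekly.rolling(3, min_periods=1).sum().shift(fill_value=0)
    parts.append(g.assign(Last3WeekSales=g['Week'].map(prev3).astype(int)))
df = pd.concat(parts)

This reproduces the Last3WeekSales column the question asks for, including the 0 at week 11.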
You can use a rolling sum to sum over the last 3 values, and shift(n) to shift your column by n rows (1 in your case). Assuming a column 'Sales' with the sales of each week, the code would be:
df["Last3WeekSales"] = df.groupby("A")["Sales"].apply(lambda x: x.shift(1).rolling(3).sum())
Note this variant yields NaN until 3 prior values exist in a group; add min_periods=1 and .fillna(0) to match the output above.
