How to "iron out" a column of numbers with duplicates in it - python

If one has the following column:
df = pd.DataFrame({"numbers":[1,2,3,4,4,5,1,2,2,3,4,5,6,7,7,8,1,1,2,2,3,4,5,6,6,7]})
How can one "iron" it out so that the duplicates become part of the series of numbers:
numbers new_numbers
1 1
2 2
3 3
4 4
4 5
5 6
1 1
2 2
2 3
3 4
4 5
5 6
6 7
7 8
7 9
8 10
1 1
1 2
2 3
2 4
3 5
4 6
5 7
6 8
6 9
7 10
(I added spacing in the listing above for clarity.)

You can use GroupBy.cumcount here. Group boundaries are found by taking diff and comparing with lt (<): a negative difference marks the start of a new group. Running cumsum over that boolean mask then produces the group labels:
# helper DataFrame df1 to show the intermediate steps
df1 = pd.DataFrame(index=df.index)
df1['dif'] = df.numbers.diff()                    # difference to the previous row
df1['compare'] = df.numbers.diff().lt(0)          # True where a new group starts
df1['groups'] = df.numbers.diff().lt(0).cumsum()  # running group label
print (df1)
dif compare groups
0 NaN False 0
1 1.0 False 0
2 1.0 False 0
3 1.0 False 0
4 0.0 False 0
5 1.0 False 0
6 -4.0 True 1
7 1.0 False 1
8 0.0 False 1
9 1.0 False 1
10 1.0 False 1
11 1.0 False 1
12 1.0 False 1
13 1.0 False 1
14 0.0 False 1
15 1.0 False 1
16 -7.0 True 2
17 0.0 False 2
18 1.0 False 2
19 0.0 False 2
20 1.0 False 2
21 1.0 False 2
22 1.0 False 2
23 1.0 False 2
24 0.0 False 2
25 1.0 False 2
df['new_numbers'] = df.groupby(df.numbers.diff().lt(0).cumsum()).cumcount() + 1
print (df)
numbers new_numbers
0 1 1
1 2 2
2 3 3
3 4 4
4 4 5
5 5 6
6 1 1
7 2 2
8 2 3
9 3 4
10 4 5
11 5 6
12 6 7
13 7 8
14 7 9
15 8 10
16 1 1
17 1 2
18 2 3
19 2 4
20 3 5
21 4 6
22 5 7
23 6 8
24 6 9
25 7 10

Related

How to find out the cumulative count between numbers?

I want to find the cumulative count before there is a change in value, i.e. how many rows have passed since the last change. For illustration:
Value  diff  #row since last change (how do I create this column?)
6      na    na
5      -1    0
5      0     1
5      0     2
4      -1    0
4      0     1
4      0     2
4      0     3
4      0     4
5      1     0
5      0     1
5      0     2
5      0     3
6      1     0
7      1     0
I tried to use cumsum but it does not reset after each change.
IIUC, use a cumcount per group:
df['new'] = df.groupby(df['Value'].ne(df['Value'].shift()).cumsum()).cumcount()
output:
Value diff new
0 6 na 0
1 5 -1 0
2 5 0 1
3 5 0 2
4 4 -1 0
5 4 0 1
6 4 0 2
7 4 0 3
8 4 0 4
9 5 1 0
10 5 0 1
11 5 0 2
12 5 0 3
13 6 1 0
14 7 1 0
If you want NaN in the output wherever diff is NaN, you can mask the result:
df['new'] = (df.groupby(df['Value'].ne(df['Value'].shift()).cumsum()).cumcount()
.mask(df['diff'].isna())
)
output:
Value diff new
0 6 NaN NaN
1 5 -1.0 0.0
2 5 0.0 1.0
3 5 0.0 2.0
4 4 -1.0 0.0
5 4 0.0 1.0
6 4 0.0 2.0
7 4 0.0 3.0
8 4 0.0 4.0
9 5 1.0 0.0
10 5 0.0 1.0
11 5 0.0 2.0
12 5 0.0 3.0
13 6 1.0 0.0
14 7 1.0 0.0
If performance is important, count consecutive 0 values from the diff column:
m = df['diff'].eq(0)    # True while the value is unchanged
b = m.cumsum()          # running total of unchanged rows
# subtracting the total frozen at the last change resets the count per run
df['out'] = b.sub(b.mask(m).ffill().fillna(0)).astype(int)
print (df)
Value diff need out
0 6 NaN na 0
1 5 -1.0 0 0
2 5 0.0 1 1
3 5 0.0 2 2
4 4 -1.0 0 0
5 4 0.0 1 1
6 4 0.0 2 2
7 4 0.0 3 3
8 4 0.0 4 4
9 5 1.0 0 0
10 5 0.0 1 1
11 5 0.0 2 2
12 5 0.0 3 3
13 6 1.0 0 0
14 7 1.0 0 0
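The `b.sub(b.mask(m).ffill().fillna(0))` expression is a general idiom for counting consecutive True values in a boolean mask. A minimal sketch on a toy Series (the values here are made up for illustration):

```python
import pandas as pd

# cumsum gives a running total of True values; subtracting the total
# "frozen" at the last False resets the count at the start of every run.
m = pd.Series([False, True, True, False, True, True, True])
b = m.cumsum()                                       # 0 1 2 2 3 4 5
run = b.sub(b.mask(m).ffill().fillna(0)).astype(int)
print(run.tolist())                                  # [0, 1, 2, 0, 1, 2, 3]
```

`b.mask(m)` keeps the cumulative total only on False rows, and `ffill` carries that total forward through each run of True, so the subtraction yields the position within the run.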

Calculate the amount of consecutive missing values in a row

I am trying to find a way to calculate the number of values randomly removed from a data frame, and the number of values removed consecutively (one after another).
The code I have so far is:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
#Sampledata
x=[1,2,3,4,5,6,7,8,9,10]
y=[1,2,3,4,5,6,7,8,9,10]
df = pd.DataFrame({'col_1':y,'col_2':x})
drop_indices = np.random.choice(df.index, 5, replace=False)
df_subset = df.drop(drop_indices)
print(df_subset)
print(df)
Which randomly removes 5 rows from the data frame and gives as output:
col_1 col_2
0 1 1
1 2 2
2 3 3
5 6 6
8 9 9
col_1 col_2
0 1 1
1 2 2
2 3 3
3 4 4
4 5 5
5 6 6
6 7 7
7 8 8
8 9 9
9 10 10
I would like to turn this into the following data frame:
   col_1  col_2  col_2  N_removedvalues  N_consecutive
0      1      1      1                0              0
1      2      2      2                0              0
2      3      3      3                0              0
3      4      4    NaN                1              1
4      5      5    NaN                2              2
5      6      6      6                2              0
6      7      7    NaN                3              1
7      8      8    NaN                4              2
8      9      9      9                4              0
9     10     10    NaN                5              1
res=df.merge(df_subset, on='col_1', suffixes=['_1',''], how='left')
res["N_removedvalues"]=np.where(res['col_2'].isna(), res.groupby(res['col_2'].isna()).cumcount().add(1), np.nan)
res["N_removedvalues"]=res["N_removedvalues"].ffill().fillna(0)
res['N_consecutive']=np.logical_and(res['col_2'].isna(), np.logical_or(~res['col_2'].shift().isna(), res.index==res.index[0]))
res.loc[np.logical_and(res['N_consecutive']==0, res['col_2'].isna()), 'N_consecutive']=np.nan
res['N_consecutive']=res.groupby('N_consecutive')['N_consecutive'].cumsum().ffill()
res.loc[res['N_consecutive'].gt(0), 'N_consecutive']=res.loc[res['N_consecutive'].gt(0)].groupby('N_consecutive').cumcount().add(1)
Outputs:
col_1 col_2_1 col_2 N_removedvalues N_consecutive
0 1 1 1.0 0.0 0.0
1 2 2 2.0 0.0 0.0
2 3 3 NaN 1.0 1.0
3 4 4 4.0 1.0 0.0
4 5 5 NaN 2.0 1.0
5 6 6 NaN 3.0 2.0
6 7 7 7.0 3.0 0.0
7 8 8 8.0 3.0 0.0
8 9 9 NaN 4.0 1.0
9 10 10 NaN 5.0 2.0
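A simpler alternative sketch (not the solution above): after the merge, both columns can be derived directly from the isna() mask. To keep it reproducible, the dropped indices are fixed here to [2, 4, 5, 8, 9], matching the example output, instead of drawing them randomly:

```python
import pandas as pd

x = list(range(1, 11))
df = pd.DataFrame({'col_1': x, 'col_2': x})
df_subset = df.drop([2, 4, 5, 8, 9])      # fixed instead of np.random.choice

res = df.merge(df_subset, on='col_1', suffixes=['_1', ''], how='left')
miss = res['col_2'].isna()                # True on removed rows
res['N_removedvalues'] = miss.cumsum()    # running total of removed rows
grp = (~miss).cumsum()                    # constant within each run of NaNs
res['N_consecutive'] = miss.astype(int).groupby(grp).cumsum()
print(res)
```

Grouping the mask by `(~miss).cumsum()` means the cumulative sum restarts after every present row, which yields the consecutive-missing count directly.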

pandas timeseries splitting into many and taking the mean

I have the following pandas dataframe:
SEC POS DATA
1 1 4
2 1 4
3 1 5
4 1 5
5 2 2
6 3 4
7 3 2
8 4 2
9 4 2
10 1 8
11 1 6
12 2 5
13 2 5
14 2 4
15 2 6
16 3 2
17 4 1
Now I want to know the mean value of DATA and the first value of SEC for every block of the POS column.
So like this:
SEC POS DATA
1 1 4.5
5 2 2
6 3 3
8 4 2
10 1 7
12 2 5
16 3 2
17 4 1
Additionally, I want to subtract the DATA value at POS=4 from its 3 prior DATA values, i.e. where POS = [1,2,3].
Obtaining the following:
SEC POS DATA
1 1 2.5
5 2 0
6 3 1
8 4 2
10 1 6
12 2 4
16 3 1
17 4 1
I figured out how to do this by splitting the dataframe into many smaller dataframes with a for loop, taking the mean, and then subtracting across the resulting dataframes. However, this is very slow, so I'm wondering if there is a faster way to do it. Can anyone help?
Thanks!
Another solution:
# start a new block wherever POS changes
diff_to_previous = df.POS != df.POS.shift(1)
df = df.groupby(diff_to_previous.cumsum(), as_index=False).agg({'SEC': 'first', 'POS': 'first', 'DATA': 'mean'})
# label each run of blocks ending at POS == 4
df['tmp'] = (df['POS'] == 4).astype(int).shift(fill_value=0).cumsum()
# subtract the last DATA value of each run (POS == 4) from all earlier values in the run
df['DATA'] = df.groupby('tmp')['DATA'].transform(lambda x: [*(x[x.index[:-1]] - x[x.index[-1]]), x[x.index[-1]]])
df = df.drop(columns='tmp')
print(df)
Prints:
SEC POS DATA
0 1 1 2.5
1 5 2 0.0
2 6 3 1.0
3 8 4 2.0
4 10 1 6.0
5 12 2 4.0
6 16 3 1.0
7 17 4 1.0
For your first problem, we can use:
grps = df['POS'].ne(df['POS'].shift()).cumsum()
dfg = df.groupby(grps).agg(
POS=('POS', 'min'),
SEC=('SEC', 'min'),
DATA=('DATA', 'mean')
).reset_index(drop=True)
POS SEC DATA
0 1 1 4.5
1 2 5 2.0
2 3 6 3.0
3 4 8 2.0
4 1 10 7.0
5 2 12 5.0
6 3 16 2.0
7 4 17 1.0
For your second problem:
grps2 = dfg['POS'].lt(dfg['POS'].shift()).cumsum()
m = (
dfg.groupby(grps2)
.apply(lambda x: x.loc[x['POS'].isin([1,2,3]), 'DATA']
- x.loc[x['POS'].eq(4), 'DATA'].iat[0])
.droplevel(0)
)
dfg['DATA'].update(m)
POS SEC DATA
0 1 1 2.5
1 2 5 0.0
2 3 6 1.0
3 4 8 2.0
4 1 10 6.0
5 2 12 4.0
6 3 16 1.0
7 4 17 1.0
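For the subtraction step, a shorter sketch (an alternative, assuming each POS block ends at POS == 4 as in the sample data) broadcasts the POS == 4 value across its block with transform('last'). Here dfg is rebuilt from the aggregated values shown above:

```python
import pandas as pd

# aggregated result of the first step (copied from the output above)
dfg = pd.DataFrame({
    'POS':  [1, 2, 3, 4, 1, 2, 3, 4],
    'SEC':  [1, 5, 6, 8, 10, 12, 16, 17],
    'DATA': [4.5, 2.0, 3.0, 2.0, 7.0, 5.0, 2.0, 1.0],
})

grps2 = dfg['POS'].lt(dfg['POS'].shift()).cumsum()   # new block when POS restarts
last = dfg.groupby(grps2)['DATA'].transform('last')  # DATA at POS == 4, per block
# keep the POS == 4 row as-is, subtract it everywhere else
dfg['DATA'] = dfg['DATA'].where(dfg['POS'].eq(4), dfg['DATA'] - last)
print(dfg)
```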

Pandas: Groupby two columns and count the occurrence of all values for 2nd column

I want to group my dataframe using two columns: one is yearmonth (format: 16-10) and the other is number of cust. Then, if the number of customers is more than six, I want to create a single row labelled 6+ that replaces all the rows with number of cust > 6 and holds the sum of their count values.
This is how data looks like
index month num ofcust count
0 10 1.0 1
1 10 2.0 1
2 10 3.0 1
3 10 4.0 1
4 10 5.0 1
5 10 6.0 1
6 10 7.0 1
7 10 8.0 1
8 11 1.0 1
9 11 2.0 1
10 11 3.0 1
11 12 12.0 1
Output:
index month no of cust count
0 16-10 1.0 3
1 16-10 2.0 6
2 16-10 3.0 2
3 16-10 4.0 3
4 16-10 5.0 4
5 16-10 6+ 4
6 16-11 1.0 4
7 16-11 2.0 3
8 16-11 3.0 2
9 16-11 4.0 1
10 16-11 5.0 3
11 16-11 6+ 5
I believe you need to replace all values >= 6 first, and then groupby + aggregate sum:
s = df['num ofcust'].mask(df['num ofcust'] >=6, '6+')
#alternatively
#s = df['num ofcust'].where(df['num ofcust'] <6, '6+')
df = df.groupby(['month', s])['count'].sum().reset_index()
print (df)
month num ofcust count
0 10 1 1
1 10 2 1
2 10 3 1
3 10 4 1
4 10 5 1
5 10 6+ 3
6 11 1 1
7 11 2 1
8 11 3 1
9 12 6+ 1
Detail:
print (s)
0 1
1 2
2 3
3 4
4 5
5 6+
6 6+
7 6+
8 1
9 2
10 3
11 6+
Name: num ofcust, dtype: object
Another very similar solution is to assign the replacement values to the column first:
df.loc[df['num ofcust'] >= 6, 'num ofcust'] = '6+'
df = df.groupby(['month', 'num ofcust'], as_index=False)['count'].sum()
print (df)
month num ofcust count
0 10 1 1
1 10 2 1
2 10 3 1
3 10 4 1
4 10 5 1
5 10 6+ 3
6 11 1 1
7 11 2 1
8 11 3 1
9 12 6+ 1

Count distinct strings in rolling window using pandas

How do I count the number of unique strings in a rolling window of a pandas dataframe?
a = pd.DataFrame(['a','b','a','a','b','c','d','e','e','e','e'])
a.rolling(3).apply(lambda x: len(np.unique(x)))
Output, same as original dataframe:
0
0 a
1 b
2 a
3 a
4 b
5 c
6 d
7 e
8 e
9 e
10 e
Expected:
0
0 1
1 2
2 2
3 2
4 2
5 3
6 3
7 3
8 2
9 1
10 1
I think you first need to convert the values to numeric, either with factorize or with rank. The min_periods parameter is also necessary, to avoid NaN at the start of the column:
a[0] = pd.factorize(a[0])[0]
print (a)
0
0 0
1 1
2 0
3 0
4 1
5 2
6 3
7 4
8 4
9 4
10 4
b = a.rolling(3, min_periods=1).apply(lambda x: len(np.unique(x))).astype(int)
print (b)
0
0 1
1 2
2 2
3 2
4 2
5 3
6 3
7 3
8 2
9 1
10 1
Or:
a[0] = a[0].rank(method='dense')
0
0 1.0
1 2.0
2 1.0
3 1.0
4 2.0
5 3.0
6 4.0
7 5.0
8 5.0
9 5.0
10 5.0
b = a.rolling(3, min_periods=1).apply(lambda x: len(np.unique(x))).astype(int)
print (b)
0
0 1
1 2
2 2
3 2
4 2
5 3
6 3
7 3
8 2
9 1
10 1
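If you prefer to skip the numeric conversion entirely, a simple alternative sketch (not the answer above) counts distinct strings with nunique over explicit window slices. This is O(n * window) rather than a true rolling operation, but it avoids the numeric-only restriction of rolling.apply:

```python
import pandas as pd

a = pd.DataFrame(['a', 'b', 'a', 'a', 'b', 'c', 'd', 'e', 'e', 'e', 'e'])
window = 3

# slice the last `window` values at each position and count distinct strings
b = pd.Series(
    [a[0].iloc[max(0, i - window + 1): i + 1].nunique() for i in range(len(a))]
)
print(b.tolist())   # [1, 2, 2, 2, 2, 3, 3, 3, 2, 1, 1]
```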
