Pandas: Drop and count consecutive duplicates with condition - python

I want to drop and count duplicates in column val when val equal to 1.
Then set start to be the first row and end to be the last row of consecutive duplicates.
df = pd.DataFrame()
df['start'] = [1, 2, 3, 4, 5, 6, 18, 30, 31]
df['end'] = [2, 3, 4, 5, 6, 18, 30, 31, 32]
df['val'] = [1 , 1, 1, 1, 1, 12, 12, 1, 1]
df
start end val
0 1 2 1
1 2 3 1
2 3 4 1
3 4 5 1
4 5 6 1
5 6 18 12
6 18 30 12
7 30 31 1
8 31 32 1
Expected Result
start end val
0 1 6 5
1 6 18 12
2 18 30 12
3 30 32 2
I tried. df[~((df.val==1) & (df.val == df.val.shift(1)) & (df.val == df.val.shift(-1)))]
start end val
0 1 2 1
4 5 6 1
5 6 18 12
6 18 30 12
7 30 31 1
8 31 32 1
but I can't figure out how to complete my expected result, any suggestion?

Use:
#mask by condition
m = df.val==1
#consecutive groups
g = m.ne(m.shift()).cumsum()
#filter by condition and aggregate per groups
df1 = df.groupby(g[m]).agg({'start':'first', 'end':'last', 'val':'sum'})
#concat together, for correct order create index by g
df = pd.concat([df1, df.set_index(g)[~m.values]]).sort_index().reset_index(drop=True)
print (df)
start end val
0 1 6 5
1 6 18 12
2 18 30 12
3 30 32 2

You could also do a two-liner with a mask to groupby:
m = (df.val.ne(1) | df.val.ne(df.val.shift())).cumsum()
df = df.groupby(m).agg({'start': 'first', 'end': 'last', 'val': 'last'})

Solution by #jezrael is perfect, but here is slightly different approach:
df['aux'] = (df['val'] != df['val'].shift()).cumsum()
df.loc[df['val'] == 1, 'end'] = df[df['val'] == 1].groupby('aux')['end'].transform('last')
df.loc[df['val'] == 1, 'val'] = df.groupby('aux')['val'].transform('sum')
df = df.drop_duplicates(subset=df.columns.difference(['start']), keep='first')
df = df.drop(columns=['aux'])

Related

Delete rows in apply() function or depending on apply() result

Here I have a working solution but my question focus on how to do this the Pandas way. I assume Pandas over better solutions for this.
I use groupby() and then apply(axis=1) to compare the values in the rows of the groups. And while doing this I made the decision which row to delete.
The rule doesn't matter! In this example here the rule is that when values in column A differ only by 1 (the values are "near") then delete the second one. How the decision is made is not part of the question. There could also be a list of color names and I would say that darkblue and marineblue are "near" and one if should be deleted.
The initial data frame is that.
X A B
0 A 9 0 <--- DELETE
1 A 14 1
2 A 8 2
3 A 1 3
4 A 18 4
5 B 10 5
6 B 20 6
7 B 11 7 <--- DELETE
8 B 30 8
Row index 0 should be deleted because it's value 9 is near the value 8 in row index 2. The same with row index 7: It's value 11 is "near" 10 in row index 5.
That is the code
#!/usr/bin/env python3
import pandas as pd
df = pd.DataFrame(
{
'X': list('AAAAABBBB'),
'A': [9, 14, 8, 1, 18, 10, 20, 11, 30],
'B': range(9)
}
)
print(df)
def mark_near_neighbors(group):
# I snip the decission process here.
# Delete 9 because it is "near" 8.
default_result = pd.Series(
data=[False] * len(group),
index=['Delete'] * len(group)
)
if group.X.iloc[0] is 'A':
# the 9
default_result.iloc[0] = True
else:
# the 11
default_result.iloc[2] = True
return default_result
result = df.groupby('X').apply(mark_near_neighbors)
result = result.reset_index(drop=True)
print(result)
df = df.loc[~result]
print(df)
So in the end I use a "boolean indexing thing" to solve this
0 True
1 False
2 False
3 False
4 False
5 False
6 False
7 True
8 False
dtype: bool
But is there a better way to do this?
Initialize the dataframe
df = pd.DataFrame([
['A', 9, 0],
['A', 14, 1],
['A', 8, 2],
['A', 1, 3],
['B', 18, 4],
['B', 10, 5],
['B', 20, 6],
['B', 11, 7],
['B', 30, 8],
], columns=['X', 'A', 'B'])
Sort the dataframe based on A column
df = df.sort_values('A')
Find the difference between values
df["diff" ] =df.groupby('X')['A'].diff()
Select the rows where the difference is not 1
result = df[df["diff"] != 1.0]
Drop the extra column and sort by index to get the initial dataframe
result.drop("diff", axis=1, inplace=True)
result = result.sort_index()
Sample output
X A B
1 A 14 1
2 A 8 2
3 A 1 3
4 B 18 4
5 B 10 5
6 B 20 6
8 B 30 8
IIUC, you can use numpy broadcasting to compare all values within a group. Keeping everything with apply here as it seems wanted:
def mark_near_neighbors(group, thresh=1):
a = group.to_numpy().astype(float)
idx = np.argsort(a)
b = a[idx]
d = abs(b-b[:,None])
d[np.triu_indices(d.shape[0])] = thresh+1
return pd.Series((d>thresh).all(1)[np.argsort(idx)], index=group.index)
out = df[df.groupby('X')['A'].apply(mark_near_neighbors)]
output:
X A B
1 A 14 1
2 A 8 2
3 A 1 3
4 A 18 4
5 B 10 5
6 B 20 6
8 B 30 8

How to remove rows from DF as result of a groupby query?

I have this Pandas dataframe:
df = pd.DataFrame({'site': ['a', 'a', 'a', 'b', 'b', 'b', 'a', 'a', 'a'], 'day': [1, 1, 1, 1, 1, 1, 2, 2, 2],
'hour': [1, 2, 3, 1, 2, 3, 1, 2, 3], 'clicks': [100, 200, 50, 0, 0, 0, 10, 0, 20]})
# site day hour clicks
# 0 a 1 1 100
# 1 a 1 2 200
# 2 a 1 3 50
# 3 b 1 1 0
# 4 b 1 2 0
# 5 b 1 3 0
# 6 a 2 1 10
# 7 a 2 2 0
# 8 a 2 3 20
And I want to remove all rows for a site/day, where there were 0 clicks. So in the example above, I would want to remove the rows with site='b' and day =1.
I can basically group them and show where the sum is 0 for a day/site:
print(df.groupby(['site', 'day'])['clicks'].sum() == 0)
But how would now be straight-forward way to remove the rows from original dataframe where that condition applies?
Solution I am having so far is that I iterate over group and save all tuples of site/day in a list, and then separately remove all rows that have that combinations of site/day. That works but, I am sure there must be a more functional and elegant way to achieve that result?
Option 1
Using groupby, transform and boolean indexing:
df[df.groupby(['site', 'day'])['clicks'].transform('sum') != 0]
Output:
site day hour clicks
0 a 1 1 100
1 a 1 2 200
2 a 1 3 50
6 a 2 1 10
7 a 2 2 0
8 a 2 3 20
Option 2
Using groupby and filter:
df.groupby(['site', 'day']).filter(lambda x: x['clicks'].sum() != 0)
Output:
site day hour clicks
0 a 1 1 100
1 a 1 2 200
2 a 1 3 50
6 a 2 1 10
7 a 2 2 0
8 a 2 3 20

Count appearances of a value until it changes to another value

I have the following DataFrame:
df = pd.DataFrame([10, 10, 23, 23, 9, 9, 9, 10, 10, 10, 10, 12], columns=['values'])
I want to calculate the frequency of each value, but not an overall count - the count of each value until it changes to another value.
I tried:
df['values'].value_counts()
but it gives me
10 6
9 3
23 2
12 1
The desired output is
10:2
23:2
9:3
10:4
12:1
How can I do this?
Use:
df = df.groupby(df['values'].ne(df['values'].shift()).cumsum())['values'].value_counts()
Or:
df = df.groupby([df['values'].ne(df['values'].shift()).cumsum(), 'values']).size()
print (df)
values values
1 10 2
2 23 2
3 9 3
4 10 4
5 12 1
Name: values, dtype: int64
Last for remove first level:
df = df.reset_index(level=0, drop=True)
print (df)
values
10 2
23 2
9 3
10 4
12 1
dtype: int64
Explanation:
Compare original column by shifted with not equal ne and then add cumsum for helper Series:
print (pd.concat([df['values'], a, b, c],
keys=('orig','shifted', 'not_equal', 'cumsum'), axis=1))
orig shifted not_equal cumsum
0 10 NaN True 1
1 10 10.0 False 1
2 23 10.0 True 2
3 23 23.0 False 2
4 9 23.0 True 3
5 9 9.0 False 3
6 9 9.0 False 3
7 10 9.0 True 4
8 10 10.0 False 4
9 10 10.0 False 4
10 10 10.0 False 4
11 12 10.0 True 5
You can keep track of where the changes in df['values'] occur, and groupby the changes and also df['values'] (to keep them as index) computing the size of each group
changes = df['values'].diff().ne(0).cumsum()
df.groupby([changes,'values']).size().reset_index(level=0, drop=True)
values
10 2
23 2
9 3
10 4
12 1
dtype: int64
itertools.groupby
from itertools import groupby
pd.Series(*zip(*[[len([*v]), k] for k, v in groupby(df['values'])]))
10 2
23 2
9 3
10 4
12 1
dtype: int64
It's a generator
def f(x):
count = 1
for this, that in zip(x, x[1:]):
if this == that:
count += 1
else:
yield count, this
count = 1
yield count, [*x][-1]
pd.Series(*zip(*f(df['values'])))
10 2
23 2
9 3
10 4
12 1
dtype: int64
Using crosstab
df['key']=df['values'].diff().ne(0).cumsum()
pd.crosstab(df['key'],df['values'])
Out[353]:
values 9 10 12 23
key
1 0 2 0 0
2 0 0 0 2
3 3 0 0 0
4 0 4 0 0
5 0 0 1 0
Slightly modify the result above
pd.crosstab(df['key'],df['values']).stack().loc[lambda x:x.ne(0)]
Out[355]:
key values
1 10 2
2 23 2
3 9 3
4 10 4
5 12 1
dtype: int64
Base on python groupby
from itertools import groupby
[ (k,len(list(g))) for k,g in groupby(df['values'].tolist())]
Out[366]: [(10, 2), (23, 2), (9, 3), (10, 4), (12, 1)]
This is far from the most time/memory efficient method that in this thread but here's an iterative approach that is pretty straightforward. Please feel encouraged to suggest improvements on this method.
import pandas as pd
df = pd.DataFrame([10, 10, 23, 23, 9, 9, 9, 10, 10, 10, 10, 12], columns=['values'])
dict_count = {}
for v in df['values'].unique():
dict_count[v] = 0
curr_val = df.iloc[0]['values']
count = 1
for i in range(1, len(df)):
if df.iloc[i]['values'] == curr_val:
count += 1
else:
if count > dict_count[curr_val]:
dict_count[curr_val] = count
curr_val = df.iloc[i]['values']
count = 1
if count > dict_count[curr_val]:
dict_count[curr_val] = count
df_count = pd.DataFrame(dict_count, index=[0])
print(df_count)
The function groupby in itertools can help you, for str:
>>> string = 'aabbaacc'
>>> for char, freq in groupby('aabbaacc'):
>>> print(char, len(list(freq)), sep=':', end='\n')
[out]:
a:2
b:2
a:2
c:2
This function also works for list:
>>> df = pd.DataFrame([10, 10, 23, 23, 9, 9, 9, 10, 10, 10, 10, 12], columns=['values'])
>>> for char, freq in groupby(df['values'].tolist()):
>>> print(char, len(list(freq)), sep=':', end='\n')
[out]:
10:2
23:2
9:3
10:4
12:1
Note: for df, you always use this way like df['values'] to take 'values' column, because DataFrame have a attribute values

Selecting rows from pandas dataframe limited by count per column value

I have a dataframe defined as follows:
df = pd.DataFrame({'id': [11, 12, 13, 14, 21, 22, 31, 32, 33],
'class': ['A', 'A', 'A', 'A', 'B', 'B', 'C', 'C', 'C'],
'count': [2, 2, 2 ,2 ,1, 1, 2, 2, 2]})
For each class, I'd like to select top n rows where n is specified by count column. The expected output from the above dataframe would be like this:
How can I achieve this?
You could use
In [771]: df.groupby('class').apply(
lambda x: x.head(x['count'].iloc[0])
).reset_index(drop=True)
Out[771]:
id class count
0 11 A 2
1 12 A 2
2 21 B 1
3 31 C 2
4 32 C 2
Use:
(df.groupby('class', as_index=False, group_keys=False)
.apply(lambda x: x.head(x['count'].iloc[0])))
Output:
id class count
0 11 A 2
1 12 A 2
4 21 B 1
6 31 C 2
7 32 C 2
Using cumcount
df[(df.groupby('class').cumcount()+1).le(df['count'])]
Out[150]:
class count id
0 A 2 11
1 A 2 12
4 B 1 21
6 C 2 31
7 C 2 32
Here is a solution which groups by class then then looks at the first value in the smaller dataframe and returns the corresponding rows.
def func(df_):
count_val = df_['count'].values[0]
return df_.iloc[0:count_val]
df.groupby('class', group_keys=False).apply(func)
returns
class count id
0 A 2 11
1 A 2 12
4 B 1 21
6 C 2 31
7 C 2 32

How to combine single and multiindex Pandas DataFrames

I am trying to concatenate multiple Pandas DataFrames, some of which use multi-indexing and others use single indices. As an example, let's consider the following single indexed dataframe:
> import pandas as pd
> df1 = pd.DataFrame({'single': [10,11,12]})
> df1
single
0 10
1 11
2 12
Along with a multiindex dataframe:
> level_dict = {}
> level_dict[('level 1','a','h')] = [1,2,3]
> level_dict[('level 1','b','j')] = [5,6,7]
> level_dict[('level 2','c','k')] = [10, 11, 12]
> level_dict[('level 2','d','l')] = [20, 21, 22]
> df2 = pd.DataFrame(level_dict)
> df2
level 1 level 2
a b c d
h j k l
0 1 5 10 20
1 2 6 11 21
2 3 7 12 22
Now I wish to concatenate the two dataframes. When I try to use concat it flattens the multiindex as follows:
> df3 = pd.concat([df2,df1], axis=1)
> df3
(level 1, a, h) (level 1, b, j) (level 2, c, k) (level 2, d, l) single
0 1 5 10 20 10
1 2 6 11 21 11
2 3 7 12 22 12
If instead I append a single column to the multiindex dataframe df2 as follows:
> df2['single'] = [10,11,12]
> df2
level 1 level 2 single
a b c d
h j k l
0 1 5 10 20 10
1 2 6 11 21 11
2 3 7 12 22 12
How can I instead generate this dataframe from df1 and df2 with concat, merge, or join?
I don't think you can avoid converting the single index into a MultiIndex. This is probably the easiest way, you could also convert after joining.
In [48]: df1.columns = pd.MultiIndex.from_tuples([(c, '', '') for c in df1])
In [49]: pd.concat([df2, df1], axis=1)
Out[49]:
level 1 level 2 single
a b c d
h j k l
0 1 5 10 20 10
1 2 6 11 21 11
2 3 7 12 22 12
If you're just appending one column you could access df1 essentially as a series:
df2[df1.columns[0]] = df1.iloc[:, 0]
df2
level 1 level 2 single
a b c d
h j k l
0 1 5 10 20 10
1 2 6 11 21 11
2 3 7 12 22 12
If you could have just made a series in the first place it would be a little easier to read. This command would do the same thing:
ser1 = df1.iloc[:, 0] # make df1's column into a series
df2[ser1.name] = ser1

Categories

Resources