Find maximum and minimum value of five consecutive rows by column - python

I want to get the maximum and minimum value of some columns, grouped by 5 consecutive rows. For example, I want the maximum of a and the minimum of b for every 5 consecutive rows:
a b
0 1 2
1 2 3
2 3 4
3 2 5
4 1 1
5 3 6
6 2 8
7 5 2
8 4 6
9 2 7
I want to have
a b
0 3 1
1 5 2
(Where 3 is the maximum of 1,2,3,2,1 and 1 is the minimum of 2,3,4,5,1, and so on.)

Use integer division (//) to form the index for grouping by every 5 items, and then use groupby and agg:
out = df.groupby(df.index // 5).agg({'a':'max', 'b':'min'})
Output:
>>> out
a b
0 3 1
1 5 2
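For reference, a minimal runnable version of the above (a sketch that assumes the frame has the default RangeIndex shown):
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 2, 1, 3, 2, 5, 4, 2],
                   'b': [2, 3, 4, 5, 1, 6, 8, 2, 6, 7]})

# df.index // 5 maps rows 0-4 to group 0 and rows 5-9 to group 1
out = df.groupby(df.index // 5).agg({'a': 'max', 'b': 'min'})
print(out)
If the index is not a default RangeIndex, df.reset_index(drop=True).index // 5 gives the same grouping.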


Appending a list to dataframe and adding count

I have a pandas data frame and a list -
d={'abc':[0,2,4,5,2,2],'bec':[0,5,6,4,0,2],'def':[7,6,0,1,1,2],'rtr':[5,6,7,2,0,3],'rwr':[5,6,7,1,0,5],'xx':[4,5,6,7,8,7]}
X=pd.DataFrame(d)
abc bec def rtr rwr xx
0 0 0 7 5 5 4
1 2 5 6 6 6 5
2 4 6 0 7 7 6
3 5 4 1 2 1 7
4 2 0 1 0 0 8
5 2 2 2 3 5 7
l=[ 'bec','def','cef','ghd','rtr','fgh','ewr']
Now I want to append the list to the data frame in the following way:
For each row in the dataframe, we count the number of non-zero elements in it (it is 4 for the first row).
We take 50% of that count, rounded down (so 2 for the first row), and we append that many elements from the list l to the row, starting from the beginning. For the first row these are 'bec' and 'def'; since both are already present in the row, we increase their counts by 1.
If an element from the list is not present in the dataframe, we append it as a new column.
Dry run:
For row 1 (index 1), the number of non-zero elements is 6, so 50% of it is 3 and we take the first 3 elements of the list: ['bec', 'def', 'cef']. 'bec' is already present, so its count increases by 1 from 5 to 6. Similarly, 'def' is present, so it goes from 6 to 7. 'cef' isn't present in the dataframe, so we add it as a new column and set its count to 1.
The final output looks like this:
abc bec def rtr rwr xx cef
0 0 1 8 5 5 4 0
1 2 6 7 6 6 5 1
2 4 7 1 7 7 6 0
3 5 5 2 2 1 7 1
4 2 1 1 0 0 8 0
5 2 1 1 3 5 7 1
We can use ne + sum along axis=1 to count the non-zero values in each row, followed by floordiv with 2 to keep only 50% of these counts. Next, create a list of records with the help of dict.fromkeys inside a list comprehension, build a dataframe y from these records, and add it to X to get the desired result:
y = pd.DataFrame(dict.fromkeys(l[:i], 1)          # one record of 1s per row, built from a slice of l
                 for i in X.ne(0).sum(1).floordiv(2).astype(int))
X.add(y.fillna(0), fill_value=0).astype(int)      # align on columns; columns missing on either side count as 0
abc bec cef def rtr rwr xx
0 0 1 0 8 5 5 4
1 2 6 1 7 6 6 5
2 4 7 0 1 7 7 6
3 5 5 1 2 2 1 7
4 2 1 0 1 0 0 8
5 2 3 1 3 3 5 7
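The intermediate steps can be inspected on their own (a sketch reusing X and l from above; counts, half and records are names introduced here):
counts = X.ne(0).sum(1)       # non-zero values per row: 4, 6, 5, 6, 3, 6
half = counts.floordiv(2)     # 50% of each count, rounded down: 2, 3, 2, 3, 1, 3
records = [dict.fromkeys(l[:i], 1) for i in half]  # e.g. {'bec': 1, 'def': 1} for row 0
y = pd.DataFrame(records)     # columns bec, def, cef holding 1s and NaN
print(X.add(y.fillna(0), fill_value=0).astype(int))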

Conditional cumcount of values in second column

I want to fill in numbers in column flag, based on the value in column KEY.
Instead of using cumcount() to fill in incremental numbers, I want to fill in the same number for every two rows while the value in column KEY stays the same.
If the value in column KEY changes, the filled number changes as well.
Here is an example; df1 is what I want to get from df0.
df0 = pd.DataFrame({'KEY':['0','0','0','0','1','1','1','2','2','2','2','2','3','3','3','3','3','3','4','5','6']})
df1 = pd.DataFrame({'KEY':['0','0','0','0','1','1','1','2','2','2','2','2','3','3','3','3','3','3','4','5','6'],
'flag':['0','0','1','1','2','2','3','4','4','5','5','6','7','7','8','8','9','9','10','11','12']})
You want to get the cumcount within each KEY group and add one, then use % 2 to flag every odd row within a group. The cumulative sum of that flag increases once per pair of rows, and again whenever KEY changes (since cumcount restarts at the first row of each group); subtract 1 to start counting from zero.
You can use:
df0['flag'] = ((df0.groupby('KEY').cumcount() + 1) % 2).cumsum() - 1
df0
Out[1]:
KEY flag
0 0 0
1 0 0
2 0 1
3 0 1
4 1 2
5 1 2
6 1 3
7 2 4
8 2 4
9 2 5
10 2 5
11 2 6
12 3 7
13 3 7
14 3 8
15 3 8
16 3 9
17 3 9
18 4 10
19 5 11
20 6 12
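To see how the expression builds up, each intermediate can be kept as its own column (a sketch using df0 from above; the column names are introduced here):
steps = df0.copy()
steps['n'] = df0.groupby('KEY').cumcount() + 1  # 1, 2, 3, ... within each KEY
steps['odd'] = steps['n'] % 2                   # 1 on the 1st, 3rd, 5th row of a group
steps['flag'] = steps['odd'].cumsum() - 1       # the flag bumps on every odd row
print(steps.head(8))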

Filtering pandas dataframe groups based on groups comparison

I am trying to remove corrupted data from my pandas dataframe. I want to remove groups from the dataframe whose value differs from the previous group's value by more than one. Here is an example:
Value
0 1
1 1
2 1
3 2
4 2
5 2
6 8
7 8
8 3
9 3
(If I group by Value, the group at rows 6-7 has a value larger than the previous group's value by 6, so I want to remove this group from the dataframe.)
Expected result:
Value
0 1
1 1
2 1
3 2
4 2
5 2
6 3
7 3
Edit:
jezrael's solution is great, but in my case duplicate group values are possible:
Value
0 1
1 1
2 1
3 3
4 3
5 3
6 1
7 1
Sorry if I was not clear about this.
First remove duplicates to get the unique group values, then compare each difference with the shifted values, and last filter by boolean indexing:
s = df['Value'].drop_duplicates()   # one value per group
v = s[s.diff().gt(s.shift())]       # group values whose jump exceeds the previous value
df = df[~df['Value'].isin(v)]
print(df)
Value
0 1
1 1
2 1
3 2
4 2
5 2
8 3
9 3
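Step by step, starting from the original example frame (a sketch; jump and bad are names introduced here):
s = df['Value'].drop_duplicates()  # one row per group: 1, 2, 8, 3
jump = s.diff()                    # NaN, 1, 6, -5
bad = s[jump.gt(s.shift())]        # 8, since its jump of 6 exceeds the previous value 2
print(df[~df['Value'].isin(bad)])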
Maybe:
df2 = df.drop_duplicates()
keep = df2.loc[~df2['Value'].gt(df2['Value'].shift(-1)), 'Value'].tolist()
print(df[df['Value'].isin(keep)])
Output:
Value
0 1
1 1
2 1
3 2
4 2
5 2
8 3
9 3
We can check whether the difference from the previous row is less than or equal to 5 (or NaN, for the first row). After that, we keep only the rows whose value still appears more than once, which drops the now-isolated 8:
s = df[df['Value'].diff().le(5) | df['Value'].diff().isna()]
s[s.duplicated(keep=False)]
Value
0 1
1 1
2 1
3 2
4 2
5 2
8 3
9 3
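The intermediate mask of this approach can be checked directly, starting from the original frame (a sketch; d is a name introduced here):
d = df['Value'].diff()              # NaN, 0, 0, 1, 0, 0, 6, 0, -5, 0
s = df[d.le(5) | d.isna()]          # drops row 6, whose jump of 6 exceeds 5
print(s[s.duplicated(keep=False)])  # also drops row 7: value 8 now occurs only once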

Binning all values with pandas.cut

I have a dataframe that looks like the following:
index value
1 21.046091
2 52.400000
3 14.082153
4 1.859942
5 1.859942
6 2.331143
7 9.060000
8 0.789265
9 12967.7
The last value is much higher than the rest. I'm trying to bin all the values into 5 bins using pd.cut:
pd.cut(df['value'], 5, labels = [1,2,3,4,5])
But it only ends up returning the groups 1 and 5.
index value group
0 0.410000 1
1 21.046091 1
2 52.400000 1
3 14.082153 1
4 1.859942 1
5 1.859942 1
6 2.331143 1
7 9.060000 1
8 0.789265 1
9 12967.7 5
The high value is clearly throwing the binning off, but is there a way to ensure that all five bins are represented in the dataframe without getting rid of the outlying values?
You could use qcut, which bins on sample quantiles (equal-frequency bins) rather than equal-width intervals, so a single outlier no longer dominates the bin edges:
pd.qcut(df['value'],5,labels=[1,2,3,4,5])
Output:
index
1 4
2 5
3 4
4 1
5 1
6 2
7 3
8 1
9 5
Name: value, dtype: category
Categories (5, int64): [1 < 2 < 3 < 4 < 5]
print(df.assign(group = pd.qcut(df['value'],5,labels=[1,2,3,4,5])))
value group
index
1 21.046091 4
2 52.400000 5
3 14.082153 4
4 1.859942 1
5 1.859942 1
6 2.331143 2
7 9.060000 3
8 0.789265 1
9 12967.700000 5
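To confirm that the quantile bins stay balanced despite the outlier, the counts per bin can be checked (a sketch that rebuilds the example frame):
import pandas as pd

df = pd.DataFrame({'value': [21.046091, 52.4, 14.082153, 1.859942, 1.859942,
                             2.331143, 9.06, 0.789265, 12967.7]},
                  index=range(1, 10))
df['group'] = pd.qcut(df['value'], 5, labels=[1, 2, 3, 4, 5])

# each quantile bin holds roughly the same number of rows,
# so the outlier no longer collapses everything into bin 1
print(df['group'].value_counts().sort_index())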

Sort pandas DataFrame by multiple columns and duplicated index

I have a pandas DataFrame with duplicated indices. There are 3 rows with each index, and they correspond to a group of items. There are two columns, a and b.
import pandas
df = pandas.DataFrame([{'i': b % 4, 'a': abs(b - 6), 'b': b}
                       for b in range(12)]).set_index('i')
I want to sort the DataFrame so that:
All of the rows with the same indices are adjacent. (all of the groups are together)
The groups are in reverse order by the lowest value of a within the group.
For example, in the above df, the first three items should be the ones with index 0, because the lowest a value for those three rows is 2, and all of the other groups have at least one row with an a value lower than 2. The second three items could be either group 3 or group 1, because the lowest a value in both of those groups is 1. The last group of items should be group 2, because it has a row with an a value of 0.
Within each group, the items are sorted in ascending order by b.
Desired output:
a b
i
0 6 0
0 2 4
0 2 8
3 3 3
3 1 7
3 5 11
1 5 1
1 1 5
1 3 9
2 4 2
2 0 6
2 4 10
I've been trying something like:
df.groupby('i')[['a']].transform(min).sort(['a', 'b'], ascending=[0, 1])
But it gives me a KeyError, and it only gets that far if I make i a column instead of an index anyway.
The most straightforward way I see is moving your index to a column, and calculating a new column with the group min.
In [43]: df = df.reset_index()
In [45]: df['group_min'] = df.groupby('i')['a'].transform('min')
Then you can sort by your conditions:
In [49]: df.sort_values(['group_min', 'i', 'b'], ascending=[False, False, True])
Out[49]:
i a b group_min
0 0 6 0 2
4 0 2 4 2
8 0 2 8 2
3 3 3 3 1
7 3 1 7 1
11 3 5 11 1
1 1 5 1 1
5 1 1 5 1
9 1 3 9 1
2 2 4 2 0
6 2 0 6 0
10 2 4 10 0
To get back to your desired frame, drop the helper column and restore i as the index.
In [50]: df.sort_values(['group_min', 'i', 'b'], ascending=[False, False, True]).drop('group_min', axis=1).set_index('i')
Out[50]:
a b
i
0 6 0
0 2 4
0 2 8
3 3 3
3 1 7
3 5 11
1 5 1
1 1 5
1 3 9
2 4 2
2 0 6
2 4 10
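The reset_index round-trip can also be avoided, since sort_values accepts index level names (pandas >= 0.23). A sketch of the same idea in one chain, starting again from the original frame with i as the index:
out = (df.assign(group_min=df.groupby(level='i')['a'].transform('min'))
         .sort_values(['group_min', 'i', 'b'], ascending=[False, False, True])
         .drop(columns='group_min'))
print(out)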
You can first sort by a in descending order (b ascending to break ties) and then sort the index. DataFrame.sort was removed in pandas 0.20, so use sort_values, with a stable kind for the index sort so the within-group order is preserved:
>>> df.sort_values(['a', 'b'], ascending=[False, True]).sort_index(kind='mergesort')
a b
i
0 6 0
0 2 4
0 2 8
1 5 1
1 3 9
1 1 5
2 4 2
2 4 10
2 0 6
3 5 11
3 3 3
3 1 7
