Conditional cumcount of values in second column - python

I want to fill the column flag with numbers based on the values in column KEY.
Instead of using cumcount() to fill in incrementing numbers, I want to fill in the same number for every two rows as long as the value in column KEY stays the same.
When the value in column KEY changes, the number changes as well.
Here is an example; df1 is what I want to get from df0.
df0 = pd.DataFrame({'KEY':['0','0','0','0','1','1','1','2','2','2','2','2','3','3','3','3','3','3','4','5','6']})
df1 = pd.DataFrame({'KEY':['0','0','0','0','1','1','1','2','2','2','2','2','3','3','3','3','3','3','4','5','6'],
'flag':['0','0','1','1','2','2','3','4','4','5','5','6','7','7','8','8','9','9','10','11','12']})

Take the cumcount within each KEY and add one, then use % 2 to tell odd rows from even rows (the result is 1 on the first row of each pair and 0 on the second). Finally, take the cumulative sum and subtract 1 so that the numbering starts from zero.
You can use:
df0['flag'] = ((df0.groupby('KEY').cumcount() + 1) % 2).cumsum() - 1
df0
Out[1]:
KEY flag
0 0 0
1 0 0
2 0 1
3 0 1
4 1 2
5 1 2
6 1 3
7 2 4
8 2 4
9 2 5
10 2 5
11 2 6
12 3 7
13 3 7
14 3 8
15 3 8
16 3 9
17 3 9
18 4 10
19 5 11
20 6 12
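To make the steps above easier to follow, here is a small sketch that splits the one-liner into its intermediate Series. The names pos, starts_pair and flag are only illustrative and not part of the original answer. Note that groupby works on the KEY values themselves, so this assumes each KEY forms one contiguous block of rows, as it does in the example.

```python
import pandas as pd

keys = ['0', '0', '0', '0', '1', '1', '1', '2', '2', '2', '2', '2',
        '3', '3', '3', '3', '3', '3', '4', '5', '6']
df0 = pd.DataFrame({'KEY': keys})

# position of each row inside its KEY group: 0, 1, 2, ...
pos = df0.groupby('KEY').cumcount()

# 1 on the 1st, 3rd, 5th, ... row of each group, 0 on the 2nd, 4th, ...
starts_pair = (pos + 1) % 2

# every 1 opens a new pair, so the running total numbers the pairs;
# subtracting 1 makes the numbering start at zero
flag = starts_pair.cumsum() - 1

df0['flag'] = flag
print(df0)
```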

Related

Find maximum and minimum value of five consecutive rows by column

I want to get the maximum and minimum values of some columns, grouped by every 5 consecutive rows. For example, I want the maximum of a and the minimum of b for each block of 5 consecutive rows:
a b
0 1 2
1 2 3
2 3 4
3 2 5
4 1 1
5 3 6
6 2 8
7 5 2
8 4 6
9 2 7
I want to have
a b
0 3 1
1 5 2
(Where 3 is the maximum of 1,2,3,2,1 and 1 is the minimum of 2,3,4,5,1, and so on)
Use integer division (//) to form the index for grouping by every 5 items, and then use groupby and agg:
out = df.groupby(df.index // 5).agg({'a':'max', 'b':'min'})
Output:
>>> out
a b
0 3 1
1 5 2
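The df.index // 5 trick relies on the default 0, 1, 2, ... index. As a hedged variation (not part of the answer above), the block labels can be built from row positions with np.arange, which should also work when the index is not a plain integer range. The letter index below is made up purely to illustrate this.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'a': [1, 2, 3, 2, 1, 3, 2, 5, 4, 2],
     'b': [2, 3, 4, 5, 1, 6, 8, 2, 6, 7]},
    index=list('abcdefghij'),  # hypothetical non-numeric index
)

# label rows by position: 0,0,0,0,0,1,1,1,1,1
block = np.arange(len(df)) // 5

out = df.groupby(block).agg({'a': 'max', 'b': 'min'})
print(out)
```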

How to convert the unique values of a column into columns

I have this DataFrame:
dd = pd.DataFrame({'a':[1,1,1,1,2,2,2,2],'feature':[10,10,20,20,10,10,20,20],'h':['h_30','h_60','h_30','h_60','h_30','h_60','h_30','h_60'],'count':[1,2,3,4,5,6,7,8]})
a feature h count
0 1 10 h_30 1
1 1 10 h_60 2
2 1 20 h_30 3
3 1 20 h_60 4
4 2 10 h_30 5
5 2 10 h_60 6
6 2 20 h_30 7
7 2 20 h_60 8
I want to turn the unique values of column h into columns and use the count numbers as their values, like this:
a feature h_30 h_60
0 1 10 1 2
1 1 20 3 4
2 2 10 5 6
3 2 20 7 8
I tried this but got an error saying ValueError: Length of passed values is 8, index implies 2
dd.pivot(index = ['a','feature'],columns ='h',values = 'count' )
DataFrame.pivot does not accept a list of columns as index in pandas versions below 1.1.0. From the documentation:
Changed in version 1.1.0: Also accept list of index names.
Try this:
import pandas as pd
# axis is passed by keyword because newer pandas versions no longer accept it positionally
pd.pivot_table(
    dd, index=["a", "feature"], columns="h", values="count"
).reset_index().rename_axis(None, axis=1)
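As an alternative sketch (a different technique from the pivot_table call above, not the answerer's method), the same reshaping can be done with set_index and unstack. This does not depend on pivot accepting a list for index and involves no aggregation, so it assumes each (a, feature, h) combination occurs only once.

```python
import pandas as pd

dd = pd.DataFrame({
    'a': [1, 1, 1, 1, 2, 2, 2, 2],
    'feature': [10, 10, 20, 20, 10, 10, 20, 20],
    'h': ['h_30', 'h_60', 'h_30', 'h_60', 'h_30', 'h_60', 'h_30', 'h_60'],
    'count': [1, 2, 3, 4, 5, 6, 7, 8],
})

wide = (
    dd.set_index(['a', 'feature', 'h'])['count']  # Series with a 3-level index
      .unstack('h')                               # move level h into the columns
      .reset_index()                              # a and feature become columns again
      .rename_axis(columns=None)                  # drop the leftover "h" columns name
)
print(wide)
```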

Complex comparison with multiple columns simultaneously

I have the following pandas sample dataset:
Dim1 Dim2 Dim3 Dim4
0 1 2 7 15
1 1 10 12 2
2 9 19 18 16
3 4 2 4 15
4 8 1 9 5
5 14 18 3 14
6 19 9 9 17
I want to make a complex comparison based on all 4 columns and generate a column called Domination_count. For every row, I want to calculate how many other rows the given one dominates. Domination is defined as "being better in one dimension, while not being worse in the others". A is better than B if the value of A is less than B.
The final result should become:
Dim1 Dim2 Dim3 Dim4 Domination_count
0 1 2 7 15 2
1 1 10 12 2 1
2 9 19 18 16 0
3 4 2 4 15 2
4 8 1 9 5 2
5 14 18 3 14 0
6 19 9 9 17 0
Some explanation behind the final numbers:
the option 0 is better than option 2 and 6
the option 1 is better than option 2
option 2, 5,6 are better than no other option
option 3 and 4 are better than option 2, 6
I could not think of any code that allows me to compare multiple columns simultaneously. I found this approach which does not do the comparison simultaneously.
Improving on the answer:
My first answer worked if there were no equal rows. In the case of equal rows they would increment the domination count because they are not worse than the other rows.
This somewhat simpler solution takes care of that problem.
import pandas as pd
# create a dataframe with a duplicate row (the last row repeats row 5)
df = pd.DataFrame(
    [[1, 2, 7, 15], [1, 10, 12, 2], [9, 19, 18, 16], [4, 2, 4, 15],
     [8, 1, 9, 5], [14, 18, 3, 14], [19, 9, 9, 17], [14, 18, 3, 14]],
    columns=['Dim1', 'Dim2', 'Dim3', 'Dim4'],
)
df2 = df.copy()
def domination(row, df):
    # keep only the rows where the given row is not worse in any column
    df = df[(row <= df).all(axis=1)]
    # of those, keep the rows where the given row is strictly better in at least one column
    df = df[(row < df).any(axis=1)]
    return len(df)
df['Domination_count'] = df.apply(domination, args=(df,), axis=1)
df
This correctly applies the criteria in the post and does not count the duplicate row in the Domination_count column:
Dim1 Dim2 Dim3 Dim4 Domination_count
0 1 2 7 15 2
1 1 10 12 2 1
2 9 19 18 16 0
3 4 2 4 15 2
4 8 1 9 5 2
5 14 18 3 14 0
6 19 9 9 17 0
7 14 18 3 14 0
My previous solution counts the equal rows:
df2['Domination_count'] = df2.apply(lambda x: (x <= df2).all(axis=1).sum() -1, axis=1)
df2
Dim1 Dim2 Dim3 Dim4 Domination_count
0 1 2 7 15 2
1 1 10 12 2 1
2 9 19 18 16 0
3 4 2 4 15 2
4 8 1 9 5 2
5 14 18 3 14 1
6 19 9 9 17 0
7 14 18 3 14 1
Original Solution
I like this as a solution. It takes each row of the dataframe and compares each of its elements to all rows of the dataframe to see whether that element is less than or equal to the corresponding element of the other rows (not worse than). Then it counts the rows for which all of the elements are not worse. This count includes the current row, which is never worse than itself, so we subtract 1.
df['Domination_count'] = df.apply(lambda x: (x <= df).all(axis=1).sum() -1, axis=1)
The result is:
Dim1 Dim2 Dim3 Dim4 Domination_count
0 1 2 7 15 2
1 1 10 12 2 1
2 9 19 18 16 0
3 4 2 4 15 2
4 8 1 9 5 2
5 14 18 3 14 0
6 19 9 9 17 0
In one line using list comprehension:
df['Domination_count'] = [(df.loc[df.index!=row] - df.loc[row].values.squeeze() > 0).all(axis = 1).sum() for row in df.index]
Subtract each row from all remaining rows elementwise, then count rows with all positive values (meaning that each corresponding value in the row we subtracted was lower) in the resulting dataframe.
I may have gotten your definition of domination wrong, so perhaps you'll need to change strict positivity check for whatever you need.
A simple iterative solution:
import numpy as np
df['Domination_count'] = 0  # initialize column to zero
cols = df.columns[:-1]  # select all columns but Domination_count
for i in range(len(df.index)):  # loop over every row i
    for j in range(len(df.index)):  # compare it with every other row j
        # row i is not worse than row j in any column (equal rows are counted too)
        if i != j and np.all(df.loc[i, cols] <= df.loc[j, cols]):
            df.loc[i, 'Domination_count'] += 1  # increment by 1
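For larger frames, the same "not worse anywhere, strictly better somewhere" rule can also be sketched with NumPy broadcasting instead of a row-wise apply or nested loops. This is a swapped-in technique rather than any of the answers above. It builds n x n x d boolean arrays, so it assumes the frame is small enough for that to fit in memory. The names values, a, b, not_worse and strictly_better are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 7, 15], [1, 10, 12, 2], [9, 19, 18, 16], [4, 2, 4, 15],
     [8, 1, 9, 5], [14, 18, 3, 14], [19, 9, 9, 17]],
    columns=['Dim1', 'Dim2', 'Dim3', 'Dim4'],
)

values = df[['Dim1', 'Dim2', 'Dim3', 'Dim4']].to_numpy()
a = values[:, None, :]  # shape (n, 1, d): the candidate dominating row
b = values[None, :, :]  # shape (1, n, d): the row it is compared against

# a is not worse than b in any dimension (lower is better)
not_worse = (a <= b).all(axis=2)
# a is strictly better than b in at least one dimension
strictly_better = (a < b).any(axis=2)

# entry [i, j] is True when row i dominates row j; a row never dominates
# itself or an identical row because strictly_better is False there
df['Domination_count'] = (not_worse & strictly_better).sum(axis=1)
print(df)
```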

Filtering pandas dataframe groups based on groups comparison

I am trying to remove corrupted data from my pandas dataframe. I want to remove every group whose value differs from the previous group's value by more than one. Here is an example:
Value
0 1
1 1
2 1
3 2
4 2
5 2
6 8 <- here number of group if I groupby by Value is larger than
7 8 the last groups number by 6, so I want to remove this
8 3 group from dataframe
9 3
Expected result:
Value
0 1
1 1
2 1
3 2
4 2
5 2
6 3
7 3
Edit:
jezrael's solution is great, but in my case it is possible that there will be duplicate group values:
Value
0 1
1 1
2 1
3 3
4 3
5 3
6 1
7 1
Sorry if I was not clear about this.
First remove duplicates to get the unique values, then compare the differences with the shifted values, and finally filter by boolean indexing:
s = df['Value'].drop_duplicates()
v = s[s.diff().gt(s.shift())]
df = df[~df['Value'].isin(v)]
print (df)
Value
0 1
1 1
2 1
3 2
4 2
5 2
8 3
9 3
Maybe:
df2 = df.drop_duplicates()
print(df[df['Value'].isin(df2.loc[~df2['Value'].gt(df2['Value'].shift(-1)), 'Value'].tolist())])
Output:
Value
0 1
1 1
2 1
3 2
4 2
5 2
8 3
9 3
We can keep the rows where the difference to the previous row is less than or equal to 5 or is NaN. Afterwards, we check for duplicates and keep only those rows:
s = df[df['Value'].diff().le(5) | df['Value'].diff().isna()]
s[s.duplicated(keep=False)]
Value
0 1
1 1
2 1
3 2
4 2
5 2
8 3
9 3
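None of the answers above is written for the case in the Edit, where the same value can come back in a later group. One possible sketch, under stated assumptions, is to label consecutive runs of equal values and compare each run with the last run that was kept. The helper drop_jumping_runs and its max_jump parameter are hypothetical names. The sketch assumes the first run is always valid and that a jump is measured as an absolute difference; neither is confirmed by the question.

```python
import pandas as pd

def drop_jumping_runs(df, col='Value', max_jump=1):
    # a new run starts whenever the value differs from the previous row
    run_id = df[col].ne(df[col].shift()).cumsum()
    # one representative value per run, indexed by run label
    run_values = df[col].groupby(run_id).first()

    keep = []
    last_kept = None
    for rid, val in run_values.items():
        # assumption: the first run is trusted; later runs are kept only if
        # they are within max_jump of the most recently kept run
        if last_kept is None or abs(val - last_kept) <= max_jump:
            keep.append(rid)
            last_kept = val
    return df[run_id.isin(keep)]

df = pd.DataFrame({'Value': [1, 1, 1, 2, 2, 2, 8, 8, 3, 3]})
result = drop_jumping_runs(df)
print(result)

# the data from the Edit, where the value 1 appears in two separate runs
df_edit = pd.DataFrame({'Value': [1, 1, 1, 3, 3, 3, 1, 1]})
result_edit = drop_jumping_runs(df_edit)
print(result_edit)
```

Comparing against the last kept run, rather than the immediately preceding one, is what lets the 3 group survive in the first example even though it follows the removed 8 group.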

Columns that sums the last x instances of value y to another column

I have a dataset that looks like the following. The "HomeForm" column is what I'm trying to create and fill with values, i.e. the output.
HomeTeam AwayTeam FTHG FTAG HomeForm
Date
9 0 12 1 0
9 2 3 0 0
9 4 13 1 0
9 8 5 0 3
9 10 16 4 1
9 14 19 0 3
9 17 7 1 4
8 1 9 0 4
8 18 11 1 2
7 6 15 3 1
What I'm trying to do is to create another column called "HomeForm" that has, say, the sum of goals scored by the Home team in each of the last 6 matches. Bear in mind that the team can appear either in the "HomeTeam" column or in the "AwayTeam" column. What would be the best way to achieve this using python?
Thanks.
