I have the following pandas sample dataset:
Dim1 Dim2 Dim3 Dim4
0 1 2 7 15
1 1 10 12 2
2 9 19 18 16
3 4 2 4 15
4 8 1 9 5
5 14 18 3 14
6 19 9 9 17
I want to make a complex comparison based on all 4 columns and generate a column called Domination_count. For every row, I want to calculate how many other rows the given one dominates. Domination is defined as "being better in one dimension, while not being worse in the others". A is better than B in a dimension if A's value is less than B's.
The final result should become:
Dim1 Dim2 Dim3 Dim4 Domination_count
0 1 2 7 15 2
1 1 10 12 2 1
2 9 19 18 16 0
3 4 2 4 15 2
4 8 1 9 5 2
5 14 18 3 14 0
6 19 9 9 17 0
Some explanation behind the final numbers:
option 0 is better than options 2 and 6
option 1 is better than option 2
options 2, 5, and 6 are better than no other option
options 3 and 4 are better than options 2 and 6
I could not think of any code that would let me compare multiple columns simultaneously. I found this approach, which does not do the comparison simultaneously.
Improving on the answer:
My first answer worked if there were no equal rows. Equal rows would increment each other's domination count, because they are not worse than one another.
This somewhat simpler solution takes care of that problem.
import pandas as pd

# create a dataframe with a duplicate row (the last row repeats [14, 18, 3, 14])
df = pd.DataFrame([[1, 2, 7, 15], [1, 10, 12, 2], [9, 19, 18, 16], [4, 2, 4, 15],
                   [8, 1, 9, 5], [14, 18, 3, 14], [19, 9, 9, 17], [14, 18, 3, 14]],
                  columns=['Dim1', 'Dim2', 'Dim3', 'Dim4'])
df2 = df.copy()
def domination(row, df):
    # filter for all rows where none of the columns are worse
    df = df[(row <= df).all(axis=1)]
    # filter for rows where any column is better
    df = df[(row < df).any(axis=1)]
    return len(df)

df['Domination_count'] = df.apply(domination, args=[df], axis=1)
df
This correctly accounts for the criteria in the post and does not count the duplicate row in Domination_count:
Dim1 Dim2 Dim3 Dim4 Domination_count
0 1 2 7 15 2
1 1 10 12 2 1
2 9 19 18 16 0
3 4 2 4 15 2
4 8 1 9 5 2
5 14 18 3 14 0
6 19 9 9 17 0
7 14 18 3 14 0
My previous solution counts the equal rows:
df2['Domination_count'] = df2.apply(lambda x: (x <= df2).all(axis=1).sum() - 1, axis=1)
df2
Dim1 Dim2 Dim3 Dim4 Domination_count
0 1 2 7 15 2
1 1 10 12 2 1
2 9 19 18 16 0
3 4 2 4 15 2
4 8 1 9 5 2
5 14 18 3 14 1
6 19 9 9 17 0
7 14 18 3 14 1
Original Solution
I like this as a solution. It takes each row of the dataframe and compares each element to all rows of the dataframe to see if that element is less than or equal to the other rows (not worse than). Then it counts the rows where all of the elements are not worse than the other rows. This count includes the current row, which is never worse than itself, so we subtract 1.
df['Domination_count'] = df.apply(lambda x: (x <= df).all(axis=1).sum() - 1, axis=1)
The result is:
Dim1 Dim2 Dim3 Dim4 Domination_count
0 1 2 7 15 2
1 1 10 12 2 1
2 9 19 18 16 0
3 4 2 4 15 2
4 8 1 9 5 2
5 14 18 3 14 0
6 19 9 9 17 0
In one line using list comprehension:
df['Domination_count'] = [(df.loc[df.index != row] - df.loc[row].values.squeeze() > 0).all(axis=1).sum() for row in df.index]
Subtract each row from all remaining rows elementwise, then count the rows in the resulting dataframe where all values are positive (meaning every corresponding value in the row we subtracted was lower).
I may have gotten your definition of domination wrong, so perhaps you'll need to change the strict positivity check to whatever you need.
A simple iterative solution:
import numpy as np

df['Domination_count'] = 0        # initialize column to zero
cols = df.columns[:-1]            # select all columns but Domination_count
for i in range(len(df.index)):    # compare every row i against every other row j
    for j in range(len(df.index)):
        if np.all(df.loc[i, cols] <= df.loc[j, cols]) and i != j:  # row i is nowhere worse than row j
            df.loc[i, 'Domination_count'] += 1  # increment by 1
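For larger frames, the same pairwise check can be vectorized with NumPy broadcasting. This is only a sketch of an alternative, not part of the answers above, and it assumes the same domination definition (nowhere worse, strictly better somewhere):
import numpy as np

vals = df[['Dim1', 'Dim2', 'Dim3', 'Dim4']].to_numpy()

# broadcast to an (n, n, 4) comparison cube: entry [i, j] compares row i to row j
not_worse = (vals[:, None, :] <= vals[None, :, :]).all(axis=2)
strictly_better = (vals[:, None, :] < vals[None, :, :]).any(axis=2)

# row i dominates row j when it is nowhere worse and somewhere strictly better;
# i == j and duplicate rows drop out because strictly_better is False there
df['Domination_count'] = (not_worse & strictly_better).sum(axis=1)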
Related
How to split a single column containing 1000 rows into two columns of 500 rows each in pandas.
I have a csv file that contains a single column and I need to split this into multiple columns. Below is the format in csv.
Steps I took:
I had multiple csv files, each containing one column with 364 rows. I converted them into dataframes and concatenated them, but the files are simply stacked end to end.
Code I tried
monthly_list = []
for file in ['D0_monthly.csv', 'c1_monthly.csv', 'c2_monthly.csv', 'c2i_monthly.csv',
             'c3i_monthly.csv', 'c4i_monthly.csv', 'D1_monthly.csv', 'D2i_monthly.csv',
             'D3i_monthly.csv', 'D4i_monthly.csv', 'D2j_monthly.csv', 'D3j_monthly.csv',
             'D4j_monthly.csv', 'c2j_monthly.csv', 'c3j_monthly.csv', 'c4j_monthly.csv']:
    monthly_file = pd.read_csv(file, header=None, index_col=None, skiprows=[0])
    monthly_list.append(monthly_file)
monthly_all_file = pd.concat(monthly_list)
How the data is:
column1
1
2
3
.
.
364
1
2
3
.
.
364
I need to split the above column into the format shown below.
What the data should be:
column1  column2
1        1
2        2
3        3
4        4
5        5
.        .
.        .
.        .
364      364
Answer updated to work for an arbitrary number of columns
You can start from either the number of columns or the column length; given the total number of values, each determines the other. In this answer I use the desired target column length, tgt_row_len.
import numpy as np
import pandas as pd

nb_groups = 4
tgt_row_len = 5
df = pd.DataFrame({'column1': np.arange(1, tgt_row_len * nb_groups + 1)})
print(df)
column1
0 1
1 2
2 3
3 4
4 5
5 6
6 7
...
17 18
18 19
19 20
Create groups in the index for the following grouping operation
df.index = df.reset_index(drop=True).index // tgt_row_len
column1
0 1
0 2
0 3
0 4
0 5
1 6
1 7
...
3 17
3 18
3 19
3 20
dfn = (
    df.groupby(level=0)
      .apply(lambda x: x['column1'].reset_index(drop=True)).T
      .rename(columns=lambda x: 'col' + str(x + 1))
      .rename_axis(None)
)
print(dfn)
col1 col2 col3 col4
0 1 6 11 16
1 2 7 12 17
2 3 8 13 18
3 4 9 14 19
4 5 10 15 20
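Since the values are purely positional, the same table can also be built with a plain NumPy reshape. This is a sketch of an alternative, reusing df, nb_groups and tgt_row_len from above, and it assumes the total length divides evenly:
arr = df['column1'].to_numpy().reshape(nb_groups, tgt_row_len)
dfn = pd.DataFrame(arr.T, columns=[f'col{i + 1}' for i in range(nb_groups)])  # same col1..col4 layout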
Previous answer that handles creating two columns
This answer just shows 10 target rows as an example. That can easily be changed to 364 or 500.
A dataframe where column1 contains 2 sets of 10 rows
tgt_row_len = 10
df = pd.DataFrame({'column1': np.tile(np.arange(1,tgt_row_len+1),2)})
print(df)
column1
0 1
1 2
2 3
3 4
4 5
5 6
6 7
7 8
8 9
9 10
10 1
11 2
12 3
13 4
14 5
15 6
16 7
17 8
18 9
19 10
Move the bottom set of rows to column2
df.assign(column2=df['column1'].shift(-tgt_row_len)).iloc[:tgt_row_len].astype(int)
column1 column2
0 1 1
1 2 2
2 3 3
3 4 4
4 5 5
5 6 6
6 7 7
7 8 8
8 9 9
9 10 10
I don't know if anyone has a more efficient solution, but using pd.merge on a temp column should solve your issue. Here is a quick implementation of what you could write.
csv1['temp'] = 1
csv2['temp'] = 1
new_df = pd.merge(csv1, csv2, on=['temp'])
new_df = new_df.drop('temp', axis=1)
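As a side note, pandas 1.2+ supports a cross join directly, which removes the need for the temp column:
new_df = pd.merge(csv1, csv2, how='cross')  # cross join without a helper column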
I hope this helps!
I have this column (similar but with a lot more entries)
import pandas as pd

numbers = range(1, 16)
sequence = []
for number in numbers:
    sequence.append(number)
df = pd.DataFrame(sequence).rename(columns={0: 'sequence'})
and I want to distribute the same values into many more columns periodically (and automatically) to get something like this (but with a bunch of values)
Thanks
Use reshape with 5 for the number of new rows; -1 counts the number of columns automatically:
import numpy as np

numbers = range(1, 16)
df = pd.DataFrame(np.array(numbers).reshape(-1, 5).T)
print (df)
0 1 2
0 1 6 11
1 2 7 12
2 3 8 13
3 4 9 14
4 5 10 15
If the length of the values cannot be evenly divided into N rows, here is a possible solution that pads the missing positions with a fill value:
L = range(1,22)
N = 5
filled = 0
arr = np.full(((len(L) - 1)//N + 1)*N, filled)
arr[:len(L)] = L
df = pd.DataFrame(arr.reshape((-1, N)).T)
print(df)
0 1 2 3 4
0 1 6 11 16 21
1 2 7 12 17 0
2 3 8 13 18 0
3 4 9 14 19 0
4 5 10 15 20 0
Use pandas.Series.values with reshape and the desired rows and columns:
pd.DataFrame(df.sequence.values.reshape(5, -1))
If you would like to reshape after reading the dataframe, then:
df = pd.DataFrame(df.to_numpy().reshape(5,-1))
num_cols = 3
result = pd.DataFrame(df.sequence.to_numpy().reshape(-1, num_cols, order="F"))
for a given number of columns (e.g., 3 here), reshapes df.sequence to (total_number_of_values / num_cols, num_cols), where the first dimension is inferred with -1. The Fortran order matches the desired structure, so the numbers "go down first",
to get
>>> result
0 1 2
0 1 6 11
1 2 7 12
2 3 8 13
3 4 9 14
4 5 10 15
If num_cols = 5, then
>>> result
0 1 2 3 4
0 1 4 7 10 13
1 2 5 8 11 14
2 3 6 9 12 15
I want to fill numbers in column flag based on the value in column KEY.
Instead of using cumcount() to fill incremental numbers, I want to fill the same number for every two rows while the value in column KEY stays the same.
If the value in column KEY changes, the filled number changes as well.
Here is the example, df1 is what I want from df0.
df0 = pd.DataFrame({'KEY': ['0','0','0','0','1','1','1','2','2','2','2','2','3','3','3','3','3','3','4','5','6']})
df1 = pd.DataFrame({'KEY': ['0','0','0','0','1','1','1','2','2','2','2','2','3','3','3','3','3','3','4','5','6'],
                    'flag': ['0','0','1','1','2','2','3','4','4','5','5','6','7','7','8','8','9','9','10','11','12']})
You want to get the cumcount and add one, then use % 2 to alternate between odd and even positions within each group, then take the cumulative sum and subtract 1 to start counting from zero.
You can use:
df0['flag'] = ((df0.groupby('KEY').cumcount() + 1) % 2).cumsum() - 1
df0
Out[1]:
KEY flag
0 0 0
1 0 0
2 0 1
3 0 1
4 1 2
5 1 2
6 1 3
7 2 4
8 2 4
9 2 5
10 2 5
11 2 6
12 3 7
13 3 7
14 3 8
15 3 8
16 3 9
17 3 9
18 4 10
19 5 11
20 6 12
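If the one-liner is hard to follow, here is a step-by-step breakdown of the intermediate values; the helper column names are just for illustration:
steps = df0[['KEY']].copy()
steps['cumcount'] = df0.groupby('KEY').cumcount()  # 0,1,2,... within each KEY group
steps['parity'] = (steps['cumcount'] + 1) % 2      # 1,0,1,0,... flips on every row of a group
steps['flag'] = steps['parity'].cumsum() - 1       # running sum starts a new number every two rows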
I have two Pandas DataFrames (A and B) with 2 columns and different number of rows.
They used to be numpy 2D matrices and they both contain integer values.
Is there any way to retrieve the indices of matching rows between those two?
I've been trying isin() or query() or merge(), without success.
This is actually a follow-up to a previous question: I'm trying with pandas dataframes since the original matrices are rather huge.
The desired output, if possible, should be an array (or list) containing in the i-th position the row index in B for the i-th row of A. E.g. an output list of [1, 5, 4] means that the first row of A has been found in the first row of B, the second row of A has been found in the fifth row of B, and the third row of A has been found in the fourth row of B.
I would do it this way:
In [199]: df1.reset_index().merge(df2.reset_index(), on=['a','b'])
Out[199]:
index_x a b index_y
0 1 9 1 17
1 3 4 0 4
or like this:
In [211]: pd.merge(df1.reset_index(), df2.reset_index(), on=['a','b'], suffixes=['_1','_2'])
Out[211]:
index_1 a b index_2
0 1 9 1 17
1 3 4 0 4
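To turn the merge result into the list described in the question (the B row index for each matched row of A), a possible follow-up using the suffixed index columns from the second variant:
m = pd.merge(df1.reset_index(), df2.reset_index(), on=['a', 'b'], suffixes=['_1', '_2'])
matches = m.sort_values('index_1')['index_2'].tolist()  # [17, 4] for the data below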
data:
In [201]: df1
Out[201]:
a b
0 1 9
1 9 1
2 8 1
3 4 0
4 2 0
5 2 2
6 2 9
7 1 1
8 4 3
9 0 4
In [202]: df2
Out[202]:
a b
0 3 5
1 5 0
2 7 8
3 6 8
4 4 0
5 1 5
6 9 0
7 9 4
8 0 9
9 0 1
10 6 9
11 6 7
12 3 3
13 5 1
14 4 2
15 5 0
16 9 5
17 9 1
18 1 6
19 9 5
Without merging, you can use == and then check each row for a False.
df1 = pd.DataFrame({'a': [0, 1, 2, 3, 4], 'b': [0, 1, 2, 3, 4]})
df2 = pd.DataFrame({'a': [0, 1, 2, 3, 4], 'b': [2, 1, 2, 2, 4]})

test = pd.DataFrame(index=df1.index, columns=['test'])
for row in df1.index:
    if False in (df1 == df2).loc[row].values:
        test.loc[row, 'test'] = False  # .loc replaces the deprecated .ix
    else:
        test.loc[row, 'test'] = True
Out[1]:
test
0 False
1 True
2 True
3 False
4 True
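The loop can also be collapsed into a single vectorized comparison, assuming both frames share the same shape and index:
test = (df1 == df2).all(axis=1)  # True where the entire row matches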
I'm new to Pandas so please bear with me; I have a dataframe A:
one two three
0 1 5 9
1 2 6 10
2 3 7 11
3 4 8 12
And a dataframe B, which represents relationships between columns in A:
       one  two   # these get mutated in place
three    1    1
one      0    0
I need to use this to multiply values in-place with values in other columns. The output should be:
one two three
0 9 45 9
1 20 60 10
2 33 77 11
3 48 96 12
So in this case I have made the adjustments for each row:
one *= three
two *= three
Is there an efficient way to use this with Pandas / Numpy?
Take a look here:
In [37]: df
Out[37]:
one two three
0 1 5 9
1 2 6 10
2 3 7 11
3 4 8 12
In [38]: df['one'] *= df['three']
In [39]: df['two'] *= df['three']
In [40]: df
Out[40]:
one two three
0 9 45 9
1 20 60 10
2 33 77 11
3 48 96 12
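The answer above hardcodes the two multiplications. If you want the dataframe B from the question to drive them generically, here is a hedged sketch; it assumes A and B are the question's frames, with B's index naming the multiplier column and a 1 flagging each target column, as in the example:
# apply every relationship flagged in B: target column *= source column
for target in B.columns:      # 'one', 'two'
    for source in B.index:    # 'three', 'one'
        if B.loc[source, target] == 1:
            A[target] *= A[source]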