Finding pairs of rows with minimum difference between two quantities - python

I have a data frame df with columns a and t, where column "a" holds strings and column "t" holds integers. I want to select the row pair(s) for which the values in column "a" are the same AND the difference between the values in column "t" is the minimum. For example:
df =
     a  t
0  abc  4
1  abc  3
2  def  2
3  abc  1
I want to get the following result:
df =
     a  t
0  abc  4
1  abc  3
I know this can be done with two for loops over the same data frame, but I am looking for a more efficient solution.

You may use:
df = df.sort_values(['a', 't'], ascending=False)        # order so rows sharing 'a' are consecutive, 't' descending
diff_ = df['t'] - df['t'].shift(-1)                     # gap between each row's 't' and the next row's
min_idx = diff_[df['a'] == df['a'].shift(-1)].idxmin()  # smallest gap among consecutive rows with the same 'a'
df.loc[min_idx:min_idx+1]                               # the winning pair (label-based; works here since labels 0 and 1 end up adjacent)
Output:
a t
0 abc 4
1 abc 3
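Note that df.loc[min_idx:min_idx+1] slices by label, so it relies on the two original labels happening to sit next to each other after the sort. A variant that avoids this by resetting the index first (a sketch, not part of the original answer):

df = df.sort_values(['a', 't'], ascending=False).reset_index(drop=True)  # positions now equal labels
diff_ = df['t'] - df['t'].shift(-1)
min_idx = diff_[df['a'] == df['a'].shift(-1)].idxmin()
df.iloc[min_idx:min_idx + 2]  # positional slice: the row with the minimal gap and the one after it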


Python: merge columns with same name, keeping minimum value

I have a big matrix, like this:
df:
   A  A  A  B  B  ... (column names)
A  2  4  5  9  2
A  6  8  7  6  4
A  5  2  6  4  5
B  3  4  1  3  4
B  4  5  3  1  4
.
.
(row names)
I would like to merge the columns with the same name, keeping the minimum value. At the end I would like to have a matrix like this:
df_min:
   A  B  ... (column names)
A  2  2
A  6  4
A  2  4
B  1  3
B  3  1
.
.
(row names)
My intention, afterwards (outside the scope of this question), is to merge the rows as well. Desired outcome:
df_min:
   A  B  ... (column names)
A  2  2
B  1  1
.
.
(row names)
I tried this:
df_min = df.groupby(df.columns, axis=1).agg(np.min)
But it didn't seem to work; it appeared to remove some rows (for example, row A disappeared entirely)... EDIT: Apparently it worked fine, but I had two columns whose names differed only by trailing whitespace. These methods also reorder the columns, which confused me.
Simply groupby on level=0 along each axis:
df.groupby(level=0, axis=1).min()
output:
A B
A 2 2
A 6 4
A 2 4
B 1 3
B 3 1
both axes:
df.groupby(level=0, axis=1).min().groupby(level=0).min()
output:
A B
A 2 2
B 1 1
Alternatively, use a single groupby through a stack/unstack:
df.stack().groupby(level=[0,1]).min().unstack()
output:
A B
A 2 2
B 1 1
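Note: groupby(..., axis=1) is deprecated in recent pandas versions. If that affects you, transposing first should be equivalent (a sketch, not from the original answer):

df.T.groupby(level=0).min().T  # group the transposed frame on its index level 0, then transpose back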
EDIT
A numpy-only solution
I'm assuming that you have a list associating names with column indices, e.g. for the first code sample you provided, something like
column_names = ['A', 'A', 'A', 'B', 'B']
and that your data type is single-precision floating point. In this scenario, you can do something like the following:
unique_column_names = list(dict.fromkeys(column_names))  # unique column names, preserving original order
df_min = np.empty((df.shape[0], len(unique_column_names)), dtype=np.float32)  # allocate output array
for i, column_name in enumerate(unique_column_names):  # iterate over unique column names
    column_indices = [idx for idx in range(df.shape[1]) if column_names[idx] == column_name]  # all columns sharing this name
    tmp = df[:, column_indices]  # extract the columns named column_name
    df_min[:, i] = np.amin(tmp, axis=1)  # take the min by row and save the result
Then, if you want to repeat the process by row, assuming you have another list named row_names associating row indices with names:
unique_row_names = list(dict.fromkeys(row_names))  # unique row names, preserving order
df_final = np.empty((len(unique_row_names), len(unique_column_names)), dtype=np.float32)  # allocate final output
for j, row_name in enumerate(unique_row_names):  # iterate over unique row names
    row_indices = [idx for idx in range(df.shape[0]) if row_names[idx] == row_name]  # all rows sharing this name
    tmp = df_min[row_indices, :]  # extract those rows from the column-reduced matrix
    df_final[j, :] = np.amin(tmp, axis=0)  # take the min by column and save the result
The column-name and row-name association lists for the final output are unique_column_names and unique_row_names.
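A quick check on the question's example matrix (a usage sketch; the values are copied from the question):

import numpy as np

df = np.array([[2, 4, 5, 9, 2],
               [6, 8, 7, 6, 4],
               [5, 2, 6, 4, 5],
               [3, 4, 1, 3, 4],
               [4, 5, 3, 1, 4]], dtype=np.float32)
column_names = ['A', 'A', 'A', 'B', 'B']
row_names = ['A', 'A', 'A', 'B', 'B']
# after running the two loops above, df_min matches the first desired output
# and df_final should equal [[2., 2.], [1., 1.]]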

Pandas: select column with most unique values

I have a pandas DataFrame and want to select the column with the most unique values.
I already filtered the unique values with nunique(). How can I now choose the column with the highest nunique()?
This is my code so far:
numeric_columns = df.select_dtypes(include = (int or float))
unique = []
for column in numeric_columns:
    unique.append(numeric_columns[column].nunique())
I later need to filter all the columns of my dataframe depending on this column (the one with the most unique values).
Use DataFrame.select_dtypes with np.number, then call DataFrame.nunique and pick the column with the maximal value via Series.idxmax:
df = pd.DataFrame({'a':[1,2,3,4],'b':[1,2,2,2], 'c':list('abcd')})
print (df)
a b c
0 1 1 a
1 2 2 b
2 3 2 c
3 4 2 d
numeric = df.select_dtypes(include = np.number)
nu = numeric.nunique().idxmax()
print (nu)
a
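If you also need the counts themselves, nunique() returns a Series that you can sort before picking the top column (a small sketch, not part of the original answer):

counts = numeric.nunique().sort_values(ascending=False)  # a: 4, b: 2
best = counts.index[0]                                   # same column as idxmax above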

Get only matching rows for groups in Pandas groupby

I have the following df:
d = {"Col1":['a','d','b','c','a','d','b','c'],
"Col2":['x','y','x','z','x','y','z','y'],
"Col3":['n','m','m','l','m','m','l','l'],
"Col4":[1,4,2,2,1,4,2,2]}
df = pd.DataFrame(d)
When I groupby on three fields, I get the result:
gb = df.groupby(['Col1', 'Col2', 'Col3'])['Col4'].agg(['sum', 'mean'])
How can I extract only the groups and rows where a row of one group matches at least one row of another group on the grouped columns Col2 and Col3? (The original post illustrated this with a picture, highlighting the rows to keep in red based on the matching blue and black rows; the picture is not reproduced here.)
Apologies if my statement is ambiguous. Any help would be appreciated.
You can reset_index, then use duplicated with boolean indexing to filter your dataframe:
gb = gb.reset_index()
gb[gb.duplicated(subset=['Col2','Col3'], keep=False)]
Output:
Col1 Col2 Col3 sum mean
0 a x m 1 1
2 b x m 2 2
3 b z l 2 2
5 c z l 2 2
Make a table with all allowed combinations and then inner join it with this dataframe.
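A minimal sketch of that merge-based idea, assuming "allowed combinations" means the (Col2, Col3) pairs that appear in more than one row of gb (the names here are illustrative):

allowed = (gb.groupby(['Col2', 'Col3']).size()  # count rows per (Col2, Col3) pair
             .loc[lambda s: s > 1]              # keep pairs shared by at least two rows
             .reset_index()[['Col2', 'Col3']])  # turn the index back into a join table
gb.merge(allowed, on=['Col2', 'Col3'], how='inner')  # inner join keeps only the matching rows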

Pivot Pandas Dataframe with Duplicates using Masking

A non-indexed df contains rows of gene, a cell that contains a mutation in that gene, and the type of mutation in that gene:
df = pd.DataFrame({'gene': ['one','one','one','two','two','two','three'],
                   'cell': ['A', 'A', 'C', 'A', 'B', 'C', 'A'],
                   'mutation': ['frameshift', 'missense', 'nonsense', '3UTR', '3UTR', '3UTR', '3UTR']})
df:
cell gene mutation
0 A one frameshift
1 A one missense
2 C one nonsense
3 A two 3UTR
4 B two 3UTR
5 C two 3UTR
6 A three 3UTR
I'd like to pivot this df so I can index by gene and set columns to cells. The trouble is that there can be multiple entries per cell: there can be multiple mutations in any one gene in a given cell (cell A has two different mutations in gene one). So when I run:
df.pivot_table(index='gene', columns='cell', values='mutation')
this happens:
DataError: No numeric types to aggregate
I'd like to use masking to perform the pivot while capturing the presence of at least one mutation:
       A  B  C
gene
one    1  0  1
two    1  1  1
three  1  0  0
Solution with drop_duplicates and pivot_table:
df = (df.drop_duplicates(['cell','gene'])
        .pivot_table(index='gene',
                     columns='cell',
                     values='mutation',
                     aggfunc=len,
                     fill_value=0))
print (df)
cell A B C
gene
one 1 0 1
three 1 0 0
two 1 1 1
Another solution with drop_duplicates, groupby with size as the aggregation, and a final reshape by unstack:
df = (df.drop_duplicates(['cell','gene'])
        .groupby(['cell', 'gene'])
        .size()
        .unstack(0, fill_value=0))
print (df)
cell A B C
gene
one 1 0 1
three 1 0 0
two 1 1 1
The error message is misleading here. pivot_table does allow multiple values per index/column pair (I don't believe this is true for the pivot method); the real problem is the aggregation function. Most aggregation functions operate only on numeric columns, so the code above raises an error about the column's data type, not about duplicates in the index. You can fix it by switching to an aggregation that works on strings:
df.pivot_table(index='gene',
               columns='cell',
               values='mutation',
               aggfunc='count',
               fill_value=0)
If you only want 1 value per cell you can do a groupby and aggregate everything to 1 and then unstack a level.
df.groupby(['cell', 'gene']).agg(lambda x: 1).unstack(fill_value=0)
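Since the goal is a 0/1 presence matrix, pd.crosstab on the original (un-deduplicated) frame, clipped at 1, is another compact option (a sketch, not from the original answers):

presence = pd.crosstab(df['gene'], df['cell']).clip(upper=1)  # mutation counts per gene/cell, capped at 1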

How to compare column values of pandas groupby object and summarize them in a new column row

I have the following problem: I want to create a column in a dataframe summarizing all values in a row. Then I want to compare the rows of that column to create a single row containing all the values, so that each value is present only a single time. As an example, I have the following data frame:
df1:
Column1 Column2
0 a 1,2,3
1 a 1,4,5
2 b 7,1,5
3 c 8,9
4 b 7,3,5
the desired output would now be:
df1_new:
Column1 Column2
0 a 1,2,3,4,5
1 b 1,3,5,7
2 c 8,9
What I am currently trying is result = df1.groupby('Column1'), but then I don't know how to compare the values in the rows of the grouped object, write them to the new column, and remove the duplicates. I read through the pandas documentation on Group By: split-apply-combine, but could not figure out a way to do it. I also wonder if, once I have my desired output, there is a way to check in how many of the lines of the grouped object each value in Column2 of df1_new appeared. Any help on this would be greatly appreciated!
One way to do this is to apply a function on the grouped DataFrame. The function first converts the series for each group to a list, splits each string in that list on ',', chains everything into a single flat list with itertools.chain.from_iterable, converts it to a set so that only unique values are left, sorts it, and joins it back into a string with str.join. Example -
from itertools import chain
def applyfunc(x):
    ch = chain.from_iterable(y.split(',') for y in x.tolist())
    return ','.join(sorted(set(ch)))
df1_new = df1.groupby('Column1')['Column2'].apply(applyfunc).reset_index()
Demo -
In [46]: df
Out[46]:
Column1 Column2
0 a 1,2,3
1 a 1,4,5
2 b 7,1,5
3 c 8,9
4 b 7,3,5
In [47]: from itertools import chain
In [48]: def applyfunc(x):
   ....:     ch = chain.from_iterable(y.split(',') for y in x.tolist())
   ....:     return ','.join(sorted(set(ch)))
   ....:
In [49]: df.groupby('Column1')['Column2'].apply(applyfunc).reset_index()
Out[49]:
Column1 Column2
0 a 1,2,3,4,5
1 b 1,3,5,7
2 c 8,9
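On the follow-up question of how many lines of each group a value appears in, one option is to count with collections.Counter while iterating the groups (a sketch, not from the original answer):

from collections import Counter

counts = {key: Counter(chain.from_iterable(s.split(',') for s in grp))
          for key, grp in df1.groupby('Column1')['Column2']}
# counts['a'] -> Counter({'1': 2, '2': 1, '3': 1, '4': 1, '5': 1})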
What about this:
df1
Column1 Column2
0 a 1,2,3
1 a 1,4,5
2 b 7,1,5
3 c 8,9
4 b 7,3,5
import numpy as np

(df1.groupby('Column1')
    .agg(lambda x: ','.join(x).split(','))['Column2']
    .apply(lambda x: ','.join(np.unique(x)))
    .reset_index())
Column1 Column2
0 a 1,2,3,4,5
1 b 1,3,5,7
2 c 8,9
