Producing every combination of columns from one pandas dataframe in Python

I'd like to take a dataframe and visualize how useful each column is in a k-neighbors analysis, so I was wondering if there is a way to loop through dropping columns and analyzing the dataframe in order to produce an accuracy for every single combination of columns. I'm not sure whether there are pandas functions I'm unaware of that could make this easier, or how to loop through the dataframe to produce every combination of the original dataframe. In case I have not explained it well, here is a diagram.
| a | b | c | labels |
|---|---|---|--------|
| 1 | 2 | 3 | 0      |
| 5 | 6 | 7 | 1      |
The dataframe above would produce something like this after being run through the splitting and k-neighbors function:
a & b = 43%
a & c = 56%
b & c = 78%
a & b & c = 95%

import itertools

min_size = 2
max_size = df.shape[1]

column_subsets = itertools.chain(
    *map(lambda x: itertools.combinations(df.columns, x), range(min_size, max_size + 1))
)

for column_subset in column_subsets:
    foo(df[list(column_subset)])
where df is your dataframe and foo is whatever k-neighbors analysis you're doing. Although you said "all combinations", I set min_size to 2 since your example only uses subsets of size >= 2. And these are more precisely referred to as "subsets" rather than "combinations".
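For reference, here is a minimal sketch of what foo could look like with scikit-learn, scoring each column subset with a default k-neighbors classifier. The knn_accuracy helper, the 'labels' column name, and the split parameters are all illustrative assumptions, not something from the question:

import itertools
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def knn_accuracy(features, labels):
    # One train/test split per subset, scored with a default KNeighborsClassifier
    X_train, X_test, y_train, y_test = train_test_split(
        features, labels, test_size=0.25, random_state=0)
    return KNeighborsClassifier().fit(X_train, y_train).score(X_test, y_test)

feature_cols = [c for c in df.columns if c != 'labels']  # assumes the target lives in a 'labels' column

results = {}
for size in range(2, len(feature_cols) + 1):
    for column_subset in itertools.combinations(feature_cols, size):
        results[column_subset] = knn_accuracy(df[list(column_subset)], df['labels'])

for subset, acc in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    print(' & '.join(subset), f'= {acc:.0%}')

With the two-row example above a single split is degenerate, so in practice you would run this on the full dataset and probably cross-validate rather than rely on one train/test split.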

Related

How to check all the columns of a dataframe are the same?

What is a quick way to check if all the columns in a pandas dataframe are the same?
E.g. I have a dataframe with the columns a,b,c below, and I need to check that the columns are all the same, i.e. that a = b = c
+---+---+---+
| a | b | c |
+---+---+---+
| 5 | 5 | 5 |
| 7 | 7 | 7 |
| 9 | 9 | 9 |
+---+---+---+
I had thought of using apply to iterate over all the rows, but I am afraid it might be inefficient as it would be a non-vectorised loop.
I suspect looping over the columns would be quicker because I always have fewer columns than rows (a few dozen columns but hundreds of thousands of rows).
I have come up with the contraption below. I need to tidy it up and make it into a function, but it works; the question is whether there is a more elegant / faster way of doing it.
np.where returns 0 when the items are the same and 1 when they differ (not the other way round), so summing the output gives me the number of mismatches.
I iterate over all the columns (excluding the first), comparing them to the first.
The first output counts the matches/mismatches by column, the second by row.
If you add something like
df.iloc[3,2] = 100
after defining df, the output tells you that row 3 of column c doesn't match:
import numpy as np
import pandas as pd

df = pd.DataFrame()
x = np.arange(0, 20)
df['a'] = x
df['b'] = x
df['c'] = x
df['d'] = x
#df.iloc[3, 2] = 100

cols = df.columns
out = pd.DataFrame()
# Compare every column (except the first) against the first column:
# 0 where the values match, 1 where they don't
for c in np.arange(1, len(cols)):
    out[cols[c]] = np.where(df[cols[0]] == df[cols[c]], 0, 1)

print(out.sum(axis=0))  # mismatches per column
print(out.sum(axis=1))  # mismatches per row
Let's try duplicated:
(~df.T.duplicated()).sum()==1
# True
You can use df.T.drop_duplicates() and check that the length is 1. This would mean that all columns are the same:
In [1254]: len(df.T.drop_duplicates()) == 1
Out[1254]: True
Use DataFrame.duplicated + DataFrame.all:
df.T.duplicated(keep=False).all()
#True
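For completeness, here is one more fully vectorised check (my own addition, not from the answers above) that avoids the transpose entirely by comparing every column to the first one element-wise:

import pandas as pd

df = pd.DataFrame({'a': [5, 7, 9], 'b': [5, 7, 9], 'c': [5, 7, 9]})

# True only if every value in every column equals the value in the first column on the same row
all_columns_equal = df.eq(df.iloc[:, 0], axis=0).all().all()
print(all_columns_equal)  # True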

Pandas: Filter DataFrameGroupBy (df.groupby) based on group aggregates

df
| a | b |
|----|---|
| 10 | 1 |
| 10 | 5 |
| 11 | 1 |
Grouping it straightforwardly using
grouped = df.groupby('a')
Let's get only the groups where
selector = grouped.b.max() - grouped.b.min() >= 3
yields
selector
|  a |       |
|----|-------|
| 10 | True  |
| 11 | False |
My question is: what is the equivalent of df = df.loc[<filter condition>] when working with DataFrameGroupBy objects?
grouped.filter(..) returns a DataFrame.
Is there a way to preserve the groups while filtering based on .aggregate() functions? Thanks!
You can use np.ptp (peak-to-peak)
df.groupby('a').b.agg(np.ptp) > 3
a
10 True
11 False
Name: b, dtype: bool
For the df.loc[] equivalent question, you can just do:
df = df.set_index('a') \
       .loc[df.groupby('a').b.agg(np.ptp).gt(3)] \
       .reset_index()
Alternatively (inner join solution):
selector = df.groupby('a').b.agg(np.ptp).gt(3)
selector = selector.loc[selector]
df = df.merge(selector, on='a', suffixes=["", "_dropme"])
df = df.loc[:, [col for col in df.columns if "_dropme" not in col]]
Outputs:
a b
0 10 1
1 10 5
PS: +1 @rafaelc for the .ptp trick.
Sadly I did not come across a direct solution, so I worked around it like this, using two groupby calls:
# Build True/False Series for filter criteria
selector = df.groupby('a').b.agg(np.ptp) > 3
# Only select those 'a' which have True in filter criteria
selector = selector.loc[selector == True]
# Re-Create groups of 'a' with the filter criteria in place
# Only those groups for 'a' will be created, where the MAX-MIN of 'b' are > 3.
grouped = df.loc[df['a'].isin(selector.index)].groupby('a')
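As a sketch of two other common ways to express this (my addition, assuming the same df with columns a and b): DataFrameGroupBy.filter followed by a re-group, or a transform-based boolean mask that can be used exactly like df.loc[<filter condition>]:

import pandas as pd

df = pd.DataFrame({'a': [10, 10, 11], 'b': [1, 5, 1]})

# Option 1: drop the failing groups, then group the surviving rows again
filtered = df.groupby('a').filter(lambda g: g['b'].max() - g['b'].min() > 3)
grouped = filtered.groupby('a')

# Option 2: a row-aligned mask, usable like df.loc[<filter condition>]
mask = df.groupby('a')['b'].transform(lambda s: s.max() - s.min()) > 3
grouped = df.loc[mask].groupby('a')

print(list(grouped.groups))  # [10]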

How to select a dataframe column based on a list?

I want to apply a condition based on each unique value of a dataframe column using Python.
I tried to put the unique values in a list and iterate over them to handle each value:
f = (df['Technical family'].unique())
for i in f:
    df_2 = df[(df['Technical family'] = f[i])]
    S = pd.crosstab(df_2['PFG | ID'], df_2['Comp. | Family'])
but apparently the df_2 = df[(df['Technical family'] = f[i])] line doesn't work!
Does anyone have an idea how to do it?
You need to use == instead of =: == is for comparing values, while = is for assigning values. (Also note that for i in f iterates over the values themselves, so compare against i rather than indexing with f[i].)
For example:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Technical family': np.random.choice(['1', '2', '3'], 100),
                   'PFG | ID': np.random.choice(['A', 'B', 'C'], 100),
                   'Comp. | Family': np.random.choice(['a', 'b', 'c'], 100)})

f = df['Technical family'].unique()
df_2 = df[df['Technical family'] == f[0]]
pd.crosstab(df_2['PFG | ID'], df_2['Comp. | Family'])
Comp. | Family  a  b  c
PFG | ID
A               5  5  3
B               3  5  3
C               4  3  4
Also as a further suggestion, you can directly crosstab:
res = pd.crosstab([df['Technical family'],df['PFG | ID']],df['Comp. | Family'])
res.loc['1']
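And if you really do want one crosstab per unique value, as in the original loop, here is a sketch with the comparison fixed; the tables dict is just an illustrative way to keep the results, using the df from the example above:

tables = {}
for i in df['Technical family'].unique():
    df_2 = df[df['Technical family'] == i]
    tables[i] = pd.crosstab(df_2['PFG | ID'], df_2['Comp. | Family'])

print(tables['1'])  # the crosstab for Technical family '1'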

Merging pandas dataframes based on nearest value(s)

I have two dataframes, say A and B, that have some columns named attr1, attr2, attrN.
I have a certain distance measure, and I would like to merge the dataframes, such that each row in A is merged with the row in B that has the shortest distance between attributes. Note that rows in B can be repeated when merging.
For example (with one attribute to keep things simple), merging these two tables using the absolute difference distance |A.attr1 - B.attr1|
A | attr1        B | attr1
0 | 10           0 | 15
1 | 20           1 | 27
2 | 30           2 | 80
should yield the following merged table
M | attr1_A | attr1_B
0 | 10      | 15
1 | 20      | 15
2 | 30      | 27
My current way of doing this is slow: it compares each row of A with each row of B, and the code is not clear either because I have to preserve indices for merging. I am not satisfied with it at all, but I cannot come up with a better solution.
How can I perform the merge as above using pandas? Are there any convenience methods or functions that can be helpful here?
EDIT: Just to clarify, in the dataframes there are also other columns which are not used in the distance calculation, but have to be merged as well.
One way you could do it is as follows:
A = pd.DataFrame({'attr1':[10,20,30]})
B = pd.DataFrame({'attr1':[15,15,27]})
Use a cross join to get all combinations
Update: for pandas 1.2+ use how='cross':
merged_AB = A.merge(B, how='cross', suffixes=('_A', '_B'))
For older pandas versions, use a pseudo key:
A = A.assign(key=1)
B = B.assign(key=1)
merged_AB = pd.merge(A, B, on='key', suffixes=('_A', '_B'))
Now let's find the min distances in merged_AB
M = merged_AB.groupby('attr1_A').apply(lambda x: abs(x['attr1_A'] - x['attr1_B']) == abs(x['attr1_A'] - x['attr1_B']).min())
merged_AB[M.values].drop_duplicates().drop('key', axis=1, errors='ignore')  # the 'key' column only exists in the pseudo-key version
Output:
   attr1_A  attr1_B
0       10       15
3       20       15
8       30       27
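For the single-attribute case there is also pd.merge_asof with direction='nearest', which does a nearest-key merge directly. A sketch (my addition, not part of the answer above); both sides must be sorted on the key, and this only generalizes to one sort key, so a multi-attribute distance still needs the cross-join approach:

import pandas as pd

A = pd.DataFrame({'attr1': [10, 20, 30]})
B = pd.DataFrame({'attr1': [15, 15, 27]})

# Rename the right-hand key so both values survive in the output, then sort both sides on the key
B_renamed = B.rename(columns={'attr1': 'attr1_B'}).sort_values('attr1_B')
nearest = pd.merge_asof(A.sort_values('attr1'), B_renamed,
                        left_on='attr1', right_on='attr1_B',
                        direction='nearest')
print(nearest)
#    attr1  attr1_B
# 0     10       15
# 1     20       15
# 2     30       27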

Pandas transform aggregated columns with factors to rows

I have a data frame like this:
      a1xbxc  a2xbxc
show       1       2
where a, b, c are attributes that can have different values, e.g. a1, a2 for a, and so on. The way I have this df now is not good for plotting bar charts. I want to have it in a normal (tidy) form like this:
show | factor a | factor b | value
     | a1       | b        | 1
     | a2       | b        | 2
How would I go about achieving this? I know I should somehow split each header by "x", figure out which factor each part belongs to, and write it into a new row, but it seems like there must be some easy way to do this in pandas.
Any ideas?
Try this:
df.columns = pd.MultiIndex.from_tuples(df.columns.str.split('x').to_series().apply(tuple))
df.stack([0, 1])
           c
show a1 b  1
     a2 b  2
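An alternative sketch (my addition) that produces exactly the flat factor a / factor b / value layout asked for, using melt plus str.split; the column names are illustrative:

import pandas as pd

df = pd.DataFrame({'a1xbxc': [1], 'a2xbxc': [2]}, index=['show'])

# Reshape to one row per original column, then split the headers on 'x'
long = df.melt(var_name='header', value_name='value')
parts = long['header'].str.split('x', expand=True)
parts.columns = ['factor a', 'factor b', 'factor c']
tidy = pd.concat([parts, long['value']], axis=1)
print(tidy)
#   factor a factor b factor c  value
# 0       a1        b        c      1
# 1       a2        b        c      2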
